Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
5358
George Bebis, Richard Boyle, Bahram Parvin, Darko Koracin, Paolo Remagnino, Fatih Porikli, Jörg Peters, James Klosowski, Laura Arns, Yu Ka Chun, Theresa-Marie Rhyne, Laura Monroe (Eds.)

Advances in Visual Computing
4th International Symposium, ISVC 2008
Las Vegas, NV, USA, December 1-3, 2008
Proceedings, Part I
Volume Editors

George Bebis, E-mail: [email protected]
Richard Boyle, E-mail: [email protected]
Bahram Parvin, E-mail: [email protected]
Darko Koracin, E-mail: [email protected]
Paolo Remagnino, E-mail: [email protected]
Fatih Porikli, E-mail: [email protected]
Jörg Peters, E-mail: [email protected]fl.edu
James Klosowski, E-mail: [email protected]
Laura Arns, E-mail: [email protected]
Yu Ka Chun, E-mail: [email protected]
Theresa-Marie Rhyne, E-mail: [email protected]
Laura Monroe, E-mail: [email protected]
Library of Congress Control Number: 2008939872

CR Subject Classification (1998): I.4, I.5, I.2.10, I.3.3, I.3.5, I.3.7, I.2.6, F.2.2

LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics

ISSN 0302-9743
ISBN-10 3-540-89638-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-89638-8 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com

© Springer-Verlag Berlin Heidelberg 2008
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12568406 06/3180 543210
Preface
It is with great pleasure that we present the proceedings of the 4th International Symposium on Visual Computing (ISVC 2008), held in Las Vegas, Nevada. ISVC offers a common umbrella for the four main areas of visual computing: vision, graphics, visualization, and virtual reality. Its goal is to provide a forum for researchers, scientists, engineers, and practitioners throughout the world to present their latest research findings, ideas, developments, and applications in the broader area of visual computing.

This year, ISVC grew significantly; the program consisted of 15 oral sessions, 1 poster session, 8 special tracks, and 6 keynote presentations. The response to the call for papers was very strong; we received over 340 submissions for the main symposium, from which we accepted 102 papers for oral presentation and 70 papers for poster presentation. Special track papers were solicited separately through the Organizing and Program Committees of each track. A total of 56 papers were accepted for oral presentation and 8 papers for poster presentation in the special tracks.

All papers were reviewed with an emphasis on their potential to contribute to the state of the art in the field. Selection criteria included accuracy and originality of ideas, clarity and significance of results, and presentation quality. The review process was quite rigorous, involving two to three independent blind reviews followed by several days of discussion. During the discussion period we tried to correct anomalies and errors that might have existed in the initial reviews. Despite our efforts, we recognize that some papers worthy of inclusion may not have been included in the program. We offer our sincere apologies to authors whose contributions might have been overlooked.

We wish to thank everybody who submitted their work to ISVC 2008 for review. It was because of their contributions that we succeeded in having a technical program of high scientific quality. In particular, we would like to thank the ISVC 2008 Area Chairs, the organizing institutions (UNR, DRI, LBNL, and NASA Ames), the industrial sponsors (Intel, DigitalPersona, Equinox, Ford, Siemens, Hewlett Packard, MERL, UtopiaCompression, iCore), the international Program Committee, the special track organizers and their Program Committees, the keynote speakers, the reviewers, and especially the authors who contributed their work to the symposium. Finally, we would like to thank Siemens, MERL, and iCore, who kindly offered three "best paper awards" this year.

September 2008
ISVC 2008 Steering Committee and Area Chairs
Organization
ISVC 2008 Steering Committee
Bebis George, University of Nevada, Reno, USA
Boyle Richard, NASA Ames Research Center, USA
Parvin Bahram, Lawrence Berkeley National Laboratory, USA
Koracin Darko, Desert Research Institute, USA

ISVC 2008 Area Chairs

Computer Vision
Remagnino Paolo, Kingston University, UK
Porikli Fatih, Mitsubishi Electric Research Labs, USA

Computer Graphics
Peters Jörg, University of Florida, USA
Klosowski James, IBM, USA

Virtual Reality
Arns Laura, Purdue University, USA
Yu Ka Chun, Denver Museum of Nature and Science, USA

Visualization
Rhyne Theresa-Marie, North Carolina State University, USA
Monroe Laura, Los Alamos National Labs, USA

Publicity
Li Wenjing, STI Medical Systems, USA

Local Arrangements
Veropoulos Kostas, Desert Research Institute, USA

Publications
Wang Junxian, UtopiaCompression, USA
ISVC 2008 Keynote Speakers
Pavlidis Ioannis, University of Houston, USA
Medioni Gerard, University of Southern California, USA
Gaither Kelly, University of Texas at Austin, USA
Aggarwal J.K., University of Texas at Austin, USA
Kaufman Arie, Stony Brook University (SUNY), USA
Grimson Eric, Massachusetts Institute of Technology, USA
ISVC 2008 International Program Committee (Area 1) Computer Vision Abidi Besma Aggarwal J. K. Agouris Peggy Anagnostopoulos George Argyros Antonis Asari Vijayan Basu Anup Bebis George Belyaev Alexander Bhatia Sanjiv Bioucas Jose Birchfield Stan Goh Wooi-Boon Bourbakis Nikolaos Brimkov Valentin Cavallaro Andrea Chellappa Rama Cheng Hui Chung, Chi-Kit Ronald Darbon Jerome Davis James Debrunner Christian Duan Ye El-Gammal Ahmed Eng How Lung Erol Ali Fan Guoliang Ferri Francesc Fisher Robert Foresti GianLuca Gandhi Tarak Georgescu Bogdan
University of Tennessee, USA University of Texas, Austin, USA George Mason University, USA Florida Institute of Technology, USA University of Crete, Greece Old Dominion University, USA University of Alberta, Canada University of Nevada at Reno, USA Max-Planck-Institut fuer Informatik, Germany University of Missouri-St. Louis, USA Instituto Superior Tecnico, Lisbon, Portugal Clemson University, USA Nanyang Technological University, Singapore Wright State University, USA State University of New York, USA Queen Mary, University of London, UK University of Maryland, USA Sarnoff Corporation, USA The Chinese University of Hong Kong, Hong Kong UCLA, USA Ohio State University, USA Colorado School of Mines, USA University of Missouri-Columbia, USA University of New Jersey, USA Institute for Infocomm Research, Singapore Ocali Information Technology, Turkey Oklahoma State University, USA Universitat de Valencia, Spain University of Edinburgh, UK University of Udine, Italy University of California at San Diego, USA Siemens, USA
Gleason, Shaun Guerra-Filho Gutemberg Hammoud Riad Harville Michael He Xiangjian Heikkilä Janne Heyden Anders Hou Zujun Kamberov George Kamberova Gerda Kakadiaris Ioannis Kettebekov Sanzhar Kimia Benjamin Kisacanin Branislav Klette Reinhard Kollias Stefanos Komodakis Nikos Kuno Yoshinori Lee D.J. Lee Seong-Whan Leung Valerie Li Wenjing Liu Jianzhuang Little Jim Ma Yunqian Maeder Anthony Maltoni Davide Maybank Steve McGraw Tim Medioni Gerard Metaxas Dimitris Miller Ron Mirmehdi Majid Monekosso Dorothy Mueller Klaus Mulligan Jeff Nachtegael Mike Nait-Charif Hammadi Nefian Ara Nicolescu Mircea Nixon Mark Nolle Lars Ntalianis Klimis
Oak Ridge National Laboratory, USA University of Texas Arlington, USA Delphi Corporation, USA Hewlett Packard Labs, USA University of Technology, Australia University of Oulu, Finland Malmo University, Sweden Institute for Infocomm Research, Singapore Stevens Institute of Technology, USA Hofstra University, USA University of Houston, USA Keane inc., USA Brown University, USA Texas Instruments, USA Auckland University, New Zealand National Technical University of Athens, Greece Ecole Centrale de Paris, France Saitama University, Japan Brigham Young University, USA Korea University, Korea Kingston University, UK STI Medical Systems, USA The Chinese University of Hong Kong, Hong Kong University of British Columbia, Canada Honeywell Labs, USA CSIRO ICT Centre, Australia University of Bologna, Italy Birkbeck College, UK West Virginia University, USA University of Southern California, USA Rutgers University, USA Ford Motor Company, USA Bristol University, UK Kingston University, UK SUNY Stony Brook, USA NASA Ames Research Center, USA Ghent University, Belgium Bournemouth University, UK Nokia, USA University of Nevada, Reno, USA University of Southampton, UK The Nottingham Trent University, UK National Technical University of Athens, Greece
Pantic Maja Papadourakis George Papanikolopoulos Nikolaos Parvin Bahram Pati Peeta Basa Patras Ioannis Petrakis Euripides Peyronnet Sylvain Piccardi Massimo Pietikäinen Matti Pitas Ioannis Prabhakar Salil Prati Andrea Qian Gang Raftopoulos Kostas Reed Michael Regazzoni Carlo Ribeiro Eraldo Robles-Kelly Antonio Ross Arun Samal Ashok Schaefer Gerald Shi Pengcheng Salgian Andrea Samir Tamer Sarkar Sudeep Sarti Augusto Scalzo Fabien Shah Mubarak Singh Rahul Skurikhin Alexei Su Chung-Yen Sugihara Kokichi Sun Zehang Tan Kar Han Tan Tieniu Tavares, Joao Teoh Eam Khwang Thiran Jean-Philippe Trucco Emanuele Tsechpenakis Gabriel
Imperial College, UK Technological Education Institute, Greece University of Minnesota, USA Lawrence Berkeley National Lab, USA First Indian Corp., India Queen Mary University, London, UK Technical University of Crete, Greece LRDE/EPITA, France University of Technology, Australia LRDE/University of Oulu, Finland University of Thessaloniki, Greece DigitalPersona Inc., USA University of Modena, Italy Arizona State University, USA National Technical University of Athens, Greece Columbia University, USA University of Genoa, Italy Florida Institute of Technology, USA National ICT Australia (NICTA), Australia West Virginia University, USA University of Nebraska, USA Aston University, UK The Hong Kong University of Science and Technology, Hong Kong The College of New Jersey, USA Ingersoll Rand Security Technologies, USA University of South Florida, USA DEI, Politecnico di Milano, Italy University of Nevada, Reno, USA University of Central Florida, USA San Francisco State University, USA Los Alamos National Laboratory, USA National Taiwan Normal University, Taiwan University of Tokyo, Japan eTreppid Technologies, USA Hewlett Packard, USA Chinese Academy of Sciences, China Universidade do Porto, Portugal Nanyang Technological University, Singapore Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland University of Dundee, UK University of Miami, USA
Tubaro Stefano Uhl Andreas Velastin Sergio Veropoulos Kostas Verri Alessandro Wang Song Wang Junxian Wang Yunhong Webster Michael Wolff Larry Wong Kenneth Xiang Tao Xu Meihe Yeasin Mohammed Yi Lijun Yu Ting Yuan Chunrong Zhang Yan Zhang Yongmian
DEI, Politecnico di Milano, Italy Salzburg University, Austria Kingston University London, UK Desert Research Institute, USA Università di Genova, Italy University of South Carolina, USA UtopiaCompression, USA Beihang University, China University of Nevada, Reno, USA Equinox Corporation, USA The University of Hong Kong, Hong Kong Queen Mary, University of London, UK University of California at Los Angeles, USA Memphis University, USA SUNY at Binghamton, USA GE Global Research, USA University of Tuebingen, Germany Delphi Corporation, USA eTreppid Technologies, USA
(Area 2) Computer Graphics Abram Greg Andres Eric Baciu George Barneva Reneta Bartoli Vilanova Anna Belyaev Alexander Benes Bedrich Bilalis Nicholas Bohez Erik Bouatouch Kadi Brimkov Valentin Brown Ross Callahan Steven Chen Min Cheng Irene Chiang Yi-Jen Choi Min Comba Joao Coming Daniel Cremer Jim
IBM T.J. Watson Research Center, USA Laboratory XLIM-SIC, University of Poitiers, France Hong Kong PolyU, Hong Kong State University of New York, USA Eindhoven University of Technology, The Netherlands Max-Planck-Institut fuer Informatik, Germany Purdue University, USA Technical University of Crete, Greece Asian Institute of Technology, Thailand University of Rennes I, IRISA, France State University of New York, USA Queensland University of Technology, Australia University of Utah, USA University of Wales Swansea, UK University of Alberta, Canada Polytechnic University, USA University of Colorado at Denver, USA Univ. Fed. do Rio Grande do Sul, Brazil Desert Research Institute, USA University of Iowa, USA
Crosa Pere Brunet Debled-Rennesson Isabelle Damiand Guillaume Deng Zhigang Dingliana John El-Sana Jihad Entezari Alireza Fiorio Christophe Floriani Leila De Gaither Kelly Geiger Christian
Universitat Politecnica de Catalunya, Spain
University of Nancy I, France SIC Laboratory, France University of Houston, USA Trinity College, Ireland Ben Gurion University of The Negev, Israel University of Florida, USA LIRMM, France University of Maryland, USA University of Texas at Austin, USA Duesseldorf University of Applied Sciences, Germany Gotz David IBM, USA Gu David State University of New York at Stony Brook, USA Guerra-Filho Gutemberg University of Texas Arlington, USA Hadwiger Helmut Markus VRVis Research Center, Austria Haller Michael Upper Austria University of Applied Sciences, Austria Hamza-Lup Felix Armstrong Atlantic State University, USA Han JungHyun Korea University, Korea Hao Xuejun Columbia University and NYSPI, USA Hernandez Jose Tiberio Universidad de los Andes, Colombia Hinkenjann Andre Bonn Rhein Sieg University of Applied Sciences, Germany Huang Zhiyong Institute for Infocomm Research, Singapore Ju Tao Washington University, USA Julier Simon J. University College London, UK Kakadiaris Ioannis University of Houston, USA Kamberov George Stevens Institute of Technology, USA Kim Young Ewha Womans University, Korea Kobbelt Leif RWTH Aachen, Germany Lai Shuhua Virginia State University, USA Lakshmanan Geetika IBM T.J. Watson Reseach Center, USA Lee Chang Ha Chung-Ang University, Korea Lee Seungyong Pohang University of Science and Technology (POSTECH), Korea Lee Tong-Yee National Cheng-Kung University, Taiwan Levine Martin McGill University, Canada Lindstrom Peter Lawrence Livermore National Laboratory, USA Linsen Lars Jacobs University, Germany Liu Zicheng Microsoft, USA Lok Benjamin University of Florida, USA
Loviscach Jorn Martin Ralph McGraw Tim Meenakshisundaram Gopi Mendoza Cesar Metaxas Dimitris Moorhead Robert Myles Ashish Nait-Charif Hammadi Noma Tsukasa Oliveira Manuel M. Pajarola Renato Palanque Philippe Pascucci Valerio Pattanaik Sumanta Qin Hong Reed Michael Reif Ulrich Renner Gabor Sapidis Nickolas Sarfraz Muhammad Schaefer Scott Sequin Carlo Shamir Arik Silva Claudio Snoeyink Jack Sourin Alexei Tan Kar Han Teschner Matthias Umlauf Georg Vinacua Alvar Wald Ingo Wylie Brian Ye Duan Yi Beifang Yin Lijun Yoo Terry Yuan Xiaoruv Zhang Eugene
University of Applied Sciences, Bremen, Germany Cardiff University, UK West Virginia University, USA University of California-Irvine, USA NaturalMotion Ltd., USA Rutgers University, USA Mississippi State University, USA University of Florida, USA University of Dundee, UK Kyushu Institute of Technology, Japan Univ. Fed. do Rio Grande do Sul, Brazil University of Zurich, Switzerland University of Paul Sabatier, France Lawrence Livermore National Laboratory, USA University of Central Florida, USA State University of New York at Stony Brook, USA Columbia University, USA Darmstadt University of Technology, Germany Computer and Automation Research Institute, Hungary University of the Aegean, Greece Kuwait University, Kuwait Texas A&M University, USA University of California-Berkeley, USA The Interdisciplinary Center, Herzliya, Israel University of Utah, USA University of North Carolina at Chapel Hill, USA Nanyang Technological University, Singapore Hewlett Packard, USA University of Freiburg, Germany University of Kaiserslautern, Germany Universitat Politecnica de Catalunya, Spain University of Utah, USA Sandia National Laboratory, USA University of Missouri-Columbia, USA Salem State College, USA Binghamton University, USA National Institutes of Health, USA Peking University, China Oregon State University, USA
(Area 3) Virtual Reality Alcañiz Mariano Behringer Reinhold Benes Bedrich Bilalis Nicholas Blach Roland
Technical University of Valencia, Spain Leeds Metropolitan University, UK Purdue University, USA Technical University of Crete, Greece Fraunhofer Institute for Industrial Engineering, Germany Blom Kristopher University of Hamburg, Germany Boyle Richard NASA Ames Research Center, USA Brady Rachael Duke University, USA Brega Jos Remo Ferreira Universidade Estadual Paulista, Brazil Brown Ross Queensland University of Technology, Australia Chen Jian Brown University, USA Cheng Irene University of Alberta, Canada Coming Daniel Desert Research Institute, USA Coquillart Sabine INRIA, France Craig Alan NCSA University of Illinois at Urbana-Champaign, USA Crawfis Roger Ohio State University, USA Cremer Jim University of Iowa, USA Crosa Pere Brunet Universitat Politecnica de Catalunya, Spain Encarnacao L. Miguel Imedia Labs, USA Dachselt Raimund Otto-von-Guericke-Universität Magdeburg, Germany Figueroa Pablo Universidad de los Andes, Colombia Friedman Doron IDC, Israel Geiger Christian Duesseldorf University of Applied Sciences, Germany Gregory Michelle Pacific Northwest National Lab, USA Gupta Satyandra K. University of Maryland, USA Haller Michael FH Hagenberg, Austria Hamza-Lup Felix Armstrong Atlantic State University, USA Harders Matthias ETH Zürich, Switzerland Hinkenjann Andre Bonn-Rhein-Sieg University of Applied Sciences, Germany Hollerer Tobias University of California at Santa Barbara, USA Julier Simon J. University College London, UK Klinger Evelyne Arts et Metiers ParisTech, France Klinker Gudrun Technische Universität München, Germany Klosowski James IBM T.J. Watson Research Center, USA Kuhlen Torsten RWTH Aachen University, Germany Liere Robert van CWI, The Netherlands Lindt Irma Fraunhofer FIT, Germany
Lok Benjamin Luo Gang Malzbender Tom Molineros Jose Moorhead Robert Muller Stefan Paelke Volker Papka Michael Peli Eli Pugmire Dave Qian Gang Raffin Bruno Reiners Dirk Richir Simon Rodello Ildeberto Rolland Jannick Santhanam Anand Sapidis Nickolas Schmalstieg Dieter Slavik Pavel Sourin Alexei Srikanth Manohar Stefani Oliver Varsamidis Thomas Wald Ingo Yuan Chunrong Zachmann Gabriel Zara Jiri Zyda Michael
University of Florida, USA Harvard Medical School, USA Hewlett Packard Labs, USA Teledyne Scientific and Imaging, USA Mississippi State University, USA University of Koblenz, Germany Leibniz Universität Hannover, Germany Argonne National Laboratory, USA Harvard University, USA Los Alamos National Lab, USA Arizona State University, USA INRIA, France University of Louisiana, USA Arts et Metiers ParisTech, France UNIVEM, PPGCC, Brazil University of Central Florida, USA MD Anderson Cancer Center Orlando, USA University of the Aegean, Greece Graz University of Technology, Austria Czech Technical University in Prague, Czech Republic Nanyang Technological University, Singapore Indian Institute of Science, India COAT-Basel, Switzerland Bangor University, UK University of Utah, USA University of Tuebingen, Germany Clausthal University, Germany Czech Technical University in Prague, Czech Republic University of Southern California, USA
(Area 4) Visualization Andrienko Gennady Apperley Mark Avila Lisa Balázs Csébfalvi Bartoli Anna Vilanova Brady Rachael Brandes Ulrik Benes Bedrich Bertino Elisa Bilalis Nicholas
Fraunhofer Institut, Germany University of Waikato, New Zealand Kitware, USA Budapest University of Technology and Economics, Hungary Eindhoven University of Technology, The Netherlands Duke University, USA Konstanz University, Germany Purdue University, USA Purdue University, USA Technical University of Crete, Greece
Bonneau Georges-Pierre Grenoble Université, France Brodlie Ken University of Leeds, UK Brown Ross Queensland University of Technology, Australia Callahan Steven University of Utah, USA Chen Jian Brown University, USA Chen Min University of Wales Swansea, UK Cheng Irene University of Alberta, Canada Chiang Yi-Jen Polytechnic University, USA Crosa Pere Brunet Universitat Politecnica de Catalunya, Spain Doleisch Helmut VRVis Research Center, Austria Duan Ye University of Missouri-Columbia, USA Dwyer Tim Monash University, Australia Ebert David Purdue University, USA Encarnasao James Miguel Imedia Labs, USA Entezari Alireza University of Florida, USA Ertl Thomas University of Stuttgart, Germany Floriani Leila De University of Maryland, USA Fujishiro Issei Tohoku University, Japan Geiger Christian Duesseldorf University of Applied Sciences, Germany Gotz David IBM, USA Grinstein Georges University of Massachusetts Lowell, USA Goebel Randy University of Alberta, Canada Gregory Michelle Pacific Northwest National Lab, USA Hadwiger Helmut Markus VRVis Research Center, Austria Hagen Hans Technical University of Kaiserslautern, Germany Ham van Frank IBM, USA Hamza-Lup Felix Armstrong Atlantic State University, USA Heer Jeffrey Armstrong University of California at Berkeley, USA Hege Hans-Christian Zuse Institute Berlin, Germany Hochheiser Harry Towson State University, USA Hollerer Tobias University of California at Santa Barbara, USA Hotz Ingrid Zuse Institute Berlin, Germany Julier Simon J. University College London, UK Kao David J. NASA Ames, USA Kohlhammer Jörn Fraunhofer Institut, Germany Koracin Darko Desert Research Institute, USA Kosara Robert University of North Carolina at Charlotte, USA Laidlaw David Brown University, USA
Laramee Robert Lee Chang Ha Liere Robert van Lim Ik Soo Linsen Lars Liu Zhanping Ma Kwan-Liu Maeder Anthony Malpica Jose Masutani Yoshitaka McGraw Tim Melançon Guy Miksch Silvia Mueller Klaus Museth Ken Paelke Volker Papka Michael Pugmire Dave Rabin Robert Raffin Bruno Rolland Jannick Santhanam Anand Scheuermann Gerik Shen Han-Wei Silva Claudio Sips Mike Slavik Pavel Snoeyink Jack Sourin Alexei Theisel Holger Thiele Olaf Tory Melanie Tricoche Xavier Umlauf Georg Viegas Fernanda Viola Ivan Wald Ingo Wan Ming Ward Matt Weinkauf Tino
Swansea University, UK Chung-Ang University, Korea CWI, The Netherlands Bangor University, UK Jacobs University, Germany Mississippi State University, USA University of California-Davis, USA CSIRO ICT Centre, Australia Alcala University, Spain The University of Tokyo Hospital, Japan West Virginia University, USA CNRS UMR 5800 LaBRI and INRIA Bordeaux Sud-Ouest, France Vienna University of Technology, Austria SUNY Stony Brook, USA Linköping University, Sweden Leibniz Universität Hannover, Germany Argonne National Laboratory, USA Los Alamos National Lab, USA University of Wisconsin at Madison, USA INRIA, France University of Central Florida, USA MD Anderson Cancer Center Orlando, USA University of Leipzig, Germany Ohio State University, USA University of Utah, USA Stanford University, USA Czech Technical University in Prague, Czech Republic University of North Carolina at Chapel Hill, USA Nanyang Technological University, Singapore Bielefeld University, Germany University of Mannheim, Germany University of Victoria, Canada Purdue University, USA University of Kaiserslautern, Germany IBM, USA University of Bergen, Norway University of Utah, USA Boeing Phantom Works, USA Worcester Polytechnic Institute, USA ZIB Berlin, Germany
Weiskopf Daniel Wetering van de Huub Wijk van Jarke Wylie Brian Yeasin Mohammed Yuan Xiaoru Zachmann Gabriel Zhang Eugene Zhukov Leonid
University of Stuttgart, Germany Technische Universiteit Eindhoven, The Netherlands Technische Universiteit Eindhoven, The Netherlands Sandia National Laboratory, USA Memphis University, USA Peking University, China Clausthal University, Germany Oregon State University, USA Caltech, USA
ISVC 2008 Special Tracks

1. Object Recognition

Organizers
Andrea Salgian, The College of New Jersey, USA
Fabien Scalzo, University of Nevada, Reno, USA

Program Committee
Boris Epshtein, The Weizmann Institute of Science, Israel
Svetlana Lazebnik, University of North Carolina at Chapel Hill, USA
Bastian Leibe, ETH Zurich, Switzerland
Vincent Lepetit, EPFL, Switzerland
Ales Leonardis, University of Ljubljana, Slovenia
Bogdan Matei, Sarnoff Corporation, USA
Raphael Maree, Universite de Liege, Belgium
Randal Nelson, University of Rochester, USA
Justus Piater, Universite de Liege, Belgium
Bill Triggs, INRIA, France
Tinne Tuytelaars, Katholieke Universiteit Leuven, Belgium
Michel Vidal-Naquet, RIKEN Brain Science Institute, Japan
2. Real-Time Vision Algorithm Implementation and Application

Organizers
D.J. Lee, Brigham Young University, USA
James Archibald, Brigham Young University, USA
Brent Nelson, Brigham Young University, USA
Doran Wilde, Brigham Young University, USA
Program Committee
Jiun-Jian Liaw, Chaoyang University of Technology, Taiwan
Che-Yen Wen, Central Police University, Taiwan
Yuan-Liang Tang, Chaoyang University of Technology, Taiwan
Hsien-Chou Liao, Chaoyang University of Technology, Taiwan
3. Visualization and Simulation on Immersive Display Devices

Organizers
Daniel Coming, Desert Research Institute, USA
Darko Koracin, Desert Research Institute, USA
Laura Monroe, Los Alamos National Lab, USA
Rachael Brady, Duke University, USA

Program Committee
Andy Forsberg, Brown University, USA
Bernd Hamann, University of California, Davis, USA
Arie Kaufman, Stony Brook University (SUNY), USA
Phil McDonald, Desert Research Institute, USA
Dave Modl, LANL/LAVA/Worldscape, USA
Patrick O'Leary, Desert Research Institute, USA
Dirk Reiners, LITE, USA
Bill Sherman, Desert Research Institute, USA
Steve Smith, LANL/LAVA/Worldscape, USA
Oliver Staadt, University of Rostock, Germany
4. Analysis and Visualization of Biomedical Visual Data

Organizers
Irene Cheng, University of Alberta, Canada
Anthony Maeder, University of Western Sydney, Australia

Program Committee
Walter Bischof, University of Alberta, Canada
Pierre Boulanger, University of Alberta, Canada
Ross Brown, Queensland University of Technology, Australia
Pablo Figueroa, Universidad de los Andes, Colombia
Carlos Flores, University of Alberta, Canada
Paul Jackway, CSIRO, Australia
Shoo Lee, iCARE, Capital Health, Canada
Tom Malzbender, HP Labs, USA
Mrinal Mandal, University of Alberta, Canada
Steven Miller, University of British Columbia, Canada
Jianbo Shi, University of Pennsylvania, USA
Claudio Silva, University of Utah, USA
Dimitris Gramenos, Institute of Computer Science-FORTH, Greece
Lijun Yin, University of Utah, USA
Xenophon Zabulis, Institute of Computer Science-FORTH, Greece
Jeffrey Zou, University of Western Sydney, Australia
5. Soft Computing in Image Processing and Computer Vision

Organizers
Gerald Schaefer, Nottingham Trent University, UK
Mike Nachtegael, Ghent University, Belgium
Aboul-Ella Hassanien, Cairo University, Egypt

Program Committee
Hüseyin Çakmak, Forschungszentrum Karlsruhe, Germany
Emre Celebi, Louisiana State University, USA
Kevin Curran, University of Ulster, UK
Mostafa A. El-Hosseini, Mubarak City for Science and Technology, Egypt
Hajime Nobuhara, Tokyo Institute of Technology, Japan
Samuel Morillas, Technical University of Valencia, Spain
Daniel Sanchez, University of Granada, Spain
Mayank Vatsa, University of Virginia, USA
Ioannis Vlachos, Aristotle University of Thessaloniki, Greece
Huiyou Zhou, Brunel University, UK
6. Computational Bioimaging and Visualization

Organizers
João Manuel R.S. Tavares, University of Porto, Portugal
Renato Natal Jorge, University of Porto, Portugal
Goswami Samrat, University of Texas at Austin, USA

Program Committee
Alberto De Santis, Università degli Studi di Roma "La Sapienza", Italy
Ana Mafalda Reis, Instituto de Ciencias Biomedicas Abel Salazar, Portugal
Arrate Muñoz Barrutia, University of Navarra, Spain
Chang-Tsun Li, University of Warwick, UK
Christos E. Constantinou, Stanford University School of Medicine, USA
Mrinal Mandal, University of Alberta, Canada
Daniela Iacoviello, Università degli Studi di Roma "La Sapienza", Italy
Dinggang Shen, University of Pennsylvania, USA
Eduardo Borges Pires, Instituto Superior Tecnico, Portugal
Enrique Alegre Gutiérrez, University of León, Spain
Filipa Sousa, University of Porto, Portugal
Gerhard A. Holzapfel, Royal Institute of Technology, Sweden
Hélder C. Rodrigues, Instituto Superior Tecnico, Portugal
Hemerson Pistori, Dom Bosco Catholic University, Brazil
Jorge M.G. Barbosa, University of Porto, Portugal
Jorge S. Marques, Instituto Superior Tecnico, Portugal
Jose M. García Aznar, University of Zaragoza, Spain
Luís Paulo Reis, University of Porto, Portugal
Manuel González Hidalgo, Balearic Islands University, Spain
Michel A. Audette, University of Leipzig, Germany
Patrick Dubois, Institut de Technologie Medicale, France
Reneta P. Barneva, State University of New York, USA
Roberto Bellotti, University of Bari, Italy
Sabina Tangaro, University of Bari, Italy
Sónia I. Gonçalves-Verheij, VU University Medical Centre, The Netherlands
Valentin Brimkov, State University of New York, USA
Yongjie Zhan, Carnegie Mellon University, USA
Xavier Roca Marvà, Autonomous University of Barcelona, Spain

7. Discrete and Computational Geometry and Their Applications in Visual Computing

Organizers
Valentin Brimkov, State University of New York, USA
Reneta Barneva, State University of New York, USA
Program Committee
K. Joost Batenburg, University of Antwerp, Belgium
Bedrich Benes, Purdue University, USA
Isabelle Debled-Rennesson, Institut Univ de Formation des Maitres de Lorraine, France
Christophe Fiorio, LIRMM, University Montpellier II, France
Gisela Klette, University of Auckland, New Zealand
Reinhard Klette, University of Auckland, New Zealand
Kostadin Koroutchev, Universidad Autonoma de Madrid, Spain
Benedek Nagy, University of Debrecen, Hungary
Kalman Palagyi, University of Szeged, Hungary
Arun Ross, West Virginia University, USA
K.G. Subramanian, Universiti Sains, Malaysia
João Manuel R.S. Tavares, University of Porto, Portugal
8. Image Analysis for Remote Sensing Data

Organizers
Jose A. Malpica, Alcala University, Spain
Maria A. Sanz, Technical University of Madrid, Spain
Maria C. Alonso, Alcala University, Spain
Program Committee
Hossein Arefi, Stuttgart University of Applied Sciences, Germany
Manfred Ehlers, University of Osnabrueck, Germany
María J. García-Rodríguez, University of Madrid, Spain
John L. van Genderen, ITC, The Netherlands
Radja Khedam, Technology and Sciences University, Algeria
José L. Lerma, Technical University of Valencia, Spain
Qingquan Li, Wuhan University, China
Dimitris Manolakis, MIT Lincoln Laboratory, USA
Farid Melgani, University of Trento, Italy
Jon Mills, University of Newcastle, UK
Francisco Papí, IGN, Spain
Karel Pavelka, Technical University in Prague, Czech Republic
William D. Philpot, Cornell University, USA
Daniela Poli, Swiss Federal Institute of Technology, Switzerland
Mohammad-Reza Saradjian, University of Tehran, Iran
Sriparna Saha, Indian Statistical Institute, India
Additional Reviewers

Shawn Hempel, RTT, USA
Chris Town, Cambridge University, UK
Steffen Koch, University of Stuttgart, Germany
Cliff Lindsay, Worcester Polytechnic Institute, USA
Florian Bingel, University of Applied Sciences Bonn Rhein Sieg, Germany
Ingo Feldmann, HHI, Germany
Javier Civera, University of Zaragoza, Spain
Vitor F. Pamplona, Federal University of Rio Grande do Sul (UFRGS), Brazil
Yong-wei Miao, Zhejiang University of Technology, China
Giacinto Donvito, Istituto Nazionale di Fisica Nucleare, Italy
Vincenzo Spinoso, Istituto Nazionale di Fisica Nucleare, Italy
Michel Verleysen, Université catholique de Louvain, Belgium
Mark Keck, Ohio State University, USA
Arthur Szlam, University of California at Los Angeles, USA
Karthik Sankaranarayanan, Ohio State University, USA
Organizing Institutions and Sponsors
Table of Contents – Part I
ST: Object Recognition Detection of a Large Number of Overlapping Ellipses Immersed in Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Armando Manuel Fernandes Recognizing Ancient Coins Based on Local Features . . . . . . . . . . . . . . . . . . Martin Kampel and Maia Zaharieva
1 11
Learning Pairwise Dissimilarity Profiles for Appearance Recognition in Visual Surveillance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhe Lin and Larry S. Davis
23
Edge-Based Template Matching and Tracking for Perspectively Distorted Planar Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andreas Hofhauser, Carsten Steger, and Nassir Navab
35
Enhancing Boundary Primitives Using a Multiscale Quadtree Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Robert Bergevin and Vincent Bergeron
45
3D Object Modeling and Segmentation Based on Edge-Point Matching with Local Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Masahiro Tomono
55
Computer Graphics I Cumulus Cloud Synthesis with Similarity Solution and Particle/Voxel Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bei Wang, Jingliang Peng, and C.-C. Jay Kuo
65
An Efficient Wavelet-Based Framework for Articulated Human Motion Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chao-Hua Lee and Joan Lasenby
75
On Smooth Bicubic Surfaces from Quad Meshes . . . . . . . . . . . . . . . . . . . . . Jianhua Fan and Jörg Peters
87
Simple Steps for Simply Stepping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chun-Chih Wu, Jose Medina, and Victor B. Zordan
97
Fairing of Discrete Surfaces with Boundary That Preserves Size and Qualitative Shape . . . . . . . . . . . . . . . . . . . . . . . . Jana Kostlivá, Radim Šára, and Martina Matýsková
107
Fast Decimation of Polygonal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Muhammad Hussain
119
Visualization I Visualizing Argument Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peter Sbarski, Tim van Gelder, Kim Marriott, Daniel Prager, and Andy Bulka
129
Visualization of Industrial Structures with Implicit GPU Primitives . . . . Rodrigo de Toledo and Bruno Levy
139
Cartesian vs. Radial – A Comparative Evaluation of Two Visualization Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Burch, Felix Bott, Fabian Beck, and Stephan Diehl
151
VoxelBars: An Informative Interface for Volume Visualization . . . . . . . . . Wai-Ho Mak, Ming-Yuen Chan, Yingcai Wu, Ka-Kei Chung, and Huamin Qu
161
Wind Field Retrieval and Display for Doppler Radar Data . . . . . . . . . . . . Shyh-Kuang Ueng and Yu-Chong Chiang
171
Dual Marching Tetrahedra: Contouring in the Tetrahedronal Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gregory M. Nielson
183
ST: Real-Time Vision Algorithm Implementation and Application Vision-Based Localization for Mobile Robots Using a Set of Known Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pablo Frank-Bolton, Alicia Montserrat Alvarado-Gonz´ alez, Wendy Aguilar, and Yann Frauel On the Advantages of Asynchronous Pixel Reading and Processing for High-Speed Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fernando Pardo, Jose A. Boluda, Francisco Vegara, and Pedro Zuccarello An Optimized Software-Based Implementation of a Census-Based Stereo Matching Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Zinner, Martin Humenberger, Kristian Ambrosch, and Wilfried Kubinger Mutual Information Based Semi-Global Stereo Matching on the GPU . . . Ines Ernst and Heiko Hirschm¨ uller
195
205
216
228
Accurate Optical Flow Sensor for Obstacle Avoidance . . . . . . . . . . . . . . . . Zhaoyi Wei, Dah-Jye Lee, Brent E. Nelson, and Kirt D. Lillywhite A Novel 2D Marker Design and Application for Object Tracking and Event Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xu Liu, David Doermann, Huiping Li, K.C. Lee, Hasan Ozdemir, and Lipin Liu
240
248
Segmentation Automatic Lung Segmentation of Volumetric Low-Dose CT Scans Using Graph Cuts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Asem M. Ali and Aly A. Farag
258
A Continuous Labeling for Multiphase Graph Cut Image Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohamed Ben Salah, Amar Mitiche, and Ismail Ben Ayed
268
A Graph-Based Approach for Image Segmentation . . . . . . . . . . . . . . . . . . . Thang V. Le, Casimir A. Kulikowski, and Ilya B. Muchnik Active Contours Driven by Supervised Binary Classifiers for Texture Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Julien Olivier, Romuald Bon´e, Jean-Jacques Rousselle, and Hubert Cardot Proximity Graphs Based Multi-scale Image Segmentation . . . . . . . . . . . . . Alexei N. Skurikhin Improved Adaptive Spatial Information Clustering for Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhi Min Wang, Qing Song, Yeng Chai Soh, and Kang Sim Stable Image Descriptions Using Gestalt Principles . . . . . . . . . . . . . . . . . . . Yi-Zhe Song and Peter M. Hall
278
288
298
308 318
Shape/Recognition I A Fast and Effective Dichotomy Based Hash Algorithm for Image Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhoucan He and Qing Wang
328
Evaluation of Gradient Vector Flow for Interest Point Detection . . . . . . . Julian Stöttinger, René Donner, Lech Szumilas, and Allan Hanbury
338
Spatially Enhanced Bags of Words for 3D Shape Retrieval . . . . . . . . . . . . Xiaolan Li, Afzal Godil, and Asim Wagan
349
Image Matching Using High Dynamic Range Images and Radial Feature Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Krishnaprasad Jagadish and Eric Sinzinger
359
Random Subwindows for Robust Peak Recognition in Intracranial Pressure Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fabien Scalzo, Peng Xu, Marvin Bergsneider, and Xiao Hu
370
A New Shape Benchmark for 3D Object Retrieval . . . . . . . . . . . . . . . . . . . . Rui Fang, Afzal Godil, Xiaolan Li, and Asim Wagan
381
Shape Extraction through Region-Contour Stitching . . . . . . . . . . . . . . . . . . Elena Bernardis and Jianbo Shi
393
Video Analysis and Event Recognition Difference of Gaussian Edge-Texture Based Background Modeling for Dynamic Traffic Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Amit Satpathy, How-Lung Eng, and Xudong Jiang A Sketch-Based Approach for Detecting Common Human Actions . . . . . . Evan A. Suma, Christopher Walton Sinclair, Justin Babbs, and Richard Souvenir Multi-view Video Analysis of Humans and Vehicles in an Unconstrained Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D.M. Hansen, P.T. Duizer, S. Park, T.B. Moeslund, and M.M. Trivedi Self-Organizing Maps for the Automatic Interpretation of Crowd Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B. Zhan, P. Remagnino, N. Monekosso, and S.A. Velastin A Visual Tracking Framework for Intent Recognition in Videos . . . . . . . . Alireza Tavakkoli, Richard Kelley, Christopher King, Mircea Nicolescu, Monica Nicolescu, and George Bebis Unsupervised Video Shot Segmentation Using Global Color and Texture Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuchou Chang, Dah-Jye Lee, Yi Hong, and James Archibald Progressive Focusing: A Top Down Attentional Vision System . . . . . . . . . Roland Chapuis, Frederic Chausse, and Noel Trujillo
406 418
428
440 450
460 468
Virtual Reality I The Benefits of Co-located Collaboration and Immersion on Assembly Modeling in Virtual Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . David d’Angelo, Gerold Wesche, Maxim Foursa, and Manfred Bogen
478
Simple Feedforward Control for Responsive Motion Capture-Driven Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rubens F. Nunes, Creto A. Vidal, Joaquim B. Cavalcante-Neto, and Victor B. Zordan Markerless Vision-Based Tracking of Partially Known 3D Scenes for Outdoor Augmented Reality Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . Fakhreddine Ababsa, Jean-Yves Didier, Imane Zendjebil, and Malik Mallem Multiple Camera, Multiple Person Tracking with Pointing Gesture Recognition in Immersive Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anuraag Sridhar and Arcot Sowmya
488
498
508
Augmented Reality Using Projective Invariant Patterns . . . . . . . . . . . . . . . Lucas Teixeira, Manuel Loaiza, Alberto Raposo, and Marcelo Gattass
520
Acquisition of High Quality Planar Patch Features . . . . . . . . . . . . . . . . . . . Harald Wuest, Folker Wientapper, and Didier Stricker
530
ST: Computational Bioimaging and Visualization Level Set Segmentation of Cellular Images Based on Topological Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weimiao Yu, Hwee Kuan Lee, Srivats Hariharan, Wenyu Bu, and Sohail Ahmed A Novel Algorithm for Automatic Brain Structure Segmentation from MRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qing He, Kevin Karsch, and Ye Duan Brain Lesion Segmentation through Physical Model Estimation . . . . . . . . Marcel Prastawa and Guido Gerig Calibration of Bi-planar Radiography with a Rangefinder and a Small Calibration Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel C. Moura, Jorge G. Barbosa, Jo˜ ao Manuel R.S. Tavares, and Ana M. Reis Identification of Cell Nucleus Using a Mumford-Shah Ellipse Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Choon Kong Yap and Hwee Kuan Lee Evaluation of Brain MRI Alignment with the Robust Hausdorff Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andriy Fedorov, Eric Billet, Marcel Prastawa, Guido Gerig, Alireza Radmanesh, Simon K. Warfield, Ron Kikinis, and Nikos Chrisochoides
540
552 562
572
582
594
Computer Graphics II User Driven Two-Dimensional Computer-Generated Ornamentation . . . . . . . Dustin Anderson and Zoë Wood
604
Efficient Schemes for Monte Carlo Markov Chain Algorithms in Global Illumination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu-Chi Lai, Feng Liu, Li Zhang, and Charles Dyer
614
Adaptive CPU Scheduling to Conserve Energy in Real-Time Mobile Graphics Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fan Wu, Emmanuel Agu, and Clifford Lindsay
624
A Quick 3D-to-2D Points Matching Based on the Perspective Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Songxiang Gu, Clifford Lindsay, Michael A. Gennert, and Michael A. King Deformation-Based Animation of Snake Locomotion . . . . . . . . . . . . . . . . . . Yeongho Seol and Junyong Noh GPU-Supported Image Compression for Remote Visualization – Realization and Benchmarking . . . . . . . . . . . . . . . . . . . . . . . Stefan Lietsch and Paul Hermann Lensing
634
646
658
ST: Discrete and Computational Geometry I Linear Time Constant-Working Space Algorithm for Computing the Genus of a Digital Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Valentin E. Brimkov and Reneta Barneva
669
Offset Approach to Defining 3D Digital Lines . . . . . . . . . . . . . . . . . . . . . . . . Valentin E. Brimkov, Reneta P. Barneva, Boris Brimkov, and Fran¸cois de Vieilleville
678
Curvature and Torsion Estimators for 3D Curves . . . . . . . . . . . . . . . . . . . . Thanh Phuong Nguyen and Isabelle Debled-Rennesson
688
Threshold Selection for Segmentation of Dense Objects in Tomograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . W. van Aarle, K.J. Batenburg, and J. Sijbers
700
Comparison of Discrete Curvature Estimators and Application to Corner Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B. Kerautret, J.-O. Lachaud, and B. Naegel
710
Computing and Visualizing Constant-Curvature Metrics on Hyperbolic 3-Manifolds with Boundaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaotian Yin, Miao Jin, Feng Luo, and Xianfeng David Gu
720
ST: Soft Computing in Image Processing and Computer Vision Iris Recognition: A Method to Segment Visible Wavelength Iris Images Acquired On-the-Move and At-a-Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . Hugo Proen¸ca 3D Textural Mapping and Soft-Computing Applied to Cork Quality Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Beatriz Paniagua, Miguel A. Vega-Rodr´ıguez, Mike Chantler, Juan A. G´ omez-Pulido, and Juan M. S´ anchez-P´erez Analysis of Breast Thermograms Based on Statistical Image Features and Hybrid Fuzzy Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gerald Schaefer, Tomoharu Nakashima, and Michal Zavisek Efficient Facial Feature Detection Using Entropy and SVM . . . . . . . . . . . . Qiong Wang, Chunxia Zhao, and Jingyu Yang
731
743
753 763
Type-2 Fuzzy Mixture of Gaussians Model: Application to Background Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fida El Baf, Thierry Bouwmans, and Bertrand Vachon
772
Unsupervised Clustering Algorithm for Video Shots Using Spectral Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lin Zhong, Chao Li, Huan Li, and Zhang Xiong
782
Reconstruction Noise Analysis of a SFS Algorithm Formulated under Various Imaging Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Amal A. Farag, Shireen Y. Elhabian, Abdelrehim H. Ahmed, and Aly A. Farag
793
Shape from Texture Via Fourier Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . Fabio Galasso and Joan Lasenby
803
Full Camera Calibration from a Single View of Planar Scene . . . . . . . . . . Yisong Chen, Horace Ip, Zhangjin Huang, and Guoping Wang
815
Robust Two-View External Calibration by Combining Lines and Scale Invariant Point Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaolong Zhang, Jin Zhou, and Baoxin Li Stabilizing Stereo Correspondence Computation Using Delaunay Triangulation and Planar Homography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chao-I Chen, Dusty Sargent, Chang-Ming Tsai, Yuan-Fang Wang, and Dan Koppel
825
836
ST: Visualization and Simulation on Immersive Display Devices Immersive Visualization and Analysis of LiDAR Data . . . . . . . . . . . . . . . . Oliver Kreylos, Gerald W. Bawden, and Louise H. Kellogg VR Visualisation as an Interdisciplinary Collaborative Data Exploration Tool for Large Eddy Simulations of Biosphere-Atmosphere Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gil Bohrer, Marcos Longo, David J. Zielinski, and Rachael Brady User Experience of Hurricane Visualization in an Immersive 3D Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J. Sanyal, P. Amburn, S. Zhang, J. Dyer, P.J. Fitzpatrick, and R.J. Moorhead Immersive 3d Visualizations for Software-Design Prototyping and Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anthony Savidis, Panagiotis Papadakos, and George Zargianakis
846
856
867
879
Enclosed Five-Wall Immersive Cabin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Feng Qiu, Bin Zhang, Kaloian Petkov, Lance Chong, Arie Kaufman, Klaus Mueller, and Xianfeng David Gu
891
Environment-Independent VR Development . . . . . . . . . . . . . . . . . . . . . . . . . Oliver Kreylos
901
ST: Discrete and Computational Geometry II Combined Registration Methods for Pose Estimation . . . . . . . . . . . . . . . . . Dong Han, Bodo Rosenhahn, Joachim Weickert, and Hans-Peter Seidel Local Non-planarity of Three Dimensional Surfaces for an Invertible Reconstruction: k-Cuspal Cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ´ Marc Rodr´ıguez, Ga¨elle Largeteau-Skapin, and Eric Andres
913
925
A New Variant of the Optimum-Path Forest Classifier . . . . . . . . . . . . . . . . Jo˜ ao P. Papa and Alexandre X. Falc˜ ao
935
Results on Hexagonal Tile Rewriting Grammars . . . . . . . . . . . . . . . . . . . . . D.G. Thomas, F. Sweety, and T. Kalyani
945
Lloyd’s Algorithm on GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cristina N. Vasconcelos, Asla S´ a, Paulo Cezar Carvalho, and Marcelo Gattass
953
Computing Fundamental Group of General 3-Manifold . . . . . . . . . . . . . . . . Junho Kim, Miao Jin, Qian-Yi Zhou, Feng Luo, and Xianfeng Gu
965
Virtual Reality II OmniMap: Projective Perspective Mapping API for Non-planar Immersive Display Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Clement Shimizu, Jim Terhorst, and David McConville
975
Two-Handed and One-Handed Techniques for Precise and Efficient Manipulation in Immersive Virtual Environments . . . . . . . . . . . . . . . . . . . . Noritaka Osawa
987
Automotive Spray Paint Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jonathan Konieczny, John Heckman, Gary Meyer, Mark Manyen, Marty Rabens, and Clement Shimizu
998
Using Augmented Reality and Interactive Simulations to Realize Hybrid Prototypes . . . 1008
  Florian Niebling, Rita Griesser, and Uwe Woessner
Immersive Simulator for Fluvial Combat Training . . . 1018
  Diego A. Hincapié Ossa, Sergio A. Ordóñez Medina, Carlos Francisco Rodríguez, and José Tiberio Hernández
A Low-Cost, Linux-Based Virtual Environment for Visualizing Vascular Structures . . . 1028
  Thomas Wischgoll
ST: Analysis and Visualization of Biomedical Visual Data
Visualization of Dynamic Connectivity in High Electrode-Density EEG . . . 1040
  Alfonso Alba and Edgar Arce-Santana
Generation of Unit-Width Curve Skeletons Based on Valence Driven Spatial Median (VDSM) . . . 1051
  Tao Wang and Irene Cheng
Intuitive Visualization and Querying of Cell Motion . . . 1061
  Richard Souvenir, Jerrod P. Kraftchick, and Min C. Shin
Registration of 2D Histological Images of Bone Implants with 3D SRµCT Volumes . . . 1071
  Hamid Sarve, Joakim Lindblad, and Carina B. Johansson
Measuring an Animal Body Temperature in Thermographic Video Using Particle Filter Tracking . . . 1081
  Atousa Torabi, Guillaume-Alexandre Bilodeau, Maxime Levesque, J.M. Pierre Langlois, Pablo Lema, and Lionel Carmant
A New Parallel Approach to Fuzzy Clustering for Medical Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1092 Huynh Van Luong and Jong Myon Kim
Computer Graphics III
Tracking Data Structures Coherency in Animated Ray Tracing: Kalman and Wiener Filters Approach . . . 1102
  Sajid Hussain and Håkan Grahn
Hardware Accelerated Per-Texel Ambient Occlusion Mapping . . . 1115
  Tim McGraw and Brian Sowers
Comics Stylization from Photographs . . . 1125
  Catherine Sauvaget and Vincent Boyer
Leaking Fluids . . . 1135
  Kiwon Um and JungHyun Han
Automatic Structure-Aware Inpainting for Complex Image Content . . . 1144
  Patrick Ndjiki-Nya, Martin Köppel, Dimitar Doshkov, and Thomas Wiegand
Multiple Aligned Characteristic Curves for Surface Fairing . . . 1157
  Janick Martinez Esturo, Christian Rössl, and Holger Theisel
Author Index . . . 1167
Table of Contents – Part II
Visualization II SUNVIZ: A Real-Time Visualization Environment for Space Physics Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . S. Eliuk, P. Boulanger, and K. Kabin
1
An Efficient Quality-Based Camera Path Planning Method for Volume Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ming-Yuen Chan, Wai-Ho Mak, and Huamin Qu
12
A Fast and Simple Heuristic for Metro Map Path Simplification . . . . . . . . Tim Dwyer, Nathan Hurst, and Damian Merrick
22
Visual Verification of Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thorsten May and Joern Kohlhammer
31
LDR-LLE: LLE with Low-Dimensional Neighborhood Representation . . . Yair Goldberg and Ya’acov Ritov
43
SudokuVis: How to Explore Relationships of Mutual Exclusion . . . . . . . . Gudrun Klinker
55
ST: Image Analysis for Remote Sensing Data Identification of Oceanic Eddies in Satellite Images . . . . . . . . . . . . . . . . . . . Armando Manuel Fernandes
65
Multi-image Fusion in Remote Sensing: Spatial Enhancement vs. Spectral Characteristics Preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Manfred Ehlers
75
Classification of Multispectral High-Resolution Satellite Imagery Using LIDAR Elevation Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mar´ıa C. Alonso and Jos´e A. Malpica
85
Semi-supervised Edge Learning for Building Detection in Aerial Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fenglei Yang, Ye Duan, and Yue Lu
95
High Resolution Satellite Classification with Graph Cut Algorithms . . . . Adrian A. L´ opez and Jos´e A. Malpica
105
Satellite Image Segmentation Using Wavelet Transforms Based on Color and Texture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ricardo Dutra da Silva, Rodrigo Minetto, William Robson Schwartz, and Helio Pedrini
113
Shape/Recognition II A System for Rapid Interactive Training of Object Detectors . . . . . . . . . . Nathaniel Roman and Robert Pless An Integrated Method for Multiple Object Detection and Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dipankar Das, Al Mansur, Yoshinori Kobayashi, and Yoshinori Kuno
123
133
A Context Dependent Distance Measure for Shape Clustering . . . . . . . . . Rolf Lakaemper and JingTing Zeng
145
A New Accumulator-Based Approach to Shape Recognition . . . . . . . . . . . Karthik Krish and Wesley Snyder
157
Multi-dimensional Scale Saliency Feature Extraction Based on Entropic Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P. Suau and F. Escolano
170
Merging Active Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ismail Ben Ayed and Amar Mitiche
181
Contour Extraction Using Particle Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . ChengEn Lu, Longin Jan Latecki, and Guangxi Zhu
192
Motion Combining Line and Point Correspondences for Homography Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Elan Dubrofsky and Robert J. Woodham
202
Integration of Local Image Cues for Probabilistic 2D Pose Recovery . . . . Paul Kuo, Dimitrios Makris, Najla Megherbi, and Jean-Christophe Nebel
214
Indirect Tracking to Reduce Occlusion Problems . . . . . . . . . . . . . . . . . . . . . Peter Keitler, Michael Schlegel, and Gudrun Klinker
224
Real-Time Lip Contour Extraction and Tracking Using an Improved Active Contour Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jingying Chen, Bernard Tiddeman, and Gang Zhao
236
Particle Filter Based Object Tracking with Discriminative Feature Extraction and Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yao Shen, Parthasarathy Guturu, Thyagaraju Damarla, and Bill P. Buckles A New Global Alignment Method for Feature Based Image Mosaicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. Elibol, R. Garcia, O. Delaunoy, and N. Gracias An Effective Active Vision System for Gaze Control . . . . . . . . . . . . . . . . . . Yann Ducrocq, Shahram Bahrami, Luc Duvieubourg, and Fran¸cois Cabestaing
246
257 267
Face/Gesture Face Recognition Based on Normalization and the Phase Spectrum of the Local Part of an Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jesus Olivares-Mercado, Kazuhiro Hotta, Haruhisa Takahashi, Hector Perez-Meana, Mariko Nakano Miyatake, and Gabriel Sanchez-Perez A Novel Shape Registration Framework and Its Application to 3D Face Recognition in the Presence of Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . Rachid Fahmi and Aly A. Farag
278
287
Frontal Face Recognition from Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Angshul Majumdar and Panos Nasiopoulos
297
Real Time Hand Based Robot Control Using 2D/3D Images . . . . . . . . . . . Seyed Eghbal Ghobadi, Omar Edmond Loepprich, Farid Ahmadov, Jens Bernshausen, Klaus Hartmann, and Otmar Loffeld
307
Facial Trait Code and Its Application to Face Recognition . . . . . . . . . . . . Ping-Han Lee, Gee-Sern Hsu, Tsuhan Chen, and Yi-Ping Hung
317
Using Multiple Masks to Improve End-to-End Face Recognition Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christopher A. Neylan and Andrea Salgian Sparse Representation for Ear Biometrics . . . . . . . . . . . . . . . . . . . . . . . . . . . Imran Naseem, Roberto Togneri, and Mohammed Bennamoun
329 336
Computer Vision Applications Image-Based Information Guide on Mobile Devices . . . . . . . . . . . . . . . . . . . Jimmy Addison Lee, Kin-Choong Yow, and Andrzej Sluzek
346
Estimating Atmospheric Visibility Using General-Purpose Cameras . . . . . Ling Xie, Alex Chiu, and Shawn Newsam
356
Numismatic Object Identification Using Fusion of Shape and Local Descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . R. Huber-M¨ ork, M. Zaharieva, and H. Czedik-Eysenberg
368
Personalized News Video Recommendation Via Interactive Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jianping Fan, Hangzai Luo, Aoying Zhou, and Daniel A. Keim
380
Browsing a Large Collection of Community Photos Based on Similarity on GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Grant Strong and Minglun Gong
390
Security Analysis for Spread-Spectrum Watermarking Incorporating Statistics of Natural Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dong Zhang, Jiangqun Ni, and Dah-Jye Lee
400
Multi-view Feature Matching and Image Grouping from Multiple Unordered Wide-Baseline Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiuyuan Zeng, Heng Yang, and Qing Wang
410
Stitching Video from Webcams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mai Zheng, Xiaolin Chen, and Li Guo
420
Poster GpuCV: A GPU-Accelerated Framework for Image Processing and Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yannick Allusse, Patrick Horain, Ankit Agarwal, and Cindula Saipriyadarshan
430
A Comparison Study on Two Multi-scale Shape Matching Schemes . . . . . Bo Li and Henry Johan
440
PAD Model Based Facial Expression Analysis . . . . . . . . . . . . . . . . . . . . . . . Jie Cao, Hong Wang, Po Hu, and Junwei Miao
450
Calibration and Pose Estimation of a Pox-slits Camera from a Single Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. Martins and H. Ara´ ujo
460
Covariance Matrices for Crowd Behaviour Monitoring on the Escalator Exits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Md. Haidar Sharif, Nacim Ihaddadene, and Chabane Djeraba
470
User Verification by Combining Speech and Face Biometrics in Video . . . Imran Naseem and Ajmal Mian A Gibbsian Kohonen Network for Online Arabic Character Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Neila Mezghani and Amar Mitiche
482
493
Shading through Defocus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jos´e R.A. Torre˜ ao and Jo˜ ao L. Fernandes A Gabor Quotient Image for Face Recognition under Varying Illumination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sanun Srisuk and Amnart Petpon
501
511
Personal Identification Using Palmprint and Contourlet Transform . . . . . Atif Bin Mansoor, M. Mumtaz, H. Masood, M. Asif A. Butt, and Shoab A. Khan
521
Generating Reflection Transparent Image Using Image Fusion Space . . . . Satoru Morita and Yasutoshi Sugiman
531
Fingerprint Images Enhancement in Curvelet Domain . . . . . . . . . . . . . . . . Gholamreza Amayeh, Soheil Amayeh, and Mohammad Taghi Manzuri
541
Effective Frame Rate Decision by Lagrange Optimization for Frame Skipping Video Transcoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ching-Ting Hsu, Chia-Hung Yeh, and Mei-Juan Chen
551
Symmetry of Shapes Via Self-similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xingwei Yang, Nagesh Adluru, Longin Jan Latecki, Xiang Bai, and Zygmunt Pizlo
561
Robust Estimation Approach for NL-Means Filter . . . . . . . . . . . . . . . . . . . . J. Dinesh Peter, V.K. Govindan, and Abraham T. Mathew
571
View-Invariant Pose Recognition Using Multilinear Analysis and the Universum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bo Peng, Gang Qian, and Yunqian Ma
581
Scaling Up a Metric Learning Algorithm for Image Recognition and Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Adrian Perez-Suay and Francesc J. Ferri
592
Smile Detection for User Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . O. Deniz, M. Castrillon, J. Lorenzo, L. Anton, and G. Bueno A Novel Segmentation Algorithm for Digital Subtraction Angiography Images: First Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Danilo Franchi, Pasquale Gallo, and Giuseppe Placidi Image Representation in Differential Space . . . . . . . . . . . . . . . . . . . . . . . . . . Shengzhi Du, Barend Jacobus van Wyk, M. Antonie van Wyk, Guoyuan Qi, Xinghui Zhang, and Chunling Tu A Four Point Algorithm for Fast Metric Cone Reconstruction from a Calibrated Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jin Zhou and Baoxin Li
602
612 624
634
Texture-Based Shadow Removal from a Single Color Image . . . . . . . . . . . . Qiang He and Chee-Hung Henry Chu
644
Multi-source Airborne IR and Optical Image Fusion and Its Application to Target Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fenghui Yao and Ali Sekmen
651
A New Adaptive Combination Approach to Score Level Fusion for Face and Iris Biometrics Combining Wavelets and Statistical Moments . . . . . . Nicolas Morizet and J´erˆ ome Gilles
661
Medical Image Zooming Algorithm Based on Bivariate Rational Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shanshan Gao, Caiming Zhang, Yunfeng Zhang, and Yuanfeng Zhou
672
2D Shape Decomposition Based on Combined Skeleton-Boundary Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . JingTing Zeng, Rolf Lakaemper, XingWei Yang, and Xin Li
682
Removing Pose from Face Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Se´ an Begley, John Mallon, and Paul F. Whelan A Real Time Fingers Detection by Symmetry Transform Using a Two Cameras System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rachid Belaroussi and Maurice Milgram High Resolution and High Dynamic Range Image Reconstruction from Differently Exposed Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hiroyuki Nakai, Shuhei Yamamoto, Yasuhiro Ueda, and Yoshihide Shigeyama
692
703
713
PDE-Based Facial Animation: Making the Complex Simple . . . . . . . . . . . Yun Sheng, Phil Willis, Gabriela Gonzalez Castro, and Hassan Ugail
723
A Variational Level Set Method for Multiple Object Detection . . . . . . . . . Zhenkuan Pan, Hua Li, Weibo Wei, and Shuhua Xu
733
Detecting Thalamic Abnormalities in Autism Using Cylinder Conformal Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qing He, Ye Duan, Xiaotian Yin, Xianfeng Gu, Kevin Karsch, and Judith Miles
743
Extraction of Illumination Effects from Natural Images with Color Transition Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hiroaki Nishihara and Tomoharu Nagao
752
A Novel Macroblock-Level Rate-Distortion Optimization Scheme for H.264/AVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hong-jun Wang, Chang Sun, and Hua Li
762
Automatic Segmentation of the Apparent Contour for 3D Modeling of Cutting Tools from Single View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xi Zhang, Waiming Tsang, Xiaodong Tian, Kazuo Yamazaki, and Masahiko Mori On Semantic Object Detection with Salient Feature . . . . . . . . . . . . . . . . . . Zhidong Li and Jing Chen
772
782
A Generic and Parallel Algorithm for 2D Image Discrete Contour Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guillaume Damiand and David Coeurjolly
792
Spatial Filtering with Multi-scale Segmentation Based on Gaussian Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chi-Fan Chen and Chia-Hsin Liang
802
Visibility-Based Test Scene Understanding by Real Plane Search . . . . . . . Jae-Kyu Lee, Seongjin Ahn, and Jin Wook Chung
813
Real-Time Face Verification for Mobile Platforms . . . . . . . . . . . . . . . . . . . . Sung-Uk Jung, Yun-Su Chung, Jang-Hee Yoo, and Ki-Young Moon
823
3D Human Motion Tracking Using Progressive Particle Filter . . . . . . . . . . Shih-Yao Lin and I-Cheng Chang
833
Visual Servoing for Patient Alignment in ProtonTherapy . . . . . . . . . . . . . . Rachid Belaroussi and Guillaume Morel
843
Improving Recognition through Object Sub-categorization . . . . . . . . . . . . Al Mansur and Yoshinori Kuno
851
Similarity Measure of the Visual Features Using the Constrained Hierarchical Clustering for Content Based Image Retrieval . . . . . . . . . . . . Sang Min Yoon and Holger Graf An Experimental Study of Reconstruction of Tool Cutting Edge Features Using Space Carving Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wai Ming Tsang, Xi Zhang, Kazuo Yamazaki, Xiaodong Tian, and Masahiko Mori Real Time Object Tracking in a Video Sequence Using a Fixed Point DSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Syed Aamir Ali Shah, Tahir Jamil Khattak, Muhammad Farooq, Yahya M. Khawaja, Abdul Bais, Asim Anees, and Muhammad U.K. Khan Gesture Recognition for a Webcam-Controlled First Person Shooter . . . . Robert W. Wilson and Andrea Salgian
860
869
879
889
3D Line Reconstruction of a Road Environment Using an In-Vehicle Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Toshihiro Asai, Koichiro Yamaguchi, Yoshiko Kojima, Takashi Naito, and Yoshiki Ninomiya
897
Braille Document Parameters Estimation for Optical Character Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhenfei Tai, Samuel Cheng, and Pramode Verma
905
Bio-imaging Toolkit for Indexing, Searching, Navigation, Discovery and Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Afzal Godil, Benny Cheung, Asim Wagan, and Xiaolan Li
915
Stereoscopic View Synthesis by View Morphing . . . . . . . . . . . . . . . . . . . . . . Seon-Min Rhee, Jongmoo Choi, and Ulrich Neumann Edge Detection from Global and Local Views Using an Ensemble of Multiple Edge Detectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yuchou Chang, Dah-Jye Lee, Yi Hong, and James Archibald
924
934
An Effective and Fast Lane Detection Algorithm . . . . . . . . . . . . . . . . . . . . . Chung-Yen Su and Gen-Hau Fan
942
Towards Real-Time Monocular Video-Based Avatar Animation . . . . . . . . Utkarsh Gaur, Amrita Jain, and Sanjay Goel
949
Temporal Computational Objects: A Process for Dynamic Surface Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kurt W. Swanson, Kenneth A. Brakke, and David E. Breen Hardware-Accelerated Particle-Based Volume Rendering for Multiple Irregular Volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Naohisa Sakamoto, Ding Zhongming, Takuma Kawamura, and Koji Koyamada
959
970
Immersive Visualization of Casting Flow and Solidification . . . . . . . . . . . . Jiyoung Park, Sang-Hyun Cho, Jung-Gil Choi, and Myoung-Hee Kim
980
Graph-Based Visual Analytic Tools for Parallel Coordinates . . . . . . . . . . . Kai Lun Chung and Wei Zhuo
990
Modeling and Visualization Approaches for Time-Varying Volumetric Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1000 Kenneth Weiss and Leila De Floriani Ubiquitous Interactive Visualization of 3-D Mantle Convection through Web Applications Using Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1011 Jonathan C. Mc Lane, Wojciech W. Czech, David A. Yuen, Michael R. Knox, James B.S.G. Greensky, M. Charley Kameyama, Vincent M. Wheeler, Rahul Panday, and Hiroki Senshu
Streaming Mesh Optimization for CAD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1022 Tian Xia and Eric Shaffer An Iterative Method for Fast Mesh Denoising . . . . . . . . . . . . . . . . . . . . . . . 1034 Shuhua Lai and Fuhua (Frank) Cheng On the Performance and Scalability of a GPU-Limited Commodity Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1044 Jorge Luis Williams and Robert E. Hiromoto Algorithms for the Automatic Design of Non-formal Urban Parks . . . . . . 1056 Soon Tee Teoh Hybrid Shading Model Based on Device Performance for LOD Adaptive Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1066 Hakran Kim and Hwajin Park Incremental Texture Compression for Real-Time Rendering . . . . . . . . . . . 1076 Ying Tang and Jing Fan Geometry Independent Raindrop Splash Rendering for Generic, Complex Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1086 J¨ urgen Rossmann and Nico Hempe Extension of B-Spline Curves with G2 Continuity . . . . . . . . . . . . . . . . . . . . 1096 Yuan-feng Zhou, Cai-ming Zhang, and Shan-shan Gao Building New Mixed Reality Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1106 Camilo A. Perez and Pablo A. Figueroa A Novel Acceleration Coding/Reconstruction Algorithm for Magnetic Resonance Imaging in Presence of Static Magnetic Field In-Homogeneities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115 Giuseppe Placidi, Danilo Franchi, Angelo Galante, and Antonello Sotgiu Reconstruction of Some Segmented and Dynamic Scenes: Trifocal Tensors in P4 , Theoretical Set Up for Critical Loci, and Instability . . . . . 1125 Marina Bertolini, GianMario Besana, and Cristina Turrini Efficient Algorithms for Reconstruction of 2D-Arrays from Extended Parikh Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1137 V. Masilamani, Kamala Krithivasan, K.G. Subramanian, and Ang Miin Huey Reconstruction of Binary Images with Few Disjoint Components from Two Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1147 P´eter Bal´ azs
A Connection between Zn and Generalized Triangular Grids . . . . . . . . . . . 1157 Benedek Nagy and Robin Strand Collage of Hexagonal Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1167 F. Sweety, D.G. Thomas, and T. Kalyani Discrete Contour Extraction from Reference Curvature Function . . . . . . . 1176 H.G. Nguyen, B. Kerautret, P. Desbarats, and J.-O. Lachaud Change Detection with SPOT-5 and FORMOSAT-2 Imageries . . . . . . . . . 1186 Patricia Cifuentes, Jos´e A. Malpica, and Francisco J. Gonz´ alez-Matesanz Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1197
Detection of a Large Number of Overlapping Ellipses Immersed in Noise

Armando Manuel Fernandes

CENTRIA – Centre for Artificial Intelligence, Universidade Nova Lisboa, Quinta da Torre, 2829-516 Caparica, Portugal
[email protected]
Abstract. A new algorithm able to efficiently detect a large number of overlapping ellipses with a reduced number of false positives is described. The algorithm estimates the number of candidate ellipse centers in an image with the help of a 2-dimensional accumulator and determines the five ellipse parameters with an ellipse fitting algorithm. The proposed ellipse detection algorithm uses a heuristic to select, among all image points, those with greater probabilities of belonging to an ellipse. This leads to an increase in classification efficiency, even in the presence of noise. Testing has shown that the proposed algorithm detected 97.4% of the ellipses in 100 images. Each image contained ten overlapping ellipses surrounded by noise. The ellipse parameters were determined with great accuracy.
1 Introduction Ellipse detection is relevant for several applications, such as cell identification [1], detection of traffic signals for automatic driving [2], face detection [3], crater detection [4], water eddy detection [5], and bubble analysis [6]. Various ellipse detection algorithms have been reported in the scientific literature, such as the Fast Hough Transform [7], the Probabilistic Hough Transform [8, 9], and the Randomized Hough Transform [10]. In the last few years, the Randomized Hough Transform has been quite popular; however, the algorithm has a low detection accuracy when the images contain more than five ellipses because the points for estimating ellipse parameters are randomly collected [10]. Zhang and Liu [3] reported that the detection percentages were considerably increased by employing a heuristic in the selection of the four points that are used for the determination of the coordinates of the ellipse center. The heuristic uses the convexity of each edge point of a binary image, which is a vector indicating the concave side and the inclination of the edge curves at the edge point. The inclination is measured with the tangent line to the edge curve at each edge point. The algorithm we propose also employs convexities to select points. However, in Zhang and Liu [3], all ellipse parameters were determined with accumulators, while in the present research, only one accumulator is used to estimate the number of ellipse centers present in the image. Consequently, problems caused by the use of accumulators are minimized. Those problems include the existence of spurious, blurry peaks [11] and the use of large amounts of memory that increase with the precision
with which parameters are determined. To reduce the amount of memory used by the accumulators, it is possible to decompose the parameter space [3, 12], but this introduces new problems such as errors in the calculation of values for the accumulators due to an inaccurate determination of slopes from the image curves. This inaccuracy is provoked by the discrete nature of the images. To minimize the issues of inaccurate slope determination, Zhang and Liu [3] employed a more complex point selection heuristic as compared to that reported in the present work. The algorithm proposed here does not require the values of edge curve slopes for the ellipse parameter estimation. The algorithm presented in this paper combines the best qualities of the Hough transform and ellipse fitting, which are the ability to detect multiple ellipses in one image and accuracy in the parameter determination, respectively. This is due to the determination of five ellipse parameters by fitting an ellipse to points used in the calculation of each ellipse center present in the only accumulator used by the algorithm. In this way, the algorithm transforms the problem of finding several ellipses in one image into several problems of fitting one ellipse to a set of points containing outliers. However, ellipse fitting algorithms are unable to handle outliers, so a method was developed to iteratively discard those outliers. The use of iterative methods to discard outliers was already reported by Qiao and Ong [13], but in their case the points were grouped into arcs and the method of finding outliers was different from that employed in the current research. In the work of Zhang and Liu [3], some ellipse parameters were determined from the accumulation of values calculated with equations whose variables are other ellipse parameters. Therefore, error propagation occurs such that parameters determined initially have less error than those determined later. This issue is important in the referenced work [3] due to a nonlinear relationship between the parameters, but it is irrelevant when ellipse fitting is employed. To test the proposed algorithm, we generated images following a defined set of parameters. In this way, the generated images present the same level of difficulty for the automatic detection of ellipses.
2 Algorithm for Ellipse Detection The complete procedure to detect ellipses in images is described in Fig. 1. The concepts required to understand the algorithm are described below. 2.1 Convexity Determination A convexity category must be provided to each edge point of the binary image to simplify the search for ellipse points [3]. The convexity category is attributed depending on a convexity vector. This vector is perpendicular to the line that is tangent to the curve obtained with a second order fit of the edge point and its neighbors, as depicted in Fig. 2a. The vector points toward the concave side of the curve. There are eight convexity categories, and their corresponding vectors are illustrated in Fig. 2b.
[Figure 1 shows a flowchart of the complete ellipse detection procedure: edge points are extracted from the binary image and assigned convexity categories; groups of mutually admissible points of size PointsPerGroup are collected; each group yields a preliminary ellipse center that is accumulated in a 2D histogram; after Iter iterations, candidate ellipse centers are extracted, random and iterative ellipse fitting are applied to the points behind each candidate center, the detected ellipses are selected from the candidates, and repeated ellipses are eliminated. The right-hand side of the flowchart details how the points admissible to a point P are found from distance limits (MinDist, MaxDist) and relative convexity categories.]

Fig. 1. Complete procedure for ellipse detection. Points admissible to each other have high probabilities of belonging to the same ellipse. The steps on the right allow finding the points admissible to a point P.
The convexity category for each edge point is that whose representative vector has the smallest difference relative to the convexity vector. The convexity categories are counted 12345678123. For example, convexity category “N-3” means three convexity categories “before” N, and is equal to seven when N is two. The total number of neighbors used in the convexity determination is ConvNei, which is double the number of neighbors located on each side of the main point. To calculate the convexity vectors, all binary image points with more than two neighbors are discarded. This is necessary so that each point belongs to only one curve, which is fitted with a second order curve. Points in a straight line are also discarded because in this case, there is no concave side. The isolated points are discarded because they do not have neighbors, which makes the algorithm robust to speckle noise. The groups of points with good probabilities of belonging to an ellipse are those whose convexity category vectors are aimed inside an ellipse fitting the points, as shown in Fig. 2b. Admissible points, as defined in Fig. 1, have relative convexity categories and positions similar to those in Fig. 2b. To find a point admissible to all points in PointsChosen, we tested up to 100 points in the same iteration Iter, before emptying PointsChosen.
Fig. 2. a) Geometry for determining the convexity vector. b) Vectors representing convexity categories and their relative positions in an ellipse.
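To make the convexity assignment concrete, the following sketch shows one way it could be computed; it assumes a parametric second-order fit of the neighborhood (the text does not fix the exact fitting variant) and an illustrative mapping of normal-vector angles to the eight categories, which need not match the numbering of Fig. 2b.

```python
import numpy as np

def convexity_category(neighbors, num_categories=8):
    """Assign a convexity category to the central edge point of `neighbors`.

    `neighbors` is an (n, 2) array of (x, y) edge coordinates centred on the
    point of interest (ConvNei/2 neighbours on each side).  A second-order
    curve is fitted, the convexity vector is taken perpendicular to the
    tangent and oriented towards the concave side, and its angle is
    quantised into `num_categories` bins.
    """
    x, y = neighbors[:, 0], neighbors[:, 1]
    # Parameterise by the index along the curve to handle vertical curves.
    t = np.arange(len(x)) - len(x) // 2
    cx = np.polyfit(t, x, 2)          # x(t) = cx[0] t^2 + cx[1] t + cx[2]
    cy = np.polyfit(t, y, 2)
    # Tangent and second derivative at the central point (t = 0).
    tangent = np.array([cx[1], cy[1]])
    second = 2.0 * np.array([cx[0], cy[0]])
    # Normal to the tangent, oriented towards the concave side, i.e. with a
    # positive projection on the second derivative of the fitted curve.
    normal = np.array([-tangent[1], tangent[0]])
    if np.dot(normal, second) < 0:
        normal = -normal
    angle = np.arctan2(normal[1], normal[0]) % (2 * np.pi)
    bin_width = 2 * np.pi / num_categories
    return int(np.floor((angle + bin_width / 2) / bin_width)) % num_categories + 1
```

What matters for the admissibility test is not the absolute numbering but that points on the same ellipse receive categories consistent with their relative positions, as in Fig. 2b.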
2.2 Ellipse Center Determination Ellipse centers are determined by applying the method described by Yuen et al. [14] and Aguado [12] to four points of PointsChosen. This method is described next, and is shown in Fig. 3. Consider two points, P1 and P2, on an ellipse such that the line joining P1 and P2 does not pass over the ellipse center and the tangents to the ellipse at points P1 and P2 cross at a point called X1. The line L1 passing over X1 and over the midpoint of the line connecting P1 to P2 passes over the ellipse center. Now, consider two other points, P3 and P4, which were chosen in the same manner as P1 and P2, and the line L2, determined for P3 and P4 in the same manner as L1 was for P1 and P2. The position where L1 and L2 cross is the center of the ellipse. However, due to image noise, in many cases, points P1, P2, P3, and P4 do not belong to ellipses, and as a result we can only consider the calculated center points as preliminary ellipse centers.
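A minimal sketch of this construction is given below; the tangent directions at the four points are assumed to come from the second-order fits of Sect. 2.1, and the helper names are illustrative. Configurations with parallel tangents are degenerate and would have to be skipped.

```python
import numpy as np

def line_intersection(p, d1, q, d2):
    """Intersection of the 2-D lines p + s*d1 and q + t*d2.
    Raises numpy.linalg.LinAlgError when the lines are parallel."""
    A = np.column_stack([d1, -d2])
    s, _ = np.linalg.solve(A, q - p)
    return p + s * d1

def preliminary_center(P1, t1, P2, t2, P3, t3, P4, t4):
    """Preliminary ellipse centre from four edge points and their tangent
    directions, following the construction of Yuen et al. [14]."""
    X1 = line_intersection(P1, t1, P2, t2)   # tangents at P1 and P2 cross here
    M12 = 0.5 * (P1 + P2)
    X2 = line_intersection(P3, t3, P4, t4)   # tangents at P3 and P4 cross here
    M34 = 0.5 * (P3 + P4)
    # L1 passes through X1 and the midpoint of P1P2, L2 through X2 and the
    # midpoint of P3P4; their crossing point is the preliminary centre.
    return line_intersection(X1, M12 - X1, X2, M34 - X2)
```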
2.3 Finding Candidate Ellipse Centers A histogram of the positions of the preliminary ellipse centers determined from the groups PointsChosen containing a number of points PointsPerGroup was created. This histogram is called an accumulator in the ellipse detection literature. For each histogram bin with a size of one by one pixels, the number of preliminary centers in the neighboring bins was summed. The neighborhood corresponds to a square of size MaxNei x MaxNei pixels. The sum can be viewed as a function of the image coordinates, and that function has local peaks because the image edge points that do not belong to ellipses generate preliminary ellipse centers that are dispersed over the images, while the edge points from the ellipses result in preliminary ellipse centers that tend to agglomerate in regions a few tens of pixels in diameter. Consequently, the position of the function peaks is, with large probability, the location of the center of a real ellipse from the image. Then, the second largest peak is chosen, and the method continues in this manner. The selected peaks are called candidate ellipse centers (Fig. 4). To detect all ellipses in the image, a maximum number of peaks, HistMax, was selected. To avoid choosing spatially close peaks that may belong to the same ellipse, the peaks are selected sequentially starting with the one with the highest values and setting to zero the bins in a radius of MaxEli pixels around the selected peak.
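The peak selection from the 2D accumulator can be sketched as follows; SciPy is assumed to be available for the neighborhood sum, and the parameter values shown are placeholders rather than the ones of Table 2.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def candidate_centers(prelim_centers, im_size, max_nei=10, max_eli=100, hist_max=15):
    """Select candidate ellipse centres from the accumulator.

    `prelim_centers` is an (n, 2) array of preliminary centre coordinates.
    The accumulator uses 1x1-pixel bins; counts are summed over a
    max_nei x max_nei neighbourhood and peaks are picked greedily, zeroing
    a disc of radius max_eli pixels around each selected peak.
    """
    acc, _, _ = np.histogram2d(prelim_centers[:, 0], prelim_centers[:, 1],
                               bins=im_size, range=[[0, im_size], [0, im_size]])
    # uniform_filter computes a neighbourhood mean; rescale it to a sum.
    score = uniform_filter(acc, size=max_nei) * max_nei ** 2
    xs, ys = np.meshgrid(np.arange(im_size), np.arange(im_size), indexing="ij")
    peaks = []
    for _ in range(hist_max):
        i, j = np.unravel_index(np.argmax(score), score.shape)
        if score[i, j] == 0:
            break
        peaks.append((i, j))
        # Suppress all bins within max_eli pixels of the chosen peak.
        score[(xs - i) ** 2 + (ys - j) ** 2 <= max_eli ** 2] = 0
    return peaks
```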
Fig. 3. Geometry of ellipse center detection

Fig. 4. The dark regions are candidate ellipse centers. Image coordinates are in pixels.
2.4 Random Ellipse Fitting The image points from the sets PointsChosen that were used to calculate the preliminary ellipse centers that originate each candidate ellipse center are then used to calculate a candidate ellipse. These image points will be called candidate ellipse points. The candidate ellipse points are those used to calculate the preliminary ellipse centers in a neighborhood of 2MaxNei x 2MaxNei pixels around the candidate ellipse centre. The typical spatial distribution of the candidate ellipse points is shown in Fig. 5a, which plots the number of times each point was employed in the determination of a preliminary ellipse center. The points belonging to the real ellipse, whose parameters are being determined, were employed more frequently than the majority of the other points. Many candidate ellipse points do not belong to the real ellipse and are distant from the points that actually belong to the ellipse. Therefore, a first estimate of the parameters of the real ellipse is obtained by fitting an ellipse to a
Fig. 5. a) Histogram of the number of times the candidate ellipse points were chosen for determination of preliminary ellipse centers. b) The ellipse determined with random ellipse fitting is shown in black. c) and d) Two iterations of the iterative ellipse fitting. In b), c), and d), the points to fit are in gray. Image coordinates are in pixels.
percentage RanPer of the candidate ellipse points. These points are chosen randomly. The ellipse is calculated by least mean square fit using the algorithm from Halir and Flusser [15]. After executing the random point selection Repeat times, the ellipse with the largest percentage of points covered by the image points is chosen. An ellipse point is said to be covered by an image point if the image point is within a certain neighborhood of the ellipse point. The neighborhood for the random ellipse fitting has a radius of Alfa2 pixels. To find the covered points of an ellipse, one must determine the coordinates of its points as if we wanted to draw the ellipse in the image. Each ellipse point has a size of one pixel. The ellipse fitted to the points from Fig. 5a is shown in Fig. 5b in black. The obtained ellipse is the candidate ellipse, but its parameters require further adjustment. 2.5 Iterative Ellipse Fitting The parameters of the candidate ellipse obtained with random ellipse fitting do not accurately describe the real ellipse represented by the candidate ellipse points, excluding the outliers. To solve this problem, an iterative procedure is conducted. The initial ellipse for the iterative process is the one obtained by the random ellipse fitting, and is shown in Fig. 5b. The iterative procedure begins by calculating the sum of the distance of each candidate ellipse point to the two foci of the initial ellipse. This sum is called Σ and is determined for all candidate ellipse points. The candidate ellipse points whose Σ is larger than the average plus the standard deviation of Σ are considered to be outliers and are discarded. A new ellipse is then fitted to the new set
of candidate ellipse points using the algorithm from Halir and Flusser [15]. This process is iterated Repeat2 times, such that at each iteration, the set of candidate points used for ellipse fitting contains a larger percentage of points belonging to the real ellipse relative to the previous iteration. This improves the ellipse parameter estimate obtained after each iteration. The iterative process must be stopped before all points of the real ellipse are discarded. Fig. 5b to Fig. 5d show two iterations with the continuous elimination of outliers. The parameters of the ellipse obtained after the iterations are similar to those of the real ellipse. 2.6 Selection of Ellipses Detected from Candidate Ellipses One candidate ellipse was determined for each candidate ellipse center. Only candidate ellipses whose percentage of points covered by the image points is larger than DecPerc were selected. The parameter DecPerc must have a value close to the minimum percentage of an ellipse perimeter that is visible in the image. The coverage of each point of the candidate ellipses is determined using a circular neighborhood with a radius of Alfa pixels. At this point, the selected ellipses are repeated. This occurs due to spurious peaks resulting from the sum of bins from the histogram with preliminary ellipse center positions. 2.7 Elimination of Repeated Ellipses Frequently, two candidate ellipses are coincident, meaning that they have similar parameters. To eliminate the coincident ellipses, all candidate ellipses were checked in a two by two format. If a candidate ellipse has a percentage of its points larger than elOverlap that are covered by the points of another ellipse, then the ellipse can be discarded. The coverage is calculated using a neighborhood of Tolerance pixels.
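Returning to the iterative stage of Sect. 2.5, its outlier rejection can be sketched as follows. Here `fit_ellipse` stands for a direct least-squares fitter in the style of Halir and Flusser [15] returning centre, semi-axes and tilt; its implementation is assumed rather than shown, and the stopping constants are illustrative.

```python
import numpy as np

def iterative_refit(points, fit_ellipse, repeat2=5):
    """Iteratively re-fit an ellipse, discarding focal-distance outliers."""
    for _ in range(repeat2):
        cx, cy, a, b, theta = fit_ellipse(points)
        # Foci of the current ellipse estimate.
        c = np.sqrt(max(a * a - b * b, 0.0))
        f1 = np.array([cx + c * np.cos(theta), cy + c * np.sin(theta)])
        f2 = np.array([cx - c * np.cos(theta), cy - c * np.sin(theta)])
        # Sum of distances of every candidate point to the two foci (Sigma).
        sigma = (np.linalg.norm(points - f1, axis=1) +
                 np.linalg.norm(points - f2, axis=1))
        keep = sigma <= sigma.mean() + sigma.std()
        if keep.all() or keep.sum() < 5:   # at least 5 points for a conic fit
            break
        points = points[keep]
    return fit_ellipse(points)
```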
3 Experimental Results The algorithm was tested using images generated randomly with certain defined parameters. This guarantees that the generated images have comparable complexities. The parameters required to generate the ellipses are: ElNum, the number of ellipses in the image; MinSize and MaxSize, the minimum and maximum sizes of the ellipse axes, respectively, measured in number of pixels; ImSize, the size of the side of the square where the ellipse centers are placed, measured in number of pixels; and Overlap, the maximum ellipse overlap. The distance between the ellipse centers is larger than the product of the coefficient Overlap and the sum of the semi-sizes of the largest axes of two ellipses belonging to the same image. The generated ellipses are formed by segments; therefore, the following parameters are also necessary: NumSegMin and NumSegMax, the minimum and maximum number of segments, respectively; VisibleElPer, the percentage of visible ellipse perimeter, measured in pixels; and Tol, the tolerance of VisibleElPer. The actual number of segments is a random value between NumSegMin and NumSegMax, inclusively. The segments are alternated so that one is visible and the next is invisible. Finally, noise composed of ellipse arcs may be added to the images. The parameters for adding noise are: MinSize and MaxSize, which are the same as previously, but now for ellipses where one arc is
extracted; NoiseNum, the number of arcs added to the image; and MaxAngle, the maximum angle of the arcs. The generated images contain ten ellipses formed by separated arcs immersed in a high level of noise. The ellipses overlap. A total of 100 images were generated. The parameters used to generate the images are shown in Table 1. The ellipse detection algorithm was repeated ten times for each image. Each repetition is called a “run”. The calculations to identify the detection efficiency of the proposed algorithm were performed using the parameters from Table 2. The candidate ellipses were considered to be true positives when a percentage of their points elOverlap was covered by the points of the real ellipse used to create the images. The neighborhood for the coverage analysis was Tolerance. Of a total of 10000 overlapping ellipses surrounded by 80000 ellipse arcs corresponding to the noise, 97.4% of the ellipses were detected with only 46 false positives. Fig. 6a shows that for 44 of the images, all ten ellipses were detected in the ten runs. For 20 of the images fewer than 9.5 ellipses were detected per run. Nine or fewer ellipses were detected per run in only six images. Fig. 6b shows that for more than 80 of the images generated there were no false positives in the ten runs. The false positives exist when the shape of the region of overlap of two real ellipses resembles an ellipse.
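For illustration, both the visible ellipse segments and the noise arcs described above can be rasterised from the parametric form of a rotated ellipse. The helper below is a sketch of that sampling step only, not of the full image generator.

```python
import numpy as np

def ellipse_arc(cx, cy, a, b, theta, t0, t1, n=200):
    """Sample n points on the arc of a rotated ellipse between the parameter
    angles t0 and t1 (centre (cx, cy), semi-axes a and b, tilt theta).
    Rounding the result gives the pixels to switch on in the binary image."""
    t = np.linspace(t0, t1, n)
    x = cx + a * np.cos(t) * np.cos(theta) - b * np.sin(t) * np.sin(theta)
    y = cy + a * np.cos(t) * np.sin(theta) + b * np.sin(t) * np.cos(theta)
    return np.column_stack([np.round(x), np.round(y)]).astype(int)
```

An ellipse with a given number of segments alternates visible and invisible arcs over [0, 2π), while a noise arc uses a random sub-interval of length at most MaxAngle.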
Table 1. Parameters used to generate images for tests

ElNum  MinSize   MaxSize   ImSize    Overlap  NumSegMin  NumSegMax  VisibleElPer  Tol  NoiseNum  MaxAngle
       (pixels)  (pixels)  (pixels)                                 (%)           (%)            (rad)
10     100       200       300       0.5      10         15         90            5    80        0.8
DecPerc (%)
Alfa (Pixels)
elOverlap (%)
Tolerance (Pixels)
Alfa2 (Pixels)
Repeat
RanPer (%)
10 250 20 10 20 30 50 10 15
MaxEli (Pixels)
5
Repeat2
HistMax (Pixels)
MinDist (Pixels)
4
MaxNei (Pixels)
Iter (x105)
ConvNei (Pixels)
PointsPerGroup
MaxDist (Pixels)
Table 2. Parameters used in the tests
5
80
2
90
5
A typical result of the ellipse detection is depicted in Fig. 7. One can see a good match between the ellipses and the points. The means of the absolute values of the differences between the values of the ellipse parameters for the true positives and the real ellipses are 1.0±0.1 pixels and 0.7±0.1 pixels for the xx and yy ellipse center coordinates, respectively, 0.1±0.1 pixels and 0.1±0.1 pixels for the ellipse semi-major and semi-minor axes lengths, respectively, and 0.0°±0.2° for the tilt angle of the
major axis. These values were calculated for 9738 ellipses. The tests were done using only one of the processors of a Pentium Duo at 3.4 GHz with 1 GB of RAM. The image processing time for each one of the 100 images was 77±13 seconds. However, the current software implementation is not optimized to perform fast calculations.
Fig. 6. Histograms for 100 images with 10 ellipses each versus averages per run for 10 runs. a) Average number of ellipses detected, and b) average number of false positives.
Fig. 7. On the left are shown two original images used to test the detection algorithm. On the right, the detected ellipses are plotted in black over the points, shown in gray. Image coordinates are in pixels.
4 Conclusions A new ellipse detection algorithm has been proposed. The algorithm uses one 2D accumulator to find candidate ellipse centers and an iterative process to fit ellipses to the points used to find those centers. A small number of iterations is required to determine the ellipse parameters. The five parameters of the detected ellipses were
provided by an ellipse fitting algorithm, so the typical problems associated with the use of accumulators for ellipse parameter determination were eliminated. Extensive testing showed that the algorithm can achieve detection percentages above 97% with a reduced number of false positives. The proposed procedure is robust against significant amounts of noise, ellipse overlap, and ellipse incompleteness. The ellipse center location and sizes of the semi-axes were determined with errors of less than 1.5 pixels, on average, while the tilt of the ellipses was determined with an error of less than 0.5°. A set of parameter values that provided good detection percentages with a small number of false positives for the large number of images tested was easily found. In future work, the performance of the proposed algorithm will be assessed on real images and compared to that of other algorithms.
References

1. Kharma, N., Moghnieh, H., Yao, J., Guo, Y.P., Abu-Baker, A., Laganiere, J., Rouleau, G., Cheriet, M.: Automatic segmentation of cells from microscopic imagery using ellipse detection. IET Image Process. 1, 39–47 (2007)
2. Kim, E., Haseyama, M., Kitajima, H.: Fast and Robust Ellipse Extraction from Complicated Images (2002)
3. Zhang, S.C., Liu, Z.Q.: A robust, real-time ellipse detector. Pattern Recogn. 38, 273–287 (2005)
4. Leroy, B., Medioni, G., Johnson, E., Matthies, L.: Crater detection for autonomous landing on asteroids. Image Vision Comput. 19, 787–792 (2001)
5. Fernandes, A., Nascimento, S.: Automatic water eddy detection in SST maps using random ellipse fitting and vectorial fields for image segmentation. In: Todorovski, L., Lavrač, N., Jantke, K.P. (eds.) DS 2006. LNCS (LNAI), vol. 4265, pp. 77–88. Springer, Heidelberg (2006)
6. Honkanen, M., Saarenrinne, P., Stoor, T., Niinimaki, J.: Recognition of highly overlapping ellipse-like bubble images. Meas. Sci. Technol. 16, 1760–1770 (2005)
7. Li, H.W., Lavin, M.A., Lemaster, R.J.: Fast Hough Transform - a Hierarchical Approach. Comput. Vision Graph. 36, 139–161 (1986)
8. Kiryati, N., Eldar, Y., Bruckstein, A.M.: A Probabilistic Hough Transform. Pattern Recogn. 24, 303–316 (1991)
9. Bergen, J.R., Shvaytser, H.: A Probabilistic Algorithm for Computing Hough Transforms. J. Algorithm 12, 639–656 (1991)
10. McLaughlin, R.A.: Randomized Hough Transform: Improved ellipse detection with comparison. Pattern Recogn. Lett. 19, 299–305 (1998)
11. Tsuji, S., Matsumoto, F.: Detection of Ellipses by a Modified Hough Transformation. IEEE T. Comput. 27, 777–781 (1978)
12. Aguado, A.S., Montiel, M.E., Nixon, M.S.: On using directional information for parameter space decomposition in ellipse detection. Pattern Recogn. 29, 369–381 (1996)
13. Qiao, Y., Ong, S.H.: Arc-based evaluation and detection of ellipses. Pattern Recogn. 40, 1990–2003 (2007)
14. Yuen, H.K., Illingworth, J., Kittler, J.: Detecting Partially Occluded Ellipses Using the Hough Transform. Image Vision Comput. 7, 31–37 (1989)
15. Halir, R., Flusser, J.: Numerically Stable Direct Least Squares Fitting of Ellipses. In: Skala, V. (ed.) Proc. Int. Conf. in Central Europe on Computer Graphics, Visualization and Interactive Digital Media, pp. 125–132 (1998)
Recognizing Ancient Coins Based on Local Features

Martin Kampel and Maia Zaharieva

Pattern Recognition and Image Processing Group – TU Vienna, Favoritenstr. 9-11, 1040 Vienna, Austria
[email protected]
Abstract. Numismatics deals with various historical aspects of the phenomenon money. Fundamental part of a numismatists work is the identification and classification of coins according to standard reference books. The recognition of ancient coins is a highly complex task that requires years of experience in the entire field of numismatics. To date, no optical recognition system for ancient coins has been investigated successfully. In this paper, we present an extension and combination of local image descriptors relevant for ancient coin recognition. Interest points are detected and their appearance is described by local descriptors. Coin recognition is based on the selection of similar images based on feature matching. Experiments are presented for a database containing ancient coin images demonstrating the feasibility of our approach.
1 Introduction
Numismatics is at a point where it can benefit greatly from the application of computer vision methods, and in turn provides a large number of new, challenging and interesting conceptual problems and data for computer vision. For coin recognition we distinguish between two approaches: coin identification and coin classification. A coin classification process assigns a coin to a predefined category or type, whereas a coin identification process assigns a unique identifier to a specific coin. What makes this application special and challenging for object recognition is that all the coins are very similar. The first coins were struck in Asia Minor in the late 7th century BC. Since then, coins have been a mass product [1]. In antiquity, coins were hammer-struck from manually engraved coin dies. Coins from the same production batch will have very much the same picture and also the same quality of their relief. Depending on the series of coins in question, the only varying details can be either a part of the picture or of the legend, or there can be a difference in a prominent detail such as the face of a figure. The scientific requirement is to assign a coin its correct number according to a reference book.
This work was partly supported by the European Union under grant FP6-SSP5044450. However, this paper reflects only the authors’ views and the European Community is not liable for any use that may be made of the information contained herein.
Fig. 1. Different coins of the same coin type
Fig. 2. Different image representations of the same coin
Ancient and modern coins bear fundamental differences that restrict the applicability of existing algorithms [14]. Due to their nature, ancient coins provide a set of identifying features. The unique shape of each coin originates in the manufacturing process (hammering procedure, specific mint marks, coin breakages, die deterioration, etc.). Furthermore, time leaves its individual mark on each coin (fractures, excessive abrasion, damage, corrosion, etc.). As a result, identification of ancient coins turns out to be "easier" than classification. For example, Figure 1 shows ten different coins of the same coin type. A classification algorithm should ideally assign them all to the same class. However, they all exhibit completely different characteristics (see shape, die position, mint marks, or level of detail). At the same time, exactly those features enable the identification process. In contrast, Figure 2 presents five pictures of one and the same coin. The pictures were taken using different acquisition setups, i.e. a scanner as well as fixed and free-hand cameras with varying lighting conditions. The figure points out the challenges for an automated identification process as well as the importance of quality images for the process itself. Different lighting conditions can hide or show details on the coin that are significant for a successful identification process (e.g. compare the first and the third image in Figure 2). The remainder of this paper is organised as follows: In Section 2 related work on recognizing coins is presented. Section 3 gives an overview of local features with respect to our needs. The coin recognition workflow is described in Section 4. The experiments performed and their results are presented in Section 5. We conclude the paper in Section 6 with a discussion of the results achieved and an outlook on further research.
2 Related Work
Research on pattern recognition algorithms for the identification of coins started in 1991 when Fukumi et al. [2] published their work on rotation-invariant visual coin recognition using a neural network approach. Also [3] is devoted to neural network design, but investigates the possibilities of simulated annealing and genetic algorithms. In 1996 Davidson [4] developed an instance-based learning algorithm based on an algorithm using decision trees [5]. An industrial implementation of a neural network approach is described in [6]. A more recent neural algorithm was published in [7]. This approach employs the output of a filter bank of Gabor filters fed into a back-propagation network; correlation in polar space is used in combination with the neural network. Khashman et al. implemented a neural coin recognition system for use in slot machines [8]. Huber et al. present in [9] a multistage classifier based on eigenspaces that is able to discriminate between hundreds of coin classes. The Dagobert coin recognition system presented by Nölle et al. [10] aims at the fast classification of a large number of modern coins from more than 30 different currencies. In their system coin classification is accomplished by correlating the edge image of the coin with a preselected subset of master coins and finding the master coin with the lowest distance. In [11] Maaten et al. present a coin classification system based on edge-based statistical features. It was developed for the MUSCLE CIS Coin Competition 2006 [12], focusing on reliability and speed. The coin classification method proposed by Reisert et al. [13] is based on gradient information. Similar to the work of Nölle et al. [10], coins are classified by registering and comparing the coin with a preselected subset of all reference coins. Current research approaches for coin recognition algorithms have mainly two limitations. On the one hand, the input digital image is well defined – there is always only one coin pictured and the image is taken under very controlled conditions (such as background, illumination, etc.). On the other hand, the algorithms focus mainly on the recognition of modern coins. Those assumptions facilitate the classification and identification process substantially. In the case of controlled conditions and the well-known circular shape of modern coins, the process of coin detection and segmentation becomes an easier task. The almost arbitrary shape of an ancient coin narrows the range of appropriate segmentation algorithms. Tests performed on image collections of both medieval and modern coins show that algorithms performing well on modern coins do not necessarily meet the requirements for classification of medieval ones [14]. The features that most influence the quality of the recognition process are as yet unexplored.
3 Local Image Features
Local features describe image regions around given interest points. Their applications in computer vision are manifold, ranging from object and texture recognition [15] to robot localization [16], symmetry detection [17] and wide
baseline stereo matching [18]. Local features have already been used successfully for object classification. Both the detection of interest points and their representation have a crucial influence on local-feature-based object recognition. Hence, in the following we give a short overview of top-performing interest point detectors and local feature descriptors and discuss their applicability with respect to the identification of ancient coins. 3.1 Interest Point Detectors
A broad number of interest point detectors exist in the literature, with varying levels of invariance to rotation, scale, or affine changes. Comparative studies on interest points and their performance evaluation can be found in [19,20]. The Harris corner detector [21] is based on the local auto-correlation matrix of the image function. The squared first derivatives are averaged over a 41 × 41 Gaussian-weighted window around an image point. If the auto-correlation matrix has two significant eigenvalues, an interest point is detected. However, the detected points are not invariant to scale and affine changes. To achieve scale invariance, Mikolajczyk et al. [22] extend the Harris detector by selecting corners at locations where the Laplacian attains an extremum in scale space (Harris-Laplace). The Harris-Affine detector [22,19] additionally uses the second moment matrix to achieve affine invariance. Detected points are stable under varying lighting conditions since significant signal change in orthogonal directions is captured. Hessian-Laplace localizes points at local maxima of the Hessian determinant in the image plane and at scale-space maxima of the Laplacian-of-Gaussian [15,19]. Detected keypoints are invariant to scale and rotation transformations. Similar to Harris-Affine, the Hessian-Affine detector provides affine invariance in a subsequent step based on the second moment matrix [19]. In contrast to the Harris-based detectors, Hessian interest points indicate the presence of blob-like structures. Bay et al. [23] recently introduced a further detector based on the Hessian matrix – the Fast-Hessian detector. It approximates the Gaussian second-order derivatives with box filters. To further reduce the computational time, the image convolutions use integral images. Tuytelaars et al. present in [18] two further methods to extract affine-invariant regions. The Geometry-based region detector starts from Harris corners and uses the nearby edges identified by the Canny edge operator [24] to build a parallelogram. Keypoints are detected if the parallelogram goes through an extremum of intensity-based functions. The second method proposed – the Intensity-based region detector – relies solely on the analysis of image intensity. It localizes interest points based on an intensity function evaluated along rays originating from local extrema in intensity. The Maximally Stable Extremal Regions (MSER) detector proposed by Matas et al. [25] is a watershed-based algorithm. It detects intensity regions below and above a certain threshold and selects those which remain stable over a set of thresholds. The Difference-of-Gaussian (DoG) detector was introduced by Lowe as the keypoint localization method for the Scale Invariant Feature Transform (SIFT) approach [26,15]. Interest points are identified at peaks (local maxima and minima) of the Difference-of-Gaussian function applied in scale space. All keypoints with low contrast or keypoints that are localized at edges are eliminated using a Laplacian function.
Table 1. Average number of interest points detected per detector

  Detector                                         Interest points
  Difference-of-Gaussian (DoG) [26]                968
  Harris-Laplace [15]                              204
  Harris-Affine [19]                               198
  Hessian-Laplace [19]                             1076
  Hessian-Affine [19]                              778
  Fast-Hessian [23]                                198
  Geometry-based region (GBR) [18]                 61
  Intensity-based region (IBR) [18]                184
  Maximally Stable Extremal Regions (MSER) [25]    134
A common criticism of edge-based methods is that they are more sensitive to noise and to changes in the neighboring texture. Interest point detectors that are less sensitive to changes in texture perform well in a classification scenario since they recognize and capture those features that are common to all instances of a given class. In contrast, identification relies on those features that are unique to a given object. Due to their nature and manufacturing process, ancient coins are unique. Coins produced by the same die show the same picture. However, since they are hand-hammered, their shape, texture, and relief can vary to a large degree. In this particular scenario, texture-sensitive interest point detectors are expected to perform better. Table 1 shows the average number of interest points extracted per detector for the dataset explained in Section 5. As we will show in Section 5, the methods that detect the most interest points do not necessarily perform best. First, we are faced with the problem of overfitting (i.e., each coin is similar to all other coins to some degree). Second, the information captured per interest point plays an essential role. Thus, in the next subsection we give a short overview of the local feature descriptors we used for the experiments.
3.2 Local Feature Descriptors
Given a set of interest points, the next step is to choose the most appropriate descriptor to capture the characteristics of a provided region. Different descriptors emphasize different image properties such as intensity, edges or texture. Please refer to [27] for a thorough survey on the performance of local feature descriptors. We focus our study on four descriptors which show outstanding performance with respect to changes in illumination, scale, rotation and blur. (1) Lowe [15] introduced the Scale Invariant Feature Transform (SIFT) descriptor which is based on gradient distribution in salient regions – at each feature location, an orientation is selected by determining the peak of the histogram of local image gradient orientations. Subpixel image location, scale and orientation are associated with each SIFT feature vector.
(2) Mikolajczyk and Schmid [27] propose an extension of the SIFT descriptor – the Gradient Location and Orientation Histogram (GLOH) – designed to increase the robustness and distinctiveness of the SIFT descriptor. Instead of dividing the patch around the interest point into a 4 × 4 grid, the authors divide it into a radial and angular grid. A log-polar location grid with 3 bins in the radial and 8 bins in the angular direction is used. The gradient orientations are quantized into 16 bins, which gives a 272-bin histogram that is further reduced to a 128-dimensional feature vector using PCA. (3) Belongie et al. [28] introduce Shape Context as a feature descriptor for shape matching and object recognition. The authors represent the shape of an object by a discrete set of points sampled from its internal or external boundaries. Edge pixels found by an edge detector serve as starting points. Then, for each point, the relative locations of the remaining points are accumulated in a coarse log-polar histogram. (4) Speeded Up Robust Features (SURF) [23] are fast scale- and rotation-invariant features. The descriptor captures the distribution of Haar-wavelet responses within the neighborhood of an interest point. Each feature descriptor has only 64 dimensions, which results in fast computation and comparison. In [27], a complementary evaluation of the performance of local descriptors with respect to rotation, scale, illumination and viewpoint change, image blur, and JPEG compression is presented. In most of the tests, SIFT and GLOH clearly outperformed the remaining descriptors: shape context, steerable filters, PCA-SIFT, differential invariants, spin images, complex filters, and moment invariants. In [29], Stark and Schiele report that the combination of the Hessian-Laplace detector with the SIFT and GLOH descriptors outperforms local features such as Geometric Blur, k-Adjacent Segments, and Shape Context in an object categorization scenario. For their evaluation the authors used three different datasets containing quite distinguishable objects such as cups, forks, hammers, knives, etc. By contrast, our two coin data sets possess very different characteristics in comparison with existing evaluation and application scenarios. Both data sets contain similar objects, and both are targeted at evaluating identification performance.
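To make the descriptor dimensionalities mentioned above concrete, a short sketch extracting 128-dimensional SIFT descriptors at DoG keypoints with OpenCV could look as follows; GLOH and Shape Context are not part of standard OpenCV and are therefore omitted, and the image path is a placeholder.

```python
import cv2

img = cv2.imread("coin.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each row of `descriptors` is one 128-dimensional SIFT feature vector;
# SURF descriptors, by comparison, would have only 64 dimensions.
print(descriptors.shape)
```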
4 Recognition Workflow
We define the workflow for the identification of ancient coins by five well-defined stages, as shown in Figure 3. In the preprocessing step (1), the coins contained in the image are detected and segmented. The image diversity, e.g., single or multiple objects pictured, varying lighting conditions, shadows, diverse background textures, etc., has an essential influence on this step. In the scenario of ancient coin identification, the almost arbitrary shape of a coin additionally impedes coin detection and segmentation. Since our test database consists solely of images of a single coin on a uniform background, no preprocessing is required. Occasionally, the applied local feature detectors locate interest points on the background (e.g., due to intensity changes). However, their number is minimal and has no influence on the identification process.
Fig. 3. The five stages of the coin identification workflow
The goal of the feature extraction step (2) is twofold. First, local feature algorithms are applied to extract local image descriptors for coin identification. Second, features can be extracted that reduce the number of required feature comparisons by narrowing down the coin database. Given an uncontrolled acquisition process, simple features such as the area or perimeter of a coin are not suitable since the scaling factor is unknown. Other features such as shape descriptors can be used as the basis for the preselection step (3) [30]. In step (4), descriptor matching is performed by identifying the first two nearest neighbors in terms of Euclidean distance. A descriptor $D_1$ is accepted only if the distance ratio of the nearest ($1.NN$) to the second nearest ($2.NN$) neighbor is less than or equal to 0.5:

$$2\,d(D_1, D_{1.NN}) \le d(D_1, D_{2.NN}). \qquad (1)$$
In [15], Lowe suggests a distance ratio of 0.8. However, our experiments showed that in the case of lower inter-class differences (as all classes are coins), a lower distance ratio tends to keep more distinctive descriptors while eliminating a large part of the false matches. The value of 0.5 was determined experimentally and used throughout the tests. Furthermore, we apply a restriction rule to tighten the quality of the matches. Since each image in the database pictures a single ancient coin, a given keypoint can only be matched to a single point in a different feature set. Thus, all multiple matches are removed, as they are considered unstable for the identification process. Finally, an additional verification step (5) can assure the final decision. Provided images of both the obverse and reverse side of a coin, each side is first identified separately. If both sides vote for the same coin, the coin is considered identified. Otherwise, it is classified as unknown.
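A minimal sketch of the matching rule in Equation (1), followed by the removal of multiply matched keypoints, might look as follows. It assumes two descriptor arrays (query coin and database coin) have already been extracted, uses brute-force Euclidean nearest neighbors, and the ratio of 0.5 is the experimentally determined value from above.

```python
import numpy as np

def match_descriptors(desc_query, desc_ref, ratio=0.5):
    """Ratio test (Eq. 1): accept D1 only if d(D1, 1.NN) <= ratio * d(D1, 2.NN),
    then discard reference keypoints that were matched more than once."""
    candidates = {}
    for qi, d in enumerate(desc_query):
        dists = np.linalg.norm(desc_ref - d, axis=1)   # Euclidean distances
        first, second = np.argsort(dists)[:2]          # 1.NN and 2.NN
        if dists[first] <= ratio * dists[second]:
            candidates.setdefault(first, []).append(qi)
    # multiple matches to the same reference keypoint are considered unstable
    return [(qis[0], ri) for ri, qis in candidates.items() if len(qis) == 1]
```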
5 Experiments
For our experiments we used a dataset of images acquired at the Fitzwilliam Museum in Cambridge, UK. We used varying technical setups – a scanner as well as fixed and free-hand cameras – and varying lighting conditions. The dataset consists of 350 images of three different coin types (10 to 16 coins per coin type, 3 to 5 pictures per coin side). Ground truth is encoded in the file names. For testing the recognition, one image was selected as the test image. Published evaluations such as [27,29] on the performance of local descriptors use datasets containing quite distinguishable objects such as cups, forks, hammers, knives, etc. By contrast, our coin data set possesses very different characteristics in comparison with existing evaluation and application scenarios. The data set contains similar objects and is targeted at evaluating coin recognition performance. In a first experiment we compare the performance of three descriptors on coin identification. Figure 4 shows corresponding interest points detected by the different approaches. Despite the lower image quality of the input image and the rotation and scale change of the coin, the SIFT approach matches correctly against an image of the same coin acquired with the scanner (see Figure 4(a)). The Fast Approximated SIFT approach – Figure 4(b) – tends to detect keypoints mostly on the background of the image. The algorithm detects far more points than SIFT, e.g., 8999 keypoints for the example input image (by contrast, SIFT detects 721 keypoints for the same image). However, they lack stability and distinctiveness. As a result, each detected interest point is similar (i.e., is matched) to a large number of keypoints in the second image. The elimination of multiply matched points reduces the number of final matches by approximately 90%. Performing a manual pairwise comparison of the resulting matches, PCA-SIFT (see Figure 4(c)) seems to achieve almost the same number of descriptors as SIFT at lower computational cost. However, the stability of the PCA-SIFT features is considerably lower, since approximately 40% of the correctly classified
(a) SIFT
(b) Fast Approximated SIFT
(c) PCA-SIFT
Fig. 4. Example matches for a given ancient coin acquired using a free-hand camera (input image on the left, corresponding match on the right). Using the SIFT approach (a), the test coin was successfully matched against an image of the same coin acquired with the scanner. Fast Approximated SIFT fails to recognize the coin (b). PCA-SIFT (c) matched against a different coin of the same coin type.
Table 2. Evaluation results on the recognition performance of the local image feature descriptors using the small database of ancient coins. CR denotes the rate of correctly classified coins, IR the rate of correctly identified ancient coins.

                             (1) SIFT           (2) GLOH           (3) Shape Context  (4) SURF
  Interest Point Detector    CR       IR        CR       IR        CR       IR        CR       IR
  DoG                        90.57%   84.57%    60.00%   40.00%    61.14%   29.14%    82.57%   28.57%
  Harris-Laplace             68.39%   50.86%    71.84%   53.45%    79.71%   61.45%    71.30%   28.12%
  Harris-Affine              76.15%   55.46%    73.56%   54.31%    73.04%   53.04%    71.88%   27.83%
  Hessian-Laplace            65.90%   47.28%    65.90%   47.28%    92.57%   82.00%    84.29%   32.29%
  Hessian-Affine             71.63%   50.72%    68.48%   49.28%    88.00%   80.00%    79.43%   29.71%
  Fast-Hessian               85.43%   79.43%    85.43%   78.29%    84.86%   72.29%    90.86%   78.29%
  GBR                        51.47%   27.36%    48.53%   24.76%    52.44%   29.64%    56.03%   15.31%
  IBR                        80.29%   60.57%    75.71%   50.57%    80.29%   55.14%    77.43%   25.14%
  MSER                       80.86%   68.29%    77.71%   64.29%    74.86%   58.00%    74.00%   28.29%
Fig. 5. Performance distribution of the interest point detectors
images are due to matching the obverse against the reverse side of a coin. The PCA reduction of the feature vector size seems to lead to a loss of information that is valuable for the identification process. In terms of identification rate, SIFT clearly outperforms both modifications by more than 10%. The second experiment aims at evaluating the performance of the presented interest point detectors and local descriptors with respect to recognition. We compare both the classification rate (CR) and the identification rate (IR) and show that a good classification rate is no guarantee for the distinctiveness and stability of the respective detectors or descriptors. Table 2 summarizes the results on the
coin data set. The best classification rate of 92.57% was achieved with Shape Context combined with the Hessian-Laplace detector. The best identification rate of 84.57% was achieved with SIFT combined with DoG. The main reason for the significant difference between classification and identification rate is the nature of local descriptors. Local descriptors simply describe the close surroundings of a given interest point. Depending on the size of this region, a matching (i.e., sufficiently similar) descriptor can be found on multiple coins, or even on the same coin or on different sides of the same coin. Figure 5 visualizes the performance distribution with respect to the interest point detectors. One can clearly identify four groups. The first one, low identification and low classification rate, is dominated by the GBR detector. Independent of the applied local feature descriptor, the achieved performance is too low, with rates close to or far below 50%. The second group, high classification and low identification rate, is defined by the use of the SURF descriptor. Independent of the applied interest point detector, the SURF descriptor shows high stability with respect to classification. The last conspicuous group, high classification and high identification rate, is dominated by the Fast-Hessian detector.
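The distinction between classification rate (CR) and identification rate (IR) used in Table 2 can be made concrete with a small helper: for each test image we record the coin type and the coin identity of its best match; CR counts matches of the correct type, IR counts matches to the exact same coin. The data layout below is illustrative and not that of the original implementation.

```python
def evaluate(results):
    """results: list of (true_type, true_coin, matched_type, matched_coin) tuples."""
    n = len(results)
    cr = sum(t == mt for t, _, mt, _ in results) / n   # correct coin type (classification)
    ir = sum(c == mc for _, c, _, mc in results) / n   # exact same coin (identification)
    return cr, ir

# Example: the first match hits the right coin type but the wrong coin.
print(evaluate([("typeA", "coin07", "typeA", "coin03"),
                ("typeA", "coin07", "typeA", "coin07")]))   # -> (1.0, 0.5)
```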
6 Conclusion
In this paper, we described a strategy for the recognition of ancient coins based on local image features. The achieved recognition rates indicate the feasibility of the approach. SIFT features show outstanding performance in existing evaluations; however, their main drawback and critical point is their computational time. The benefits of the proposed system lie in the field of coin recognition. Based on the promising results, we plan to extend the evaluation to a recently recorded coin collection of 2400 images of 240 different coins. Future research will include methods from the field of optical character and symbol recognition. Furthermore, we will extend our work towards die and mint sign identification based on spatially constrained local features. Acknowledgment. The authors want to thank Dr. Mark Blackburn and his team at the Coin Department, Fitzwilliam Museum, Cambridge, UK, for sharing their experience and providing the database of images of ancient coins.
References

1. Duncan-Jones, R.: Money and Government in the Roman Empire. Cambridge (1994)
2. Fukumi, M., Omatu, S., Takeda, F., Kosaka, T.: Rotation-invariant neural pattern recognition system with application to coin recognition. IEEE Transactions on Neural Networks 3, 272–279 (1992)
3. Mitsukura, Y., Fukumi, M., Akamatsu, N.: Design and evaluation of neural networks for coin recognition by using GA and SA. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN 2000, vol. 5, pp. 178–183 (2000)
4. Davidsson, P.: Coin classification using a novel technique for learning characteristic decision trees by controlling the degree of generalization. In: Proc. of the 9th Int. Conference on Industrial & Engineering Applications of Artificial Intelligence & Expert Systems (IEA/AIE 1996), pp. 403–412 (1996)
5. Aha, D., Kibler, D., Albert, M.: Instance-based learning algorithms. Machine Learning 6, 37–66 (1991)
6. Moreno, J.M., Madrenas, J., Cabestany, J., Laúna, J.R.: Using classical and evolutive neural models in industrial applications: A case study for an automatic coin classifier. In: Biological and Artificial Computation: From Neuroscience to Neurotechnology, pp. 922–931 (1997)
7. Bremananth, R., Balaji, B., Sankari, M., Chitra, A.: A new approach to coin recognition using neural pattern analysis. In: Proceedings of IEEE INDICON 2005, pp. 366–370 (2005)
8. Khashman, A., Sekeroglu, B., Dimililer, K.: Rotated Coin Recognition Using Neural Networks. Advances in Soft Computing, vol. 41, pp. 290–297. Springer, Berlin (2007)
9. Huber, R., Ramoser, H., Mayer, K., Penz, H., Rubik, M.: Classification of coins using an eigenspace approach. Pattern Recognition Letters 26, 61–75 (2005)
10. Nölle, M., Penz, H., Rubik, M., Mayer, K.J., Holländer, I., Granec, R.: Dagobert – a new coin recognition and sorting system. In: Proc. of the 7th International Conference on Digital Image Computing – Techniques and Applications (DICTA 2003), Macquarie University, Sydney, Australia, pp. 329–338. CSIRO Publishing (2003)
11. van der Maaten, L.J., Poon, P.: Coin-o-matic: A fast system for reliable coin classification. In: Proc. of the Muscle CIS Coin Competition Workshop, Berlin, Germany, pp. 7–18 (2006)
12. Nölle, M., Rubik, M., Hanbury, A.: Results of the Muscle CIS coin competition 2006. In: Proceedings of the Muscle CIS Coin Competition Workshop, Berlin, Germany, pp. 1–5 (2006)
13. Reisert, M., Ronneberger, O., Burkhardt, H.: An efficient gradient based registration technique for coin recognition. In: Proc. of the Muscle CIS Coin Competition Workshop, Berlin, Germany, pp. 19–31 (2006)
14. Zaharieva, M., Kampel, M., Zambanini, S.: Image based recognition of ancient coins. In: Kropatsch, W.G., Kampel, M., Hanbury, A. (eds.) CAIP 2007. LNCS, vol. 4673, pp. 547–554. Springer, Heidelberg (2007)
15. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
16. Murillo, A.C., Guerrero, J.J., Sagüés, C.: SURF features for efficient robot localization with omnidirectional images. In: IEEE International Conference on Robotics and Automation, pp. 3901–3907 (2007)
17. Loy, G., Eklundh, J.O.: Detecting symmetry and symmetric constellations of features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 508–521. Springer, Heidelberg (2006)
18. Tuytelaars, T., Van Gool, L.: Matching widely separated views based on affine invariant regions. International Journal of Computer Vision 59, 61–85 (2004)
19. Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. International Journal of Computer Vision 65, 43–72 (2005)
20. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. International Journal of Computer Vision 2, 151–172 (2000)
21. Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision Conference, pp. 147–152 (1988)
22. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. International Journal of Computer Vision 60, 63–86 (2004)
23. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006)
24. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 679–698 (1986)
25. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: Proceedings of the British Machine Vision Conference, London, vol. 1, pp. 384–393 (2002)
26. Lowe, D.G.: Object recognition from local scale-invariant features. In: International Conference on Computer Vision (ICCV 1999), Washington, DC, USA, vol. 2, pp. 1150–1157. IEEE Computer Society, Los Alamitos (1999)
27. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1615–1630 (2005)
28. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 509–522 (2002)
29. Stark, M., Schiele, B.: How good are local features for classes of geometric objects. In: 11th International Conference on Computer Vision (ICCV 2007), Rio de Janeiro, Brazil (2007)
30. Zaharieva, M., Huber-Mörk, R., Nölle, M., Kampel, M.: On ancient coin classification. In: Arnold, D., Chalmers, A., Niccolucci, F. (eds.) 8th International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST 2007), Eurographics, pp. 55–62 (2007)
Learning Pairwise Dissimilarity Profiles for Appearance Recognition in Visual Surveillance

Zhe Lin and Larry S. Davis
Institute of Advanced Computer Studies, University of Maryland, College Park, MD 20742
{zhelin,lsd}@umiacs.umd.edu
Abstract. Training discriminative classifiers for a large number of classes is a challenging problem due to increased ambiguities between classes. In order to better handle these ambiguities and to improve the scalability of classifiers to a larger number of categories, we learn pairwise dissimilarity profiles (functions of spatial location) between categories and adapt them to nearest neighbor classification. We introduce a dissimilarity distance measure and combine it linearly or nonlinearly with direct distances. We illustrate and demonstrate the approach mainly in the context of appearance-based person recognition.
1 Introduction
Appearance-based person recognition is an important but still challenging problem in vision. In visual surveillance, appearance information is crucial not only for tracking, but also for identifying persons across space, time, and cameras. Pose articulation, viewpoint variation, and illumination change are common factors that affect the performance of recognition systems. More importantly, as the number of persons in the database increases, appearance ambiguities between them become significant, and consequently more discriminative feature selection and classification approaches are needed. (In this paper, we formulate appearance-based person recognition as a multiclass classification problem where each person is considered to be a class and his/her appearances are considered to be class samples.) We aim to explore pairwise class relations in more detail and incorporate them into traditional nearest neighbor classification. Our approach is mainly motivated by the following observations: (1) A small region (or a feature) can be crucial in recognition because it might be the only distinguishing element that discriminates two otherwise very similar appearances. (2) Discriminative features are much easier to train in a pairwise scheme than in a one-against-all scheme. (3) Discriminative features are characteristic for a certain pair of classes and generally differ for different pairs of classes. There are numerous approaches to multiclass learning and recognition. In instance-based learning, nearest neighbor (NN) and k-NN are the most commonly used nonparametric classifiers. In discriminative learning, linear discriminant
analysis (LDA) and support vector machines (SVM) are well-known classification methods that have been successfully applied to many practical problems. There are several schemes for decomposing a multiclass classification problem into a set of binary classification problems, for example one-against-all [1], pairwise coupling [2,3,4], error correcting output codes [5], decision trees [6], etc. In [7], a multiclass classification problem was transformed into a binary classification problem by modeling intra- and extra-class distributions. Recently, multiclass learning approaches have been successfully applied to object category recognition. k-NN and SVM are combined in [8] by performing a multiclass SVM only on a query and its set of neighbors. The random forests and ferns classifier in [9] uses a one-against-all multiclass SVM. Also, a number of approaches have been proposed in the machine learning community for learning locally or globally adaptive distance metrics [10,11,12,13,14] by weighting features based on labeled training data. To simplify the one-against-all discrimination task for a huge number of categories, a triplet-based distance metric learning framework was proposed in [15], which was later extended by learning weights for every prototype in [16] for image retrieval. However, fixed weighting for each prototype can be inefficient for a very large number of classes, especially when the ambiguities between classes are significant. Most previous approaches to multiclass classification have focused on designing good classifiers with better separation margins and good generalization properties, or on learning discriminative distance metrics. Our work is similar to those on learning discriminative distance metrics [13,14,16], but differs in that, instead of learning weights for different features, we explore pairwise dissimilarities and adapt them to classification by calculating distances of a query to prototypes using both intra-class and inter-class invariant information. For example, in the context of person recognition, the intra-class invariance is based on common appearance information in each class, i.e., appearance information that does not change dramatically for each individual as pose, viewpoint, and illumination change. The inter-class invariance is based on the observation that any pair of exemplars E_A and E_B from two different classes A and B shares certain common discriminative properties. For example, in the context of person appearance recognition, if person 1 and person 2 have different jacket colors, then any variants of person 1 and person 2 will probably also have different jacket colors.
2 Appearance Representation and Matching

2.1 Appearance Model
Color and texture are the primary cues for appearance-based object recognition. For human appearances, the most common model is the color histogram [17]. Spatial information can be added by representing appearances in joint colorspatial spaces [18]. Other representations include spatial-temporal appearance modeling [19], spatial and appearance context modeling [20], part-based
appearance modeling [21], etc. We build appearance models of individuals based on nonparametric kernel density estimation [22]. It is well known that a kernel density estimator can converge to any complex-shaped density given sufficient samples. Also, due to its nonparametric property, it is a natural choice for representing the complex color distributions that arise in real images. Given a set of sample pixels, represented by d-dimensional feature vectors $\{s_i = (s_{i1}, \ldots, s_{id})^t\}_{i=1 \ldots N_s}$, from a target appearance $a$, we estimate the probability of a new feature vector $z = (z_1, z_2, \ldots, z_d)^t$ from the same appearance $a$ using multivariate kernel density estimation as

$$\hat{p}^a(z) = \frac{1}{N_s\,\sigma_1 \cdots \sigma_d} \sum_{i=1}^{N_s} \prod_{j=1}^{d} k\!\left(\frac{z_j - s_{ij}}{\sigma_j}\right), \qquad (1)$$
where the same kernel function $k(\cdot)$ is used in each dimension (or channel) with a different bandwidth $\sigma_j$. The kernel bandwidths can be estimated as in [22]. We assume independence between channels and use a Gaussian kernel for each channel. The kernel probability density function (PDF) $\hat{p}^a(z)$ in Equation 1 is referred to as the model of the appearance $a$. As in [18], we extend the color feature space to incorporate spatial information in order to preserve the color structure of appearances. Assuming people are in approximately upright poses, we encode each pixel by a feature vector $(c, h)^t$ in a 4D joint color-height space, $\mathbb{R}^4$, with a 3D color feature vector $c$ and a 1D height feature $h$ (represented by the vertical image coordinate $y$). We use only the $y$ coordinate instead of the 2D spatial coordinates $(x, y)$ in order to handle viewpoint and pose variations while preserving vertical color structures. To deal with illumination changes, we use the following two illumination-insensitive color features. Normalized color feature: 3D normalized rgs color coordinates (r = R/(R+G+B), g = G/(R+G+B), s = (R+G+B)/3) are commonly used as illumination-insensitive features, since the separation of chromaticity from brightness in the rgs space allows the use of a much wider kernel for the s variable to cope with the variability in brightness due to shading effects [23]. Color rank feature: these features encode the relative rank of the intensities of each color channel R, G, and B over all sample pixels, quantized to the interval [1, 100]. Color rank (rR, rG, rB) features ignore the absolute color values and reflect relative color rankings instead. Ranked color features are invariant to monotonic color transforms and are very stable under a wide range of illumination changes [24].
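A minimal NumPy sketch of the product-kernel density estimate of Equation (1) in the 4D color-height space might look as follows; the Gaussian kernel and the per-channel bandwidths follow the description above, while the sample matrix is a placeholder.

```python
import numpy as np

def kde_prob(z, samples, bandwidths):
    """Evaluate the kernel PDF of Eq. (1) at a single d-dimensional feature z.
    samples: (N_s, d) matrix of appearance samples; bandwidths: length-d vector."""
    u = (z - samples) / bandwidths                     # (N_s, d) scaled residuals
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel per channel
    return np.mean(np.prod(k, axis=1)) / np.prod(bandwidths)

# Hypothetical (r, g, s, height) samples with the bandwidths of Table 2.
samples = np.random.rand(500, 4)
bandwidths = np.array([0.02, 0.02, 20.0, 1.0])
print(kde_prob(samples[0], samples, bandwidths))
```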
2.2 Appearance Matching
Appearance models represented by kernel PDFs (Equation 1) can be compared by information-theoretic measures such as the Bhattacharyya distance [17] or the Kullback-Leibler (KL) divergence (or distance) [18] for tracking and matching objects in video.
Suppose two appearances $a$ and $b$ are modeled as kernel PDFs $\hat{p}^a$ and $\hat{p}^b$ in the joint color-height space. Taking $\hat{p}^a$ as the reference model and $\hat{p}^b$ as the test model, the similarity of the two appearances can be measured by the KL distance as follows:

$$D_{KL}(\hat{p}^b \,\|\, \hat{p}^a) = \int \hat{p}^b(z) \log \frac{\hat{p}^b(z)}{\hat{p}^a(z)}\, dz. \qquad (2)$$

Note that the KL distance is a nonsymmetric measure, i.e., $D_{KL}(\hat{p}^b \,\|\, \hat{p}^a) \neq D_{KL}(\hat{p}^a \,\|\, \hat{p}^b)$. For efficiency, the distance is calculated using only samples instead of the whole feature set. Given $N_a$ samples $\{s_i\}_{i=1 \ldots N_a}$ from appearance $a$ and $N_b$ samples $\{t_k\}_{k=1 \ldots N_b}$ from appearance $b$, Equation 2 can be approximated by the following form, given sufficient samples from the two appearances:

$$D_{KL}(\hat{p}^b \,\|\, \hat{p}^a) = \frac{1}{N_b} \sum_{k=1}^{N_b} \log \frac{\hat{p}^b(t_k)}{\hat{p}^a(t_k)}, \qquad (3)$$

where $\hat{p}^a(t_k) = \frac{1}{N_a} \sum_{i=1}^{N_a} \prod_{j=1}^{d} k\!\left(\frac{t_{kj} - s_{ij}}{\sigma_j}\right)$ and $\hat{p}^b(t_k) = \frac{1}{N_b} \sum_{i=1}^{N_b} \prod_{j=1}^{d} k\!\left(\frac{t_{kj} - t_{ij}}{\sigma_j}\right)$.

Let $\Phi^{ab}$ denote the log-likelihood ratio function, i.e., $\Phi^{ab}(u) = \log \frac{\hat{p}^b(u)}{\hat{p}^a(u)}$, where $u$ is a d-dimensional feature vector in the color-height space. Since we sample test pixels only from appearance $b$, $\hat{p}^b$ is evaluated on its own samples, so $\hat{p}^b(t_k)$ is generally equal to or larger than $\hat{p}^a(t_k)$ for all test samples $t_k$. Note that a few noisy (ambiguous) samples $t_k$ from appearance $b$ can be better matched by the reference model PDF $\hat{p}^a$ than by the test model PDF $\hat{p}^b$, so that $\hat{p}^b(t_k) < \hat{p}^a(t_k)$ and, consequently, the log-likelihood ratio $\Phi^{ab}(t_k)$ can be slightly less than zero. We force the negative log-likelihood ratios generated from those noisy samples to zero (positive correction). Then, Equation 3 can be written as $D^{+}_{KL} = \frac{1}{N_b} \sum_{k=1}^{N_b} [\Phi^{ab}(t_k)]_+$, where $[\Phi^{ab}(t_k)]_+ = \max(\Phi^{ab}(t_k), 0)$. The positively corrected KL distance $D^{+}_{KL}$ is guaranteed to be nonnegative for all samples: $D^{+}_{KL}(\hat{p}^b \,\|\, \hat{p}^a) \ge 0$, where equality holds if and only if the two density models are identical for all test samples. The direct appearance-based distance $D_a(q, p_j)$ from a query $q$ to a prototype $p_j$ is computed as $D_a(q, p_j) = D^{+}_{KL}(\hat{p}^q \,\|\, \hat{p}^{p_j})$. In conventional NN classification, $D_a(q, p_j)$ is evaluated for all prototypes $\{p_j\}_{j=1 \ldots N}$ and the minimum is chosen for classification.
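A sample-based sketch of the positively corrected KL distance follows; it reuses a kernel-density helper such as the hypothetical `kde_prob` from the previous sketch and clamps negative log-likelihood ratios to zero, as described above.

```python
import numpy as np

def kl_plus(samples_b, samples_a, bandwidths, kde_prob):
    """Positively corrected KL distance D+_KL(p^b || p^a) estimated from samples.
    kde_prob(z, samples, bandwidths) evaluates the kernel PDF of Eq. (1)."""
    ratios = []
    for t in samples_b:                                # test samples of appearance b
        p_b = kde_prob(t, samples_b, bandwidths)
        p_a = kde_prob(t, samples_a, bandwidths)
        ratios.append(max(np.log(p_b / p_a), 0.0))     # positive correction [.]_+
    return float(np.mean(ratios))
```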
3 Learning Pairwise Invariant Properties
For classifying a large number of classes, it has been noted previously in [2,3] that a one-against-all scheme has difficulty separating one class from all of the others, and often very complex classification models (probably leading to overfitting) are used for that purpose. In contrast, pairwise coupling is much easier to train since it only needs to separate one class from another. To handle the scalability problem, we perform a more detailed analysis of discriminative features between classes by estimating invariant information from pairwise comparisons.
Fig. 1. The log-likelihood ratio from $a$ to $b$ is calculated for all pixels $(x, y)$ in the test appearance $b$ to obtain the log-likelihood ratio function (or image) $\Phi^{ab}(x, y)$. Then, $\Phi^{ab}(x, y)$ is marginalized over the x-axis and normalized to obtain an invariant profile $\phi^{ab}(y)$ from $a$ to $b$. The profile $\phi^{ba}(y)$ from $b$ to $a$ is obtained in the same way. Here, normalized color-height features are used to generate the profiles.
3.1 Pairwise Invariant Profiles
As discussed above, the KL distances are calculated as an average of log-likelihood ratios (Equation 3) over all samples from the test appearance. Beyond averaging the log-likelihood ratios to evaluate distances between two appearances, we can observe an interesting property. As seen in Figure 1, if we densely sample pixels in the target appearance $b$, the resulting log-likelihood ratio function $\Phi^{ab}(x, y)$ exactly reflects the differences between the two appearances (we can treat $\Phi^{ab}$ as a function of the image pixel location $(x, y)$, i.e., $\Phi^{ab}(u) = \Phi^{ab}(u(x, y)) = \Phi^{ab}(x, y)$, where $u$ is the feature vector of pixel $(x, y)$); that is, the log-likelihood ratio function quantitatively reflects the discriminating regions (or features) between the two appearances. This motivates us to conjecture that if we had variations of these two appearances, denoted by $a'$ and $b'$, which might be captured from different cameras or at different times, the difference between those new appearances would be very similar to the case of $a$ and $b$. Consequently, the log-likelihood ratio function would be similar, i.e., $\Phi^{a'b'}(x, y) \approx \Phi^{ab}(x, y)$. To deal with shape variations due to viewpoint and pose variations and to estimate invariant information between two appearances, we project the 2D log-likelihood ratio function $\Phi^{ab}(x, y)$ onto the y-axis (i.e., marginalize the function over the x-axis) and normalize the projected 1D function to unit length:

$$\phi^{ab}(y) = C \int_0^{x_0} [\Phi^{ab}(x, y)]_+ \, dx,$$

where $x_0$ is the width of appearance $b$ and $C$ is a constant such that $\|\phi^{ab}\|^2 = \int_0^{y_0} [\phi^{ab}(y)]^2 \, dy = 1$, with $y_0$ the height of appearance $b$. We define the 1D function $\phi^{ab}$ as the normalized invariant profile from $a$ to $b$ (Figure 1).
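Computing the normalized invariant profile from a given log-likelihood ratio image then reduces to a positive clamp, a marginalization over x, and an L2 normalization; a minimal sketch, assuming $\Phi^{ab}$ is available as a 2D array with rows indexed by y, is:

```python
import numpy as np

def invariant_profile(llr_image):
    """Normalized invariant profile phi_ab(y) from the log-likelihood ratio
    image Phi_ab(x, y); rows correspond to y, columns to x."""
    clamped = np.maximum(llr_image, 0.0)      # [Phi_ab(x, y)]_+
    profile = clamped.sum(axis=1)             # marginalize over the x-axis
    norm = np.linalg.norm(profile)
    if norm == 0.0:                           # identical appearances: uniform profile
        y0 = llr_image.shape[0]
        return np.full(y0, 1.0 / np.sqrt(y0))
    return profile / norm
```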
3.2 All-Pairs Training
Suppose we have $N$ training samples (prototypes) $\{p_i\}_{i=1 \ldots N}$ labeled as $n$ different appearances. We learn normalized invariant profiles for every pair of prototypes. Hence, we produce $N \times N$ normalized invariant profiles indexed by $\{(i, j)\}_{i,j=1 \ldots N}$. Note that for two identical prototypes with index $i = j$, the log-likelihood ratio function is zero everywhere, so $\phi^{ii}(y) = \phi^{jj}(y) = 0$ for all $y \in (0, y_0)$; hence, in this case we set the profiles to the uniform profile $\phi^{ii}(y) = \phi^{jj}(y) = 1/\sqrt{y_0}$. While $\phi^{ij}$ and $\phi^{ji}$ differ for $i \neq j$, they are very similar in shape.
4 Discriminative Information-Based Distances
Given a query q, we want to match it to prototypes {pi }i=1...N using the learned set of profiles {φij }i,j=1...N . We first calculate a likelihood ratio function Φiq from every prototype pi to query q, and perform normalization (as in the learning step) to obtain a set of query profiles {φiq }i=1...N . The idea is to vote for the ID of q (which is unknown) by matching a query profile φiq to a learned profile φij for which the corresponding ID j is known. The distance D(j, q|i) between ythe query profile φiq and the learned profile φij is defined as follows: D(j, q|i) = 0 0 [φiq (y)− φij (y)]2 dy. The intuition here is that the smaller the distance D(j, q|i), the more similar the two profiles are, consequently, the more confident to vote for j as the ID of q. For each i, we calculate D(j, q|i) for all j = 1...N and then vote for the ID of q. We perform such a voting procedure for all training samples {pi }i=1...N . ∗ We can vote for the ID of q based only on the best matching profile φij for each prototype pi , i.e. the one for which ji∗ = arg minj D(j, q|i), then assign one vote for ji∗ : V (ji∗ ) + 1 −→ V (ji∗ ). This is referred to as hard voting. The voting can be performed either in a soft manner, i.e. we vote for the ID of q based on the evidence from all profiles {φij }i,j=1...N , instead of only choosing the best matching ID j ∗ (corresponding to the lowest profile distance) for each i. The soft voting-based distance Dsv (q, pj ) from query q to prototype pj is defined as: N Dsv (q, pj ) = N1 i=1 D(j, q|i). Compared to hard voting, soft voting is less sensitive to ambiguities and noise effects since it collects all possible evidence for calculating final votingbased distances instead of choosing the top one match as in hard voting. From experiments, we verify that soft voting gives better performance in recognition rate. Based on the above reasoning, we compute the (indirect) discriminative information-based distance from the a query q to a prototype pj as: Dd (q, pj ) = Dsv (q, pj ).
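With the query profiles and the learned profiles stored as arrays, the soft voting-based distance is a straightforward average of squared profile differences; a sketch of this computation, under the assumed data layout noted in the comments, is:

```python
import numpy as np

def soft_voting_distance(query_profiles, learned_profiles, j):
    """D_sv(q, p_j) = (1/N) * sum_i D(j, q | i), where D(j, q | i) is the squared
    L2 difference of phi_iq and phi_ij.
    query_profiles:   (N, y0) array, row i holds phi_iq
    learned_profiles: (N, N, y0) array, entry [i, j] holds phi_ij"""
    n = query_profiles.shape[0]
    diffs = query_profiles - learned_profiles[:, j, :]   # (N, y0)
    return float(np.sum(diffs ** 2) / n)
```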
5 Classification and Recognition
As discussed previously, traditional nearest neighbor and k-NN methods directly use $D_a$ as the distance from a query to a prototype, and classification is performed by finding the minimum distance or by majority voting among the k nearest neighbors. $D_a$ only considers information between a query and a prototype, while $D_d$ only considers inter-relations between different training samples; that is, the two distances are based on independent information. This leads to the idea that combining the two would boost recognition performance. We tested two parameterized distance measures $D_1$ and $D_2$ involving linear and nonlinear combinations:

$$D_1(q, p_j) = (1 - \alpha)\, D_a(q, p_j) + \alpha\, D_d(q, p_j), \qquad D_2(q, p_j) = D_a^{1-\beta}(q, p_j)\, D_d^{\beta}(q, p_j).$$

We learn the parameters $\alpha$ and $\beta$ by evaluating the overall recognition rates using a large number of labeled test samples. Experiments show that the optimal parameter estimates are as listed in Table 1. The parameters can be selected flexibly around the optimal values ($\pm 0.1$) without performance degradation.
Fig. 2. An illustration of the invariance of pairwise normalized profiles, and a comparison of direct, indirect, and combined distances for an example of 10 prototypes (with different labels) and one query. Top: the test profiles of appearance $q$ are very similar to the learned profiles of appearance $c$ (the true classification of $q$), while they differ largely for other appearances such as $a$ and $b$. Bottom: the distances $D_a$, $D_{sv}$, $D_1$, $D_2$ from $q$ to all prototypes are evaluated and the relative margins are compared. All distance measures result in a correct top-one recognition, while the combined distance measures $D_1$ and $D_2$ yield larger relative margins than $D_a$ and $D_{sv}$.

Table 1. Learned optimal parameter values for the combined distance measures $D_1$ and $D_2$. A total of 180 labeled test samples were used against 180 training samples of different appearances to learn the parameters $\alpha$ and $\beta$.

  Feature space                Learned parameters
  Normalized Color + Height    α = 0.24, β = 0.20
  Color Rank + Height          α = 0.75, β = 0.30
Given a query $q$, we want to estimate its unknown class label (person ID) by calculating the distances between the query and all labeled prototypes. Classification is done by the nearest neighbor rule using one of the combined distance measures $D_1$ and $D_2$: $j^* = \arg\min_j D_{combine}(q, p_j)$. Figure 2 shows an illustration of the invariance of pairwise dissimilarity profiles and an example of recognition using different distance measures.
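A sketch of the combined distances and the final nearest neighbor decision, assuming the direct distances $D_a$ and the discriminative distances $D_d$ to all prototypes have already been computed (α and β defaults taken from Table 1), could look like this:

```python
import numpy as np

def classify(d_a, d_d, alpha=0.24, beta=0.20, combine="D1"):
    """Nearest neighbor classification with the combined distance D1 or D2.
    d_a, d_d: length-N arrays of direct and discriminative distances."""
    if combine == "D1":                              # D1 = (1 - alpha) Da + alpha Dd
        d = (1.0 - alpha) * d_a + alpha * d_d
    else:                                            # D2 = Da^(1 - beta) * Dd^beta
        d = d_a ** (1.0 - beta) * d_d ** beta
    return int(np.argmin(d))                         # index j* of the matched prototype
```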
6 Implementation Details
The bandwidths for each dimension of the 4D feature spaces are listed in Table 2. The bandwidths are generally estimated as 2% of the range of the corresponding channel and adjusted slightly by repeated trial-and-error. We resize image patches of a person to a fixed height of 50 pixels ($y_0 = 50$). Note that all parameters, including the bandwidths for the different feature spaces, the height
Table 2. Bandwidths estimated and used for our experiments

  Feature space                Channel bandwidths
  Normalized Color + Height    σr = σg = 0.02, σs = 20, σh = 1
  Color Rank + Height          σrR = σrG = σrB = 4, σh = 1
sampling rate, and the distance model parameters $\alpha$ and $\beta$, are fixed to the listed values throughout the experimental evaluation. For efficiently matching people in video, we select key frames (appearances) as prototypes of the training and testing sequences and recognize appearances based on these prototypes. The process is as follows. The first frame ($t = 0$) is automatically selected as the first key frame $K_1$. Then, we calculate the symmetric KL distances of the subsequent frames ($t = 1, 2, \ldots$) to all current key frames $\{K_j\}_{j=1 \ldots i}$. The symmetric KL distance is defined as $D_{sKL}(\hat{p}^b, \hat{p}^a) = \min\!\left(D^{+}_{KL}(\hat{p}^b \,\|\, \hat{p}^a),\, D^{+}_{KL}(\hat{p}^a \,\|\, \hat{p}^b)\right)$. For the current frame $t \ge 1$, if all symmetric KL distances $\{D_{sKL}(\hat{p}^t, \hat{p}^{K_j}),\, j = 1 \ldots i\}$ are greater than a fixed threshold $\tau = 1.5$ (the threshold is estimated such that, on average, three key frames are selected for each track of about 30 frames), frame $t$ becomes the next key frame $K_{i+1}$ and is added to the set of current key frames. In this way, frames with large information gain, i.e., new information, are selected, and frames that are not selected can be explained by one of the key frames with a bounded deviation in the symmetric KL distance. We preprocess the videos by codebook-based background subtraction [25] and neighborhood-based noise removal to obtain a single connected component per frame. Person sub-images (rectangular patches) are extracted from the bounding boxes of the foreground regions.
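The key frame selection procedure can be sketched as follows; it relies on a symmetric-KL helper built from the hypothetical `kl_plus` of the earlier sketch, and the per-frame appearance models are assumed to be given as sample matrices.

```python
def select_key_frames(frame_samples, bandwidths, kde_prob, kl_plus, tau=1.5):
    """Greedy key frame selection with the symmetric KL distance and threshold tau.
    frame_samples: list of (N_s, d) sample matrices, one per frame."""
    def sym_kl(a, b):
        return min(kl_plus(a, b, bandwidths, kde_prob),
                   kl_plus(b, a, bandwidths, kde_prob))

    key_frames = [frame_samples[0]]                  # the first frame is always K1
    for f in frame_samples[1:]:
        if all(sym_kl(f, k) > tau for k in key_frames):
            key_frames.append(f)                     # new information: add key frame
    return key_frames
```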
7 Experimental Results
We use the Honeywell Appearance Datasets for the experimental evaluation. The classification and recognition performance is quantitatively analyzed in terms of the Cumulative Match Curve (CMC) and the expected relative margin. The expected relative margin $\delta$ is defined as the geometric mean of the relative margins over all test samples:

$$\delta = \left( \prod_{i=1}^{N_T} \frac{d_{i2}}{d_{i1}} \right)^{1/N_T},$$

where $N_T$ denotes the number of test samples, $d_{i1}$ denotes the distance of query $q_i$ to the correct (same-class) prototype, and $d_{i2}$ denotes the distance of query $q_i$ to the closest incorrect (different-class) prototype. Our test data include videos of 61 individuals taken by two overlapping cameras widely separated in space. Figure 3 shows samples from all 61 appearances. The dataset contains many appearance ambiguities, because a limited number of people were used to create a large number of 'appearance' classes by partially changing the clothing of each person. There are also significant illumination, viewpoint, and pose changes across the two cameras.
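For completeness, the expected relative margin can be computed from a test-to-prototype distance matrix as a geometric mean; a sketch follows, in which taking the closest correct-class prototype as $d_{i1}$ is an interpretation on our part.

```python
import numpy as np

def expected_relative_margin(dist, proto_labels, true_labels):
    """dist[i, j]: distance of test sample i to prototype j.
    Returns the geometric mean of d_i2 / d_i1 over all test samples."""
    proto_labels = np.asarray(proto_labels)
    margins = []
    for i, y in enumerate(true_labels):
        same = proto_labels == y
        d1 = dist[i, same].min()      # distance to the correct-class prototype
        d2 = dist[i, ~same].min()     # distance to the closest incorrect prototype
        margins.append(d2 / d1)
    return float(np.exp(np.mean(np.log(margins))))   # geometric mean
```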
(a) appearances from cam1
(b) appearances from cam2
Fig. 3. List of sample appearances taken under two overlapping and widely separated cameras
Fig. 4. Recognition performance with respect to the increasing number of persons involved in training and testing. Note that, due to space limitation, only plots for cases N = 30, 61 are shown. (‘norm’: normalized color feature, ‘rank’: color rank feature, ‘direct’: appearance-based distance Da , ‘soft voting’: discriminative information-based distance Dd = Dsv , ‘combine1’ and ‘combine2’: combined distances D1 , D2 .)
Note that in all the following experiments, we compare our approach with the baseline technique (i.e., the nearest neighbor classifier) using the same type of features. Our approach is labeled 'combine' and the baseline technique is labeled 'direct'. Scalability: To show the scalability of our voting-based and combined approaches, we evaluated cumulative recognition rates and expected relative margins for an increasing number ($N$ = 10, 20, 30, 40, 50, 60, 61) of individuals involved in training and testing (Figure 4). For each of the 61 individuals, we selected one key frame from one camera for training and one key frame from the other camera for testing, and evaluated the performance using different features and different training-testing configurations (cam1 to cam2, cam2 to cam1). The results show that our combined approaches consistently perform better than the direct appearance-based method. More importantly, we note that our voting-based approach and
Fig. 5. Recognition performance with respect to the changing number of training samples (prototypes) per class

Table 3. Comparison of top-one recognition rates with the state of the art. Our combined approaches with (Color Rank + Height) features are marked in bold.

  Approach                        Persons    Avg. recog. rate
  norm-direct-appearance          61         0.56
  norm-soft voting                61         0.35
  norm-combine1                   61         0.59
  norm-combine2                   61         0.61
  rank-soft voting                61         0.71
  rank-direct-appearance          61         0.84
  rank-combine1                   61         0.88
  rank-combine2                   61         0.89
  [24] rank-path length           61         0.85
  [19] model fitting              44         0.59
  [20] shape & appearance context 99         0.82
combined approaches show only very small degradations with increasing numbers of people, while the direct methods result in a large degradation in recognition performance. This is most apparent for the color rank features. We can see that our discriminative information-based combined distances are more useful for recognizing a large number of classes than the direct appearance-based distances. Effects of Number of Training Samples: We compared the recognition performance over the number of training samples per class (Figure 5). We selected three key frames for each of 30 individuals from camera 2 as training samples and three key frames for each of 30 individuals from camera 1 as test samples. The figure shows that increasing the number of training samples per class improves the cumulative recognition rates and relative margins significantly. Comparison with Previous Approaches: We also compared the performance of our voting-based and combined approaches to the direct appearance-based approach in terms of average top-one recognition rates on 61 individuals. For both the (Normalized Color + Height) and (Color Rank + Height) feature spaces, our combined approaches improve the top-one recognition rates of the direct appearance-based method by 5-6% (equivalently, the error rate is reduced by 31%) (Table 3). Results on the same dataset with the same number
of individuals are compared to those of Yu et al. [24] (the result of [24] is obtained from their most recent experiments). Using a similar number of pixels (500 samples) per appearance, our combined approach obtains a 4% better top-one recognition rate (equivalently, the error rate is reduced by 27%) and is 5-10 times faster than [24]. This is because computing path-length features for all samples significantly slows down the process; we use the much simpler normalized height feature and achieve better performance. Indirect comparisons (Table 3) to Gheissari et al. [19] and Wang et al. [20] on datasets with a similar number of people (the publicly unavailable datasets used in [19,20] contain 44 and 99 individuals, respectively) show that our approach is comparable to the state of the art in person recognition. The computational complexity of our learning algorithm is $O(N^2)$, while the complexity of our testing algorithm for a single query image is only $O(N)$, the same as for the traditional nearest neighbor method. (Using about 500 samples per appearance, the learning time for 61 appearances is about 2 minutes, and matching a single query image to 61 prototypes takes less than 2 seconds in C++ on an Intel Xeon 2.40 GHz machine.)
8 Conclusion
We proposed a pairwise comparison-based learning and classification approach for appearance-based person recognition. We used a learned set of pairwise invariant profiles to adaptively calculate distances from a query to the prototypes so that the expected relative margins are improved. The combined distances based on appearance and discriminative information lead to significant improvements over pure appearance-based nearest neighbor classification. We also experimentally validated the scalability of our approach to larger numbers of categories.
Acknowledgement This research was funded in part by the U.S. Government VACE program. The authors would like to thank Yang Yu for discussion and help on this work.
References

1. Nakajima, C., Pontil, M., Heisele, B., Poggio, T.: Full-body person recognition system. Pattern Recognition 36, 1997–2006 (2003)
2. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. In: NIPS (1997)
3. Roth, V., Tsuda, K.: Pairwise coupling for machine recognition of handprinted Japanese characters. In: CVPR (2001)
4. Wu, T., Lin, C., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, 975–1005 (2004)
5. Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 263–286 (1995)
6. Amit, Y., Geman, D., Wilder, K.: Joint induction of shape features and tree classifiers. IEEE Trans. PAMI 19, 1300–1305 (1997)
The result of [24] is obtained from their most recent experiments. In [19, 20], Datasets (publicly unavailable) of 44 and 99 individuals are used. Using about 500 samples per appearance, the learning time for 61 appearances is about 2 minutes, and the time for matching a single query image to 61 prototypes is less than 2 seconds in C++ on a Intel Xeon CPU 2.40GHz machine.
34
Z. Lin and L.S. Davis
7. Moghadam, B., Jebara, T., Pentland, A.: Bayesian face recognition, MERL Technical Report, TR-2000-42 (2002) 8. Zhang, H., Berg, A., Maire, M., Malik, J.: SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In: CVPR (2006) 9. Bosch, A., Zisserman, A., Munoz, X.: Image classification using random forests and ferns. In: ICCV (2007) 10. Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification. IEEE Trans. PAMI 18, 607–616 (1996) 11. Domeniconi, C., Gunopulos, D.: Adaptive nearest neighbor classification using support vector machines. In: NIPS (2001) 12. Xing, E., Ng, A., Jordan, M., Russell, S.: Distance metric learning, with application to clustering with side information. In: NIPS (2002) 13. Shalev-Shwartz, S., Singer, Y., Ng, A.: Online and batch learning of pseudo metrics. In: ICML (2004) 14. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. In: NIPS (2005) 15. Schultz, M., Joachims, T.: Learning a distance metric from relative comparisons. In: NIPS (2003) 16. Frome, A., Singer, Y., Sha, F., Malik, J.: Learning globally-consistent local distance functions for shape-based image retreval and classification. In: ICCV (2007) 17. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IJCV 25, 64– 577 (2003) 18. Elgammal, A., Davis, L.S.: Probabilistic tracking in joint feature-spatial spaces. In: ICCV (2003) 19. Gheissari, N., Sebastian, T., Tu, P., Rittscher, J., Hartley, R.: Person reidentification using spatiotemporal appearance. In: CVPR (2006) 20. Wang, X., Doretto, G., Sebastian, T., Rittscher, J., Tu, P.: Shape and appearance context modeling. In: ICCV (2007) 21. Li, J., Zhou, S.K., Chellappa, R.: Appearance context modeling under geometric context. In: ICCV (2005) 22. Scott, D.: Multivariate density estimation. Wiley Interscience, Hoboken (1992) 23. Zhao, L., Davis, L.S.: Iterative figure-ground discrimination. In: ICPR (2004) 24. Yu, Y., Harwood, D., Yoon, K., Davis, L.S.: Human appearance modeling for matching across video sequences. Mach. Vis. Appl. 18, 139–149 (2007) 25. Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L.S.: Real-time foregroundbackground segmentation using codebook model. Real-Time Imaging 11, 172–185 (2005)
Edge-Based Template Matching and Tracking for Perspectively Distorted Planar Objects

Andreas Hofhauser, Carsten Steger, and Nassir Navab
TU München, Boltzmannstrasse 3, 85748 Garching bei München, Germany
MVTec Software GmbH, Neherstrasse 1, 81675 München, Germany
Abstract. This paper presents a template matching approach to the high-accuracy detection and tracking of perspectively distorted objects. To this end, we propose a robust match metric that allows significant perspective shape changes. Using a coarse-to-fine representation for the detection of the template further increases efficiency. Once a template is detected at an interactive frame rate, we immediately switch to tracking with the same algorithm, enabling detection times of only 20 ms. We show in a number of experiments that the presented approach is not only fast, but also very robust and highly accurate in detecting the 3D pose of planar objects or planar sub-parts of non-planar objects. The approach is used in augmented reality applications that could not be solved sufficiently until now, because existing approaches either need extensive training data, like machine learning methods, or rely on interest point extraction, like descriptor-based methods.
1 Introduction
Methods that exhaustively search for a template in an image are among the oldest computer vision algorithms used to detect an object in an image. However, the mainstream vision community has abandoned the idea of an exhaustive search because of two commonly articulated prejudices: first, that object detection based on template matching is slow, and second, that object detection based on template matching is extremely inefficient for, e.g., perspective distortions, where an 8-dimensional search space must be evaluated. In our work, we address these issues and show that, with several contributions, it is possible to benefit from the robustness and accuracy of template matching even when an object is perspectively distorted. Furthermore, we show in a number of experiments that it is possible to achieve an interactive rate of detection that was until now only possible with descriptor-based approaches. In fact, if the overall search range of the pattern is restricted, real-time detection is possible on current standard PC hardware. Furthermore, since we have an explicit representation of the geometric search space, we can easily restrict it and therefore use the proposed method for high-speed tracking. This is in strong contrast to many other approaches (e.g., [1]), in which a difficult decision has to be made about when to switch between a detection and a tracking algorithm. One of the key contributions of any new template matching algorithm is the image metric that is used to
compare the model template with the image content. The design of this metric determines the overall behavior, and its evaluation dominates the run-time. One key contribution of the proposed metric is that we explicitly distinguish between model contours that give us point correspondences and contours that suffer from the aperture problem. This allows us to detect, e.g., an assembly part that contains only curved contours. For these kinds of objects, descriptor-based methods, which excel because they allow for perspective shape changes, notoriously fail if the image contains too little texture or only a small amount of repetitive texture, as in Figure 1.
Fig. 1. An object typically encountered in assembly scenarios. A planar sub-part of a non-planar object is taken as the model region, and the detection results are depicted as the white contours. The different views lead to significant non-linear illumination changes and perspective distortion. The object contains only curved contours; hence, the extraction of discriminative point features is a difficult task.
1.1
Related Work
We roughly classify algorithms for pose detection into template matching and descriptor-based methods. In the descriptor-based category, the general scheme is to first determine discriminative “high-level” features, extract discriminative descriptors from the surroundings of these feature points, and establish the correspondence between model and search image by classifying the descriptors. The big advantage of this scheme is that the run-time of the algorithm is independent of the degrees of freedom of the geometric search space. Recent prominent examples that fall into this category are [2,3,4,5]. While showing outstanding performance in several scenarios, they fail if the object has only highly repetitive texture or only sparse edge information: the feature descriptors overlap in feature space and are no longer discriminative. In the template matching category, we subsume algorithms that perform an explicit search. Here, a similarity measure based either on intensities (like SAD, SSD, NCC, and mutual information) or on gradient features is evaluated. However, the evaluation of intensity-based metrics is computationally expensive. Additionally, they are typically not invariant against nonlinear illumination changes, clutter, or occlusion.
For the case of feature-based template matching, only a sparse set of features between template and search image is compared. While extremely fast and robust if the object undergoes only rigid transformations, these methods become intractable for a large number of degrees of freedom, e.g., when an object is allowed to deform perspectively. Nevertheless, one approach for feature-based deformable template matching is presented in [6], where the final template is chosen from a learning set while the match metric is evaluated. Because obtaining a learning set and applying a learning step is problematic for our applications, we prefer not to rely on training data except for the original template. In contrast to this, we use a match metric that allows for local perspective deformations while preserving robustness to illumination changes, partial occlusion, and clutter. While we found a match metric with normalized directed edge points in [7,8] for rigid object detection, and also for articulated object detection in [9], its adaptation to 3D object detection is new. A novelty of the proposed approach is that the search method takes the search results for all parts into account at the same time. Despite the fact that the model is decomposed into sub-parts, the relevant size of the model that is used for the search at the highest pyramid level is not reduced. Hence, the presented method does not suffer from the speed limitations that prior methods incur from a reduced number of pyramid levels. This is in contrast to, e.g., a component-based detection like [9], which could conceptually also be adapted for perspective object detection. There, small sub-parts must be detected, leading to a lower number of pyramid levels that can be used to speed up the search.
2
Perspective Shape-Based Object Detection
In the following, we detail the perspective shape-based model generation and matching algorithm. The problem that this algorithm solves is particularly difficult because, in contrast to optical flow, tracking, or medical registration, we assume neither temporal nor local coherence. While the locations of the objects are determined with the robustness of a template matching method, we avoid the need to expand the full search space, as if it were a descriptor-based method. 2.1
Shape Model Generation
For the generation of our model, we decided to rely on the result of a simple contour edge detection. This allows us to represent objects from template images as long as there is any intensity change. Note that in contrast to corners or interest point features, we can model objects that contain only curved contours (see detection results in Figure 1). Furthermore, directly generating a model from an untextured CAD format is possible in principle. Our shape model M is composed of an unordered set of edge points

$M = \{ (x_i, y_i, d_i^m, c_{ji}, p_j) \mid i = 1 \ldots n,\ j = 1 \ldots k \}.$  (1)
Here, x_i and y_i are the row and column coordinates of the n model points. d_i^m denotes the gradient direction vector at the respective row and column coordinate of the template. We assume that spatially coherent structures stay the same even after a perspective distortion. Therefore, we cluster the model points with expectation-maximization-based k-means such that every model point belongs to one of k clusters. The indicator matrix c_ji maps clusters to model points (entry 0 if the point is not in the cluster, entry 1 otherwise) and allows us to access the model points of each cluster efficiently at run-time. For the later detection we have to distinguish whether a cluster of the model can be used as a point feature (which gives us two equations, for the x and y location) or only as a contour line feature (which suffers from the aperture problem and gives only one equation). This is a label that we save for each cluster in p_j. We detect this by analyzing whether a part contains only one dominant gradient direction or gradient directions pointing in various, e.g., perpendicular, directions. For this, we determine a feature that describes the main gradient direction of each of the k clusters of the model:

$\mathrm{ClusterDirection}_j = \frac{\sum_{i=1}^{n} c_{ji}\, \frac{d_i^m}{\|d_i^m\|}}{\sum_{i=1}^{n} c_{ji}}.$  (2)

If a model part contains only direction vectors in one dominant direction, the length of the resulting ClusterDirection vector has the value one or, in the case of noise, slightly below one. In all other cases, e.g., a corner-like shape or opposing gradient directions, the length of ClusterDirection is significantly smaller than one. For a straight edge with a contrast change, the sign change of the gradient polarity gives us an additional constraint that prevents movement of that cluster along the edge, and therefore we assign it a point feature label. Because the model generation relies on only one image and the calculation of the clusters is realized efficiently, this step needs less than a second even for models with thousands of points. This is an advantage for users of a computer vision system, as an extended offline phase would make the use of a model generation algorithm cumbersome.
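The cluster classification just described can be summarized in a short sketch (a minimal Python/NumPy illustration under our reading of Eq. (2), not the authors' implementation; the cutoff on the vector length is an assumed example value, and the polarity-change rule for straight edges is omitted for brevity):

```python
import numpy as np

def classify_clusters(directions, cluster_ids, k, length_thresh=0.9):
    """Classify each of the k clusters as a point or a line feature.

    directions  : (n, 2) array of gradient direction vectors d_i^m
    cluster_ids : (n,) array mapping each model point to its cluster j
    length_thresh is a hypothetical cutoff on ||ClusterDirection_j||;
    values near 1 indicate a single dominant direction (line feature).
    """
    unit = directions / (np.linalg.norm(directions, axis=1, keepdims=True) + 1e-12)
    labels = []
    for j in range(k):
        members = unit[cluster_ids == j]
        # Eq. (2): mean of the normalized gradient directions in cluster j.
        cluster_direction = members.mean(axis=0)
        if np.linalg.norm(cluster_direction) >= length_thresh:
            labels.append("line")   # one dominant direction: aperture problem
        else:
            labels.append("point")  # varied directions: full 2D constraint
    return labels
```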
2.2
Metric Based on Local Edge Patches
Given the generated model, the task of the perspective shape matching algorithm is to extract instances of that model in new images. Therefore, we adapted the match metric of [8]. This match metric is designed such that it is inherently invariant against nonlinear illumination changes, partial occlusion and clutter. If the location of an object is described by x, y (this is 2D translation only, but the formulas can easily be extended for, e.g., 2D affine transformations), the score function for rigid objects reads as follows:

$s(x, y) = \frac{1}{n} \sum_{i=1}^{n} \frac{\langle d_i^m,\, d^s_{(x+x_i,\, y+y_i)} \rangle}{\|d_i^m\| \cdot \|d^s_{(x+x_i,\, y+y_i)}\|}\,,$  (3)

where d^s is the direction vector in the search image, ⟨·,·⟩ is the dot product, and ‖·‖ is the Euclidean norm. The point set of the model is compared to a dense gradient
direction field of the search image. Even with significant nonlinear illumination changes that propagate to the gradient amplitude, the gradient direction stays the same. Furthermore, a hysteresis threshold or non-maximum suppression is completely avoided in the search image, resulting in true invariance against arbitrary illumination changes.¹ Partial occlusion, noise, and clutter result in random gradient directions in the search image. These effects lower the maximum of the score function but do not alter its location. Hence, the semantic meaning of the score value is the ratio of matching model points. It is interesting to note that comparing the cosine between the gradients leads to the same result, but calculating this formula with dot products is several orders of magnitude faster. The idea of extending this metric for 3D object detection is that we instantiate globally only similarity transformations. By allowing successive small movements of the parts of the model, we implicitly evaluate a much higher class of nonlinear transformations, like perspective distortions. Following this argument, we distinguish between an explicit global score function s_g, which is evaluated for, e.g., affine 2D transformations,² and a local implicit score function s_l, which allows for local deformations. The global score function s_g is the sum of the contributions of all the clusters:

$s_g(x, y) = \frac{1}{n} \sum_{j=1}^{k} s_l(x, y, j).$  (4)

We assume that even after a perspective distortion the neighborhood of each model point stays the same and is approximated by a local Euclidean transformation. Hence, we instantiate local Euclidean transformations T for each cluster and apply them to the model points of that cluster in a small local neighborhood. If the cluster is a point feature, we search for the optimal score in a 5×5 pixel window. In the case of a line feature, we search 2 pixels in both directions of ClusterDirection_j of the respective cluster. The local score is then the maximum alignment of gradient direction between the locally transformed model points of each cluster and the search image. Accordingly, the proposed local score function s_l is:
$s_l(x, y, j) = \max_{T} \sum_{i=1}^{\mathrm{size}(j)} \frac{\langle d^m_{c_{ji}},\, d^s_{(x+T(x_{c_{ji}}),\, y+T(y_{c_{ji}}))} \rangle}{\|d^m_{c_{ji}}\| \cdot \|d^s_{(x+T(x_{c_{ji}}),\, y+T(y_{c_{ji}}))}\|}$  (5)
Here, the function size returns the number of elements in cluster j. For the sake of efficiency, we exploit the mapping that was generated in the offline phase for accessing the points in each cluster (the c_ji matrix). Furthermore, we cache T(x_cji) and T(y_cji) since they are independent of x and y.
¹ Homogeneous regions and regions that are below a minimum contrast change (e.g., less than 3 gray values) can optionally be discarded, as they give random directions due to noise. However, this is not needed conceptually; it only gives a small speedup.
² For the sake of clarity, we write the formulas only for 2D translation. They can easily be extended for, e.g., 2D rotation, scaling, and anisotropic scaling, as is done in our implementation.
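To clarify the structure of Eqs. (3)–(5), a minimal NumPy sketch of the score evaluation is given below (the candidate offsets stand in for the local transformations T, and all variable names are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def local_score(search_dirs, model_pts, model_dirs, x, y, offsets):
    """Eq. (5): best alignment of one cluster over candidate offsets T.

    search_dirs : (H, W, 2) dense gradient direction field of the search image
    model_pts   : (m, 2) integer (x, y) coordinates of the cluster's points
    model_dirs  : (m, 2) gradient directions of the cluster's points
    offsets     : list of (dx, dy) displacements standing in for the local
                  transformations T (a simplification of the paper's T)
    """
    best = -np.inf
    m_unit = model_dirs / (np.linalg.norm(model_dirs, axis=1, keepdims=True) + 1e-12)
    for dx, dy in offsets:
        xs = model_pts[:, 0] + x + dx
        ys = model_pts[:, 1] + y + dy
        s = search_dirs[ys, xs]                      # (m, 2) image directions
        s_norm = np.linalg.norm(s, axis=1) + 1e-12   # avoid division by zero
        score = np.sum(np.sum(m_unit * s, axis=1) / s_norm)
        best = max(best, score)
    return best

def global_score(search_dirs, clusters, x, y, n_points, offsets):
    """Eq. (4): sum of the local cluster scores, normalized by n."""
    total = sum(local_score(search_dirs, pts, dirs, x, y, offsets)
                for pts, dirs in clusters)
    return total / n_points
```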
2.3
Perspective Shape Matching
After defining an efficient score function that tolerates shape changes, we integrated it into a general purpose object detection system. We decided to alter the conventional template matching algorithm such that it copes with perspectively distorted planar 3D objects.
Fig. 2. Schematic depiction of the shape matching algorithm. The original model consists of the rectangle. The first distorted quadrilateral is derived from the parent of the hypothesis. The local displacements T, depicted as arrows, bring the warped template to a displaced position, and the fitted homography aligns the parts of the model again.
Hence, the perspective shape matching algorithm first extracts an image pyramid of incrementally zoomed-down versions of the original search image. At the highest pyramid level, we extract the local maxima of the global score function s_g (4). The local maxima of s_g are then tracked through the image pyramid until either the lowest pyramid level is reached or no match candidate is above a certain score value. While tracking the candidates down the pyramid, a rough alignment has already been extracted during evaluation of the current candidate's parent on a higher pyramid level. Therefore, we first use the alignment originating from the candidate's parent to warp the model. Then, starting from this warped candidate, the local transformations T that give the maximal score yield a locally optimal displacement of each cluster in the image. Since we are interested in perspective distortions of the whole object, we fit a homography with the normalized DLT algorithm [10] that maps the original cluster centers to the displaced cluster centers given T. Depending on whether a cluster has been classified as a point feature or a line feature, each correspondence gives two equations or one equation in the DLT matrix.³ Then we iteratively warp the whole model with the extracted update and refine the homography on each pyramid level until the update becomes close to the identity or a maximal number of iterations is reached. Up to now, the displacement of each part T is discretized up to pixel resolution. However, as the total pose of the object is determined by the displacements of many clusters, we typically obtain a very precise position of the object. To reach the high accuracy and precision that are a requirement in many computer vision applications, we extract subpixel-precise edge points in the search image and determine correspondences for each model point. Given the subpixel-precise correspondences, the homography is again iteratively updated until convergence. Here, we minimize the distance of the tangents of the model points to the subpixel-precise edge points in the image. Hence, each model edge point gives one equation, since it is a line feature.

³ Typically, the DLT equations are highly overdetermined. We evaluated a solution with the SVD, as described in [10], and a solution that uses the eigenvalue decomposition of the normal equations directly. As there is no noticeable difference in robustness, despite the expected quadratically worse condition number for the eigenvalue decomposition, and the eigenvalue version is faster, we use the solution with the normal equations in the following discussion. To give an impression of the difference, in one example sequence the whole object detection runs in 150 ms with the SVD and in 50 ms with the eigenvalue decomposition.
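The homography fit used at each refinement step can be sketched with a standard normalized DLT (shown here for point correspondences only, as a simplification; the paper additionally uses line-feature correspondences, which contribute one equation each):

```python
import numpy as np

def normalize(pts):
    """Similarity transform that centers pts and scales the mean distance to sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / (np.mean(np.linalg.norm(pts - mean, axis=1)) + 1e-12)
    T = np.array([[scale, 0, -scale * mean[0]],
                  [0, scale, -scale * mean[1]],
                  [0, 0, 1]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T[:, :2], T

def fit_homography_dlt(src, dst):
    """Estimate H with dst ~ H * src from point correspondences (normalized DLT)."""
    src_n, T_src = normalize(src)
    dst_n, T_dst = normalize(dst)
    A = []
    for (x, y), (u, v) in zip(src_n, dst_n):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    Hn = Vt[-1].reshape(3, 3)
    H = np.linalg.inv(T_dst) @ Hn @ T_src   # undo the normalizations
    return H / H[2, 2]
```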
2.4
Perspective Shape Tracking
Once a template is found in an image sequence, we restrict the search space for the subsequent frames. This is valid in, e.g., many augmented reality scenarios, in which the inter-frame rotations can be assumed to be less than 45 degrees and the scale change of the object to lie between 0.8 and 1.2. Once a track is lost, we switch back to detection by expanding the search range to its original size. It is important to note that we use the same algorithm for tracking and detection and only parametrize it differently. Hence, we exploit prior knowledge to speed up the search if it is available. However, our template matching is not restricted to tracking alone, like, e.g., [11], but can be used for object detection when needed.
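The switch between tracking and detection then reduces to re-parameterizing the same search, which can be sketched as follows (the data structure, the full-range values, and the fallback logic are illustrative assumptions; only the 45-degree and 0.8–1.2 ranges come from the text above):

```python
from dataclasses import dataclass

@dataclass
class SearchRange:
    angle_deg: tuple   # allowed in-plane rotation range
    scale: tuple       # allowed scale range

# Placeholder full detection ranges (assumed values, not from the paper).
FULL_RANGE = SearchRange(angle_deg=(-180.0, 180.0), scale=(0.5, 2.0))

def next_search_range(last_pose, track_lost):
    """Restrict the search space around the last pose while tracking and
    fall back to the full detection range once the track is lost."""
    if track_lost or last_pose is None:
        return FULL_RANGE
    return SearchRange(
        angle_deg=(last_pose.angle - 45.0, last_pose.angle + 45.0),
        scale=(0.8 * last_pose.scale, 1.2 * last_pose.scale))
```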
3
Experiments
For the evaluation of the proposed object detection and tracking algorithm, we conducted experiments under various real world conditions. 3.1
Benchmark Test Set
For comparison with other approaches, we used sample images from publicly available benchmark datasets⁴ (see Figure 3). The graffiti sequence is used, for instance, to evaluate how much perspective distortion a descriptor-based approach can tolerate. The last two depicted images are a challenge for many descriptor-based approaches. Another interesting comparison is with the phone test sequence provided in [12]. There, Lucas-Kanade, SIFT, and the method of [12] are reported to sometimes lose the object. The proposed algorithm is able to process the sequence at 60 ms without once losing the object. We attribute this to the fact that we explicitly represent contour information and not just interest point features, and to the fact that we are able to search exhaustively for the object.
⁴ http://www.robots.ox.ac.uk/~vgg/data/data-aff.html
Fig. 3. Benchmark data set with detected template used for evaluations of the method
Fig. 4. Sample images used for different real world experimental evaluations. The fitted edge model points are depicted.
3.2
Industrial Robot Experiments
To evaluate the accuracy of the 3D object detection, we equipped a 6-axis industrial robot with a calibrated camera (12 mm lens, 640×480 pixels) mounted at its gripper. Then we taught the robot a sequence of different poses in which the camera–object distance changes in the range of 30–50 cm and significant latitude changes must be compensated. First, we determined the repeatability of the robot in order to rule out drift between different experiment runs. To this end, we made the robot drive the sequence several times and determined the pose of the camera in the different runs with an industrial calibration grid seen by the camera. The maximal difference between the poses of different runs was a distance error of 0.0009 between the estimated and ground truth translational positions and an angle of 0.04 degrees to bring the estimated rotation to the true pose. Then, we manually placed an object at the same place as the calibration grid and used the planar shape matching with a bundle-adjustment pose estimation to determine the pose of the object (see Figure 4). The maximal difference to the poses extracted with the calibration grid was below 0.01 normalized distance and 0.4 degrees in angle. It is interesting to note that the remaining pose error is dominated by the depth uncertainty. The maximal errors occur when the images suffer severe illumination changes or are out of focus. When these situations are prevented, the translation error is below 0.005 normalized distance and the angle error below 0.2 degrees.
Fig. 5. Sample images from a longer sequence used in the experimental evaluation. The first image is used for generating the template of the house. Further movies can be viewed on the web site: http://campar.in.tum.de/Main/AndreasHofhauser.
3.3
Registering an Untextured Building
To evaluate the method for augmented reality scenarios, we used the proposed approach to estimate the position of a building imaged by a shaking hand-held camera (see Figure 5). Despite huge scale changes, motion blur, and parallax effects, the facade of the building is robustly detected and tracked. We think that this is a typical sequence for mobile augmented reality applications, e.g., the remote-expert scenario. Furthermore, due to the fast model generation, the approach is particularly useful for, e.g., mobile navigation, in which the template for the object detection must be generated instantly. To sum up the experimental evaluation, we think that the results are very encouraging in terms of speed, robustness, and accuracy compared to other methods that have been published in the literature. Applications that up to now relied on interest points can replace their current object detection without the disadvantage of reduced robustness with regard to, e.g., attitude changes. Further, these applications can benefit from greater robustness and accuracy, particularly for detecting untextured objects. Since the acquisition of the test sequences was a time-consuming task, and to stimulate further research and comparisons, we will make the test data available upon request.
4
Conclusion
In this paper we presented a single-camera solution for planar 3D template matching and tracking that can be utilized in a wide range of applications. For this, we extended an existing edge-polarity-based match metric to tolerate local shape changes. In an extensive evaluation we showed the applicability of the method for several computer vision scenarios.
References
1. Ladikos, A., Benhimane, S., Navab, N.: A real-time tracking system combining template-based and feature-based approaches. In: International Conference on Computer Vision Theory and Applications (2007)
2. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (2004)
3. Berg, A., Berg, T., Malik, J.: Shape matching and object recognition using low distortion correspondences. In: Conference on Computer Vision and Pattern Recognition, San Diego, CA (2005)
4. Pilet, J., Lepetit, V., Fua, P.: Real-time non-rigid surface detection. In: Conference on Computer Vision and Pattern Recognition, San Diego, CA (2005)
5. Özuysal, M., Fua, P., Lepetit, V.: Fast keypoint recognition in ten lines of code. In: Conference on Computer Vision and Pattern Recognition (2007)
6. Gavrila, D.M., Philomin, V.: Real-time object detection for “smart” vehicles. In: 7th International Conference on Computer Vision, vol. I, pp. 87–93 (1999)
7. Olson, C.F., Huttenlocher, D.P.: Automatic target recognition by matching oriented edge pixels. IEEE Transactions on Image Processing 6, 103–113 (1997)
8. Steger, C.: Occlusion, clutter, and illumination invariant object recognition. In: Kalliany, R., Leberl, F. (eds.) International Archives of Photogrammetry, Remote Sensing, and Spatial Information Sciences, Graz, vol. XXXIV, part 3A, pp. 345–350 (2002)
9. Ulrich, M., Baumgartner, A., Steger, C.: Automatic hierarchical object decomposition for object recognition. In: International Archives of Photogrammetry and Remote Sensing, vol. XXXIV, part 5, pp. 99–104 (2002)
10. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
11. Benhimane, S., Malis, E.: Homography-based 2D visual tracking and servoing. Special Joint Issue IJCV/IJRR on Robot and Vision. The International Journal of Robotics Research 26, 661–676 (2007)
12. Zimmermann, K., Matas, J., Svoboda, T.: Tracking by an optimal sequence of linear predictors. Transactions on Pattern Analysis and Machine Intelligence (to appear, 2008)
Enhancing Boundary Primitives Using a Multiscale Quadtree Segmentation

Robert Bergevin and Vincent Bergeron

Laval University, Department of Electrical and Computer Engineering, Québec, Canada
{bergevin,vbergero}@gel.ulaval.ca
Abstract. A method is proposed to enhance boundary primitives of multi-part objects of unknown specific shape and appearance in natural images. Its input is a strictly over-segmented constant-curvature contour primitive (CCP) map. Each circular arc or straight-line segment primitive from the map has an unknown origin which may be the closed boundary of a multi-part object, the textured or marked region enclosed by that boundary, or the external background region. Five simple criteria are applied in order to weight each contour primitive and eliminate the weakest ones. The criteria are defined on the basis of the superposition of the CCP map on a multiscale quadtree segmentation of the original intensity image. A subjective ground-truth binary map is used to assess the degree to which the final weighted map corresponds to a selective enhancement of the primitives on the object boundary. Experimental results confirm the potential of the method to selectively enhance, in images of variable complexity, actual boundary primitives of natural and man-made multi-part objects of diverse shapes and appearances.
1
Introduction
Delimiting the region occupied by an unknown but interesting object in a static image is both useful and easy for humans. In computer vision, this is still a fundamental problem with no existing general solution. This is particularly notable with complex natural images, where objects of interest appear under variations of shape, illumination, surface texture, viewpoint, and background. A novel generic object detection and localization method was proposed recently [1]. Its main assumption is that objects of properly complex shape are of more interest and should preferably be detected. On that basis, a number of potential object boundaries, defined as ordered groups of primitives from an input map of constant-curvature contour primitives (CCPs), are systematically generated and sorted according to a number of shape grouping criteria and constraints. From a given boundary, the region occupied by the object in the image is easily recovered. Considering an average input map of 400 CCPs and boundaries of around 30 ordered CCPs, the number of possible boundaries is huge, about 10^86. Hence, very efficient pruning was required in order to reduce the number of generated boundaries to a manageable number of about a thousand. Still, images
often took up to many minutes to process. Sorting of generated boundaries was nevertheless quite effective, since the generated boundary most similar to a manual reference was always among the first positions, in absolute or relative terms. Quantitative similarities obtained were from 85% to 100%. Besides, high-ranking boundaries were qualitatively very similar to their manual reference (see Figure 1).
Fig. 1. CCP maps and boundaries. From left to right: binary input map, manual reference boundary, high-ranking boundary, low-ranking boundary, weighted map.
The goal of the method proposed in this paper is to assign a weight to each CCP in the input map that reflects its potential to end up on high-ranking boundaries. If properly done, such a process will enhance CCPs from an object boundary with respect to distractor CCPs in the binary input map. The weighted map obtained could then replace the binary input map in the object detection and localization method. This would likely help reduce both the number of generated boundaries and the time required to generate and rank them. Typically, near 90% of the CCPs of an input map are distractors, either internal texture primitives or external background primitives (see Figure 1(a)). Figure 1(e) presents a weighted CCP map obtained by computing the relative number of times each CCP from the binary input map is used in the thousand or so boundaries generated by the object detection and localization method. The darker a CCP in that weighted map, the more it is used in the generated boundaries. That weighted CCP map could only be computed after all boundaries were generated. In contrast, the method proposed in this paper directly transforms a binary map into a weighted map to be used as input to a more efficient object detection and localization method. A method with a similar goal was proposed recently [2]. The weight of a given contour primitive was computed using a number of criteria measuring the quality of groups made by pairing the primitive with all other primitives in the map. All but one criterion considered only the geometry or topology of the contour primitive pairs. The last criterion considered the coherence in local appearance of the paired primitives. That is, no criterion considered the appearance of the region enclosed by the boundary primitives. The method proposed in this paper precisely takes this type of complementary criterion into account. Since the method is to operate before object (boundary) detection, the actual object region is unknown and evidence for it must be obtained first. More specifically, the binary input CCP map, made up of straight-line segments and circular arcs [3], is superposed on a region map obtained from a low-level segmentation of the original intensity image. From this superposition, five
criteria are applied in order to weight each contour primitive and eliminate the weakest ones. Generic low-level segmentation methods group and select image data according to basic image parameters, irrespective of high-level knowledge as to what constitutes an object of interest [4,5]. Unfortunately, they tend to produce results that suffer from both under-segmentation and over-segmentation. That is, no single region covers the whole object and some regions overlap the object and the image background. The latter is more problematic since the whole object is no longer recoverable by grouping connected regions. By comparison, constant-curvature contour primitives (CCPs) are amenable to strict over-segmentation, which is at the basis of the object detection method in [1]. Raw images are trivially strictly over-segmented since each pixel is either on the object or not. However, pixels are both too numerous and too local to be a primitive of choice for the superposition. The selected trade-off in the proposed method is a quadtree segmentation producing uniform regions of different sizes. In the current implementation, uniformity of regions is related to the statistical deviation of the pixel intensities from their average. Other definitions based on color or texture are also possible, but they are left for future work. In order to still be robust to noise and detailed texture, the CCP map is superposed on a number of image segmentations obtained at different scales. The enhancement problem addressed by the proposed method is related to the figure-ground segmentation problem [6,7]. In both cases, one needs to identify which portions of an image contain important information and which are only distractive to the ultimate processing goal. Whereas generic low-level segmentation methods suffer from both under-segmentation and over-segmentation, specific high-level segmentation methods are more likely to extract significant segments of the image, given their specialization to known objects of interest [8,9,10]. While much progress has been made on high-level methods recently, they still impose specific constraints on the object pose or appearance, which is not in line with our goal. Recent model-based object detection methods are also quite powerful in finding discriminating features of objects [11,12]. However, they need a training phase specific to each object category. Finally, a number of generic perceptual grouping methods directly attempt to extract closed object boundaries [13,14,15]. A comparative study of [13] and two competing methods was made by Wang et al. [16]. For natural images of animals, the optimal contour always had a simple near-convex shape not representative of the animal shape. This limitation to near-convex shapes, also typical of previous saliency-based methods [17], is not appropriate for our problem. The following section describes the proposed method in more detail. Experimental results are presented in Section 3. A final section concludes the paper.
2
Proposed Method
As explained earlier, the main innovation of the proposed method is a superposition of the binary input CCP map on a quadtree segmentation of the original intensity image. Pixels forming quadtree regions are meant to have similar intensities. Five segmentations are obtained, from fine to coarse, using five related
thresholds controlling whether a given quadtree region is decomposed into four smaller regions. The finest segmentation among the first three with a number of regions significantly higher than the number of regions in the next coarser segmentation is selected as the main quadtree segmentation. The values of five criteria are combined into a weight for each CCP of the binary input map. The values of the first four criteria are computed using a superposition of the CCP with the main quadtree segmentation. The value of the last criterion is based on a superposition with the next two coarser segmentations. Constraints, corresponding to limit values for the criteria, are enforced to directly eliminate some CCPs from the output map. Five simple and complementary criteria were selected. A short description of each criterion is given next. Thresholds and parameters are determined empirically by maximizing quantitative performance values for a representative set of images.

Local Coherence is the product of the CCP length and the number of quadtree regions it intercepts. The idea is to verify that the CCP is on the border between the region enclosed by the boundary and the external background region. If so, the CCP should intercept a large number of small quadtree regions. Only the best one-third of the CCPs are retained.

Contrast is the difference between the average intensities on the two sides of the CCP. On each side, the average intensity of the quadtree regions intercepted by a band parallel to the CCP is used. A CCP with too small a contrast is eliminated.

Salience is the difference between the average size of the quadtree regions intercepted by the CCP and the average size of the quadtree regions on either side. On each side, the average size of the quadtree regions intercepted by a band parallel to the CCP is used. A CCP with too small a salience is eliminated.

Global Coherence is a binary criterion that obtains a true value if and only if the CCP intercepts a computed main region. Computation of the main region is explained in Section 2.4.

Scale Coherence is a ternary criterion that obtains a true or doubly true value when the CCP is among the best 20% of CCPs in one or two of the additional coarser segmentations, respectively. For each coarser segmentation, the CCPs are sorted according to their local coherence, as long as they are not eliminated due to their poor contrast or salience.

The weight of a CCP is its local coherence, doubled when it is globally coherent and further multiplied by 1.5 when it is scale coherent or by 2.25 when it is doubly scale coherent. Eliminated CCPs have zero weight. For example, the binary input map in Figure 4 has 552 CCPs. Local coherence eliminates 370 CCPs (two-thirds), contrast further eliminates 49 CCPs, and salience eliminates 20 CCPs. The number of CCPs in the output map is thus 113, about 20% of the original. Prior to evaluating the five criteria for each CCP in the input map, a number of computations are performed, as described in the following sections.
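The weight combination described above can be written compactly (a sketch that assumes the criterion values have been computed beforehand; the multipliers follow the text):

```python
def ccp_weight(local_coherence, eliminated, globally_coherent, scale_coherence):
    """Combine the criteria into the final CCP weight.

    local_coherence   : value of the Local Coherence criterion
    eliminated        : True if the CCP failed the local coherence,
                        contrast, or salience constraints
    globally_coherent : value of the binary Global Coherence criterion
    scale_coherence   : 0, 1, or 2 (not, singly, or doubly scale coherent)
    """
    if eliminated:
        return 0.0
    weight = local_coherence
    if globally_coherent:
        weight *= 2.0
    if scale_coherence == 1:
        weight *= 1.5
    elif scale_coherence == 2:
        weight *= 2.25
    return weight
```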
2.1
Quadtree Segmentation
Since images have arbitrary dimensions, quadtree regions are rectangular instead of square. Otherwise, the classical quadtree segmentation algorithm is used, starting with the complete image and recursively dividing it into four regions until either the standard deviation of the pixel intensities in a region is smaller than a given threshold or a region has fewer than four pixels. Let us define Delta_1 = 20% · S_total and Delta_2 = 10% · S_total, where S_total is the standard deviation of the pixel intensities for the whole image. The five segmentations are obtained using the following thresholds: S_total − Delta_1, S_total − Delta_2, S_total, S_total + Delta_2, and S_total + Delta_1. The pixel intensities are obtained from the three channels of the color image using Y = 0.299 R + 0.587 G + 0.114 B. Quadtree segmentations at five different scales are presented in Figure 2.
Fig. 2. From left to right: the original intensity image and five quadtree segmentations, with 6688, 3745, 2389, 1501, and 1072 rectangular regions, respectively.
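The recursive splitting rule described above can be sketched as follows (Python/NumPy; the region representation as (top, left, height, width) tuples is an illustrative choice, not the authors' data structure):

```python
import numpy as np

def quadtree_regions(gray, std_thresh, min_pixels=4):
    """Recursively split the image into rectangular regions until the
    intensity standard deviation falls below std_thresh or a region
    has fewer than min_pixels pixels. Returns (top, left, height, width) tuples."""
    regions = []

    def split(top, left, h, w):
        if h == 0 or w == 0:
            return
        block = gray[top:top + h, left:left + w]
        if block.size < min_pixels or float(block.std()) < std_thresh:
            regions.append((top, left, h, w))
            return
        h2, w2 = h // 2, w // 2
        split(top, left, h2, w2)
        split(top, left + w2, h2, w - w2)
        split(top + h2, left, h - h2, w2)
        split(top + h2, left + w2, h - h2, w - w2)

    split(0, 0, gray.shape[0], gray.shape[1])
    return regions

# The five scales use thresholds derived from the global standard deviation S:
# S - 0.2*S, S - 0.1*S, S, S + 0.1*S, S + 0.2*S, with S = gray.std().
```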
2.2
Region Intersections
Each of the five criteria needs a list of the quadtree regions intercepted by a given CCP. The intersections between each CCP and the quadtree regions are computed only once and stored in a look-up table. In order to compute the intersections, each quadtree region is represented by four straight-line segments, its four sides. Intersections are computed between a CCP, a straight-line segment or a circular arc, and each side. There are three possible cases: the CCP is fully contained in the region, the CCP intersects one or more sides, or the CCP is outside the region. The first two cases correspond to a CCP intercepting the quadtree region.
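For a straight-line CCP, the three cases can be distinguished with elementary geometry (a sketch for the segment case only; circular arcs additionally require an arc–side intersection test, and the (top, left, height, width) region convention follows the quadtree sketch above):

```python
def _ccw(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def _segments_intersect(p1, p2, p3, p4):
    """True if segment p1p2 strictly crosses segment p3p4 (collinear touching ignored)."""
    d1, d2 = _ccw(p3, p4, p1), _ccw(p3, p4, p2)
    d3, d4 = _ccw(p1, p2, p3), _ccw(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def segment_vs_region(p1, p2, top, left, h, w):
    """Classify a straight-line CCP against a rectangular quadtree region:
    'contained', 'intersects', or 'outside'. Points are (x, y) with x horizontal."""
    def inside(p):
        return left <= p[0] <= left + w and top <= p[1] <= top + h
    if inside(p1) and inside(p2):
        return "contained"
    corners = [(left, top), (left + w, top), (left + w, top + h), (left, top + h)]
    sides = list(zip(corners, corners[1:] + corners[:1]))
    if inside(p1) or inside(p2) or any(_segments_intersect(p1, p2, a, b) for a, b in sides):
        return "intersects"
    return "outside"
```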
2.3
Bands of Parallel CCPs
On each side of a CCP, two parallel CCPs form a band to be used in computing the contrast and salience criteria. The distance between the CCPs is 5 pixels. When the original CCP is a circular arc, the CCPs in the band have different radii. 2.4
Main Region
The main region is a coarse object segmentation obtained using a region growing process on the main quadtree segmentation. In the present implementation, the main region is simply grown by recursively adding connected regions, starting from a small region seed. An empirical limit of 30 pixels is imposed on the area of
an added region. The seed region has the smallest area in the quadtree. Once the first region is grown, another seed region is selected from the remaining regions, as long as its area is smaller than 30 pixels. A second region is grown from the new seed, and so on until all remaining regions are larger than 30 pixels. The largest grown region is retained as the main region. On most tested images, the main region covers the object at least in part. Future implementations of the proposed method should improve the main region extraction.
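A sketch of the seeded region-growing loop (the adjacency test between rectangular regions is assumed to be available as a helper; whether the largest grown region is measured by total area or by region count is not specified, so total area is used here):

```python
def grow_main_region(regions, areas, adjacent, area_limit=30):
    """Grow candidate regions from small seeds and return the largest one.

    regions    : list of quadtree region identifiers
    areas      : dict mapping region id -> area in pixels
    adjacent   : function (r1, r2) -> True if the two regions are connected
    area_limit : only regions up to this area may be added (30 px in the paper)
    """
    remaining = set(regions)
    grown = []
    while True:
        seeds = [r for r in remaining if areas[r] < area_limit]
        if not seeds:
            break
        seed = min(seeds, key=lambda r: areas[r])   # smallest remaining region
        cluster, frontier = {seed}, [seed]
        remaining.discard(seed)
        while frontier:
            current = frontier.pop()
            for r in list(remaining):
                if areas[r] <= area_limit and adjacent(current, r):
                    cluster.add(r)
                    remaining.discard(r)
                    frontier.append(r)
        grown.append(cluster)
    # The largest grown region (by total area) is retained as the main region.
    return max(grown, key=lambda c: sum(areas[r] for r in c)) if grown else set()
```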
2.5
Algorithm
The complete primitive enhancement algorithm consists in the following steps:
1. compute the five quadtree segmentations from the intensity image.
2. select the main segmentation.
3. compute the local coherence, the contrast, and the salience of each CCP.
4. eliminate the CCPs with smaller local coherence.
5. eliminate the CCPs with too small contrast.
6. eliminate the CCPs with too small salience.
7. compute the main region.
8. compute the global coherence and the scale coherence of each retained CCP.
9. compute the weight of each retained CCP in the output map.
3
Experimental Results
Ten images of variable complexity were tested. They are displayed in Figure 3. Images c, f, and g are considered simple since they have limited background structure and texture. Images a, e, h, i, and j are considered complex since they have normal texture and background. Finally, images b and d are considered difficult, mainly because of the poor contrast between the object and the background in some areas. Experiments used a fixed set of parameter values.
Fig. 3. Ten tested images (a)–(j), with 552, 585, 188, 745, 229, 640, 178, 304, 243, and 165 CCPs in their respective input maps.
Two types of qualitative results are presented. A detailed result for a complex image is presented first, showing maps at different intermediate steps. Then, a final result is presented for one sample image in each category of simple, complex, and difficult images. In this case, the binary input CCP map, the manual reference, and the thresholded output map are shown. Finally, a table of quantitative performance values is presented for the ten images.
3.1
Qualitative Results
A detailed result is presented in Figure 4. Each criterion is applied in turn and the binary map showing the retained CCPs is displayed. For criteria 4 and 5, maps before and after the application of the criterion are thresholded to the same number of CCPs in order to show the ranking differences. Sample results for simple, complex, and difficult images are shown in Figure 5.
Fig. 4. Sequence of binary maps obtained by applying the five criteria. For criteria 4 and 5, input and output maps are thresholded at 60 CCPs and 20 CCPs, respectively. Circular arcs are red and straight-line segments are blue in the binary input map.
Fig. 5. Input, reference, and thresholded output maps for three sample images: (a) simple, (b) complex, (c) difficult.
3.2
Quantitative Results
Table 1 presents the precision (PRE) and recall (REC) values for thresholded output maps. The number of CCPs in the input map (CCP), the reference map (SGT), and the thresholded output map (OUT) are indicated for each image. The precision and recall values are computed using a unit weight for each CCP, irrespective of its size and actual computed weight. The equal-error-rate (EER) and area-under-curve (AUC) values are computed from a recall versus precision curve (RPC), where the number of retained CCPs in the output map varies from one to the number obtained after applying the first three criteria. When CCPs included in the reference map are eliminated, the maximum recall value is less than 100%. For this reason, the RPC curve and the associated performance values are more challenging than usual in object detection or image retrieval.

Table 1. Quantitative performance values. The first three columns indicate the number of CCPs in the input, reference, and thresholded output maps, respectively. The last four columns indicate the performance values expressed as percentages. Precision and recall are obtained from thresholded output maps. Equal-error-rate and area-under-curve are obtained from a recall versus precision curve. Higher values are better.

Image  CCP  SGT  OUT  PRE  REC  EER  AUC
a      552   50   60   73   60   65   66
b      585   33   45   31   40   32   17
c      188   32   27   80   84   84   78
d      745   45   55   38   40   41   32
e      229   33   32   84   77   84   71
f      640   44   44   72   72   72   63
g      178   30   28   70   80   70   69
h      304   41   43   46   45   45   41
i      243   32   29   62   52   55   45
j      165   29   30   40   35   31   20
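The precision and recall entries follow the unit-weight definition described above (a minimal sketch; the EER and AUC values additionally require sweeping the output threshold along the recall versus precision curve):

```python
def precision_recall(output_ccps, reference_ccps):
    """Precision and recall of a thresholded output map against the
    subjective ground-truth (reference) map, with unit weight per CCP."""
    output, reference = set(output_ccps), set(reference_ccps)
    true_positives = len(output & reference)
    precision = true_positives / len(output) if output else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```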
The computing time for the complete algorithm, excluding the generation of the binary input map, varies depending on the input map size and the image contents. For the ten tested images, the range is from about half a minute to a minute and a half with a Visual.NET C# implementation running on an IBM ThinkPad with a 2.0 GHz Intel Pentium M processor and 1GB of RAM.
4
Conclusion
The goal was to assign a weight to each CCP in a binary input map according to its potential to be on an object boundary of proper complexity. Typically, near 90% of the CCPs of an input map are distractors, either internal texture primitives or external background primitives. Given that objects in tested images have unknown shape and appearance, only generic elimination and weighting
criteria could be proposed for CCPs. The weight of each retained CCP was computed by taking into account the coherence between a quadtree segmentation of the original image and the input map. The five simple criteria used to combine region and contour segmentations are the main contribution of the method. Results obtained on images of variable complexity are significant. Even for image b, which obtained the worst quantitative performance values, the retained CCPs with the highest final weights provide a useful starting point for an object detection algorithm, as one may conclude by comparing the maps in Figure 5(c). In general, the quality of the results is as expected from the intuitive categorization of the images as simple, complex, or difficult. Excellent results are obtained for all simple images. Results for most complex images are at least good. Image j did not perform well, but this might partly be an artifact of the evaluation process. Indeed, most of the longest boundary primitives are retained with a high weight, along with a number of small distractors. A thresholded output map provides a clear qualitative improvement with respect to the input map. Finally, performance values for difficult images were low, but the resulting weighted maps are nevertheless an improvement, as just mentioned. Various techniques for generic segmentation and object detection were proposed recently, with standard image datasets, e.g., the Berkeley Segmentation Dataset and the PASCAL Visual Object Classes, used for comparison. Given that our method is only a preprocessing step and that it considers a very generic object category with no supervised training, it is more appropriate to compare it to grouping methods. Unfortunately, no standard image dataset was proposed for them. Some of our test images are from the Berkeley Dataset. Typically, salient boundaries extracted by state-of-the-art methods are of simpler shapes. A number of changes could be made to the proposed method in order to improve its performance. A study of the relative importance of the five criteria could result in a different algorithm with, for instance, a different way to combine the criteria, the addition of new criteria, or the removal of the less useful ones. As mentioned earlier, an improved segmentation algorithm adding color or texture uniformity could also improve the results, especially for the difficult images. Similarly, regions of arbitrary shapes could be better adapted to other types of criteria. Finally, a fusion of the weighted output map with the map obtained by a complementary perceptual grouping method [2] would likely improve the respective performances of the two methods.
Acknowledgment. This work is supported by an NSERC discovery grant.
References
1. Bernier, J.-F., Bergevin, R.: Generic Detection of Multi-Part Objects. In: Proc. of the 18th International Conference on Pattern Recognition (ICPR 2006), Hong Kong, China (2006)
2. Bergevin, R., Filiatrault, A.: Enhancing Contour Primitives by Pairwise Grouping and Relaxation. In: Kamel, M.S., Campilho, A. (eds.) ICIAR 2007. LNCS, vol. 4633, pp. 222–233. Springer, Heidelberg (2007)
3. Mokhtari, M., Bergevin, R.: Generic Multi-Scale Segmentation and Curve Approximation Method. In: Proc. of the Third International Conference on Scale-Space and Morphology in Computer Vision, Vancouver, Canada, pp. 000–007 (2001)
4. Martin, D.R., Fowlkes, C.C., Malik, J.: Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 530–549 (2004)
5. Lau, H.F., Levine, M.D.: Finding a small number of regions in an image using low-level features. Pattern Recognition 35, 2323–2339 (2002)
6. Hérault, L., Horaud, R.: Figure-Ground Discrimination: A Combinatorial Optimization Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 899–914 (1993)
7. Grigorescu, C., Petkov, N., Westenberg, M.A.: Contour and boundary detection improved by surround suppression of texture edges. Image and Vision Computing 22, 609–622 (2004)
8. Yu, S.X., Shi, J.: Object-Specific Figure-Ground Segregation. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2003), Madison, WI, pp. 39–45 (2003)
9. Arora, H., Loeff, N., Forsyth, D.A., Ahuja, N.: Unsupervised Segmentation of Objects using Efficient Learning. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis, Minnesota (2007)
10. Borenstein, E., Ullman, S.: Class-Specific, Top-Down Segmentation. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 109–122. Springer, Heidelberg (2002)
11. Kushal, A., Schmid, C., Ponce, J.: Flexible object models for category-level 3d object recognition. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis, Minnesota (2007)
12. Savarese, S., Fei-Fei, L.: 3D generic object categorization, localization and pose estimation. In: Proc. of the 11th International Conference on Computer Vision (ICCV 2007), Rio de Janeiro, Brazil (2007)
13. Elder, J.H., Zucker, S.W.: Computing Contour Closure. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 399–412. Springer, Heidelberg (1996)
14. Jacobs, D.W.: Robust and Efficient Detection of Salient Convex Groups. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 23–37 (1996)
15. Estrada, F.J., Jepson, A.D.: Perceptual Grouping for Contour Extraction. In: Proc. of the 17th International Conference on Pattern Recognition (ICPR 2004), pp. 32–35 (2004)
16. Wang, S., Wang, J., Kubota, T.: From Fragments to Salient Closed Boundaries: An In-Depth Study. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2004), Washington, DC, pp. 291–298 (2004)
17. Mahamud, S., Williams, L.R., Thornber, K.K., Xu, K.: Segmentation of Multiple Salient Closed Contours from Real Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 433–444 (2003)
3D Object Modeling and Segmentation Based on Edge-Point Matching with Local Descriptors

Masahiro Tomono

Chiba Institute of Technology, Narashino, Chiba 275-0016, Japan
[email protected]
Abstract. 3D object modeling is a crucial issue for environment recognition. A difficult problem is how to separate objects from the background clutter. This paper presents a method of 3D object modeling and segmentation from images for specific object recognition. An object model is composed of edge points which are reconstructed using a structure-from-motion technique. A SIFT descriptor is attached to each edge point for object recognition. The object of interest is segmented by finding the edge points which co-occur in images with different backgrounds. Experimental results show that the proposed method creates detailed 3D object models successfully. Keywords: Object Modeling, Segmentation, Object Recognition.
1
Introduction
The recognition of objects in 3D space from camera images is indispensable for an autonomous system or robot to perform tasks in the real world, e.g., finding and manipulating objects. 3D object models are crucial for this purpose. A promising approach to 3D object modeling for recognition is the one based on 3D geometric models combined with 2D local descriptors [12,16,18]. This approach provides robust 2D recognition and precise 3D localization using 2D–3D combined object models. 3D geometric models are created using 3D reconstruction methods such as structure-from-motion techniques. However, the object modeling process is still problematic. A difficult problem is how to separate objects from the background clutter. Techniques have been proposed for 2D object segmentation so far, but 3D object segmentation has yet to be developed. This paper presents a method of 3D object segmentation to create a 3D object model from images for specific object recognition. We create a scene model from images, reconstructed using a structure-from-motion technique. The scene model is composed of edge points. A SIFT descriptor [7] is attached to each edge point for object recognition. Since the scene model contains both the target object and background, we separate the target object from the background using training images with different backgrounds, by extracting the edge points which co-occur in the images. In experiments, we found that target objects were segmented well using a small number of training images.
The contribution of this paper is an edge-point-based method of building and segmenting a 3D object model. The proposed method provides the detailed shape of the target object since we employ edge points to represent it. Moreover, our approach achieves robust recognition for non-textured objects, since many edge points can be detected even on non-textured objects, while corner points or blobs cannot be detected in sufficient numbers.
2
Related Work
Recently, partially invariant features with discriminative descriptors have been studied for robust object recognition [7,9]. Similarity-invariant or affine-invariant features provide robust 2D recognition. SIFT features [7] are the most practical ones from the point of view of discriminative power and processing speed. Interest points are usually detected using the DoG (Difference of Gaussian) or Harris operators. The integration of such 2D features and 3D object models has been studied for 3D object recognition [12,16]. Edge points with SIFT descriptors have also been utilized for 2D object recognition [10] and 3D object recognition [18]. However, no segmentation methods have been proposed in these studies. Object segmentation methods have been developed thanks to highly discriminative features [5,8,13,11]. Object regions can be extracted by clustering the features that co-occur in images. For example, if the target object is contained in two images with different backgrounds, it can be extracted by finding the features which appear in both images. Many of the object segmentation methods aim at category recognition, and their main goal is learning object categories to discover and segment a large variety of objects in 2D images. The precise segmentation of specific 3D objects in real environments is another important issue, especially for real-world applications such as robotics.
3
Problem Statements
Our goal is that a robot learns specific objects (not object categories) from images while moving in real environments. A typical scenario is that the robot captures images of the scene to discover objects by itself, and that it builds the 3D models of the objects from the images. In this paper, however, we focus on the process of building object models from given images. The major problem here is how to segment the target object from the background, which contains many other objects. This is critical for object modeling in the real environment. The difficulty of this problem is that the system has no knowledge about the target object, since the system is learning it at that very time. Therefore, it is impossible to segment the target object from the background using only one image sequence if all the objects in the scene are stationary. To solve this problem, we need additional images which contain the same target object and different backgrounds. We refer to these images as training images. As mentioned above, the target object can be detected by finding the features which co-occur in images with different backgrounds. Spatial constraints can
Fig. 1. The procedure of object segmentation
be utilized to eliminate false feature matches. We utilize edge points to represent detailed object shape. Since edge points are not as discriminative as corner points, we perform feature matching between images via a 3D model in order to cope with the distortion caused by perspective projection. The basic procedure of our approach is shown in Fig. 1. We create a scene model from an image sequence. The scene model contains both the target object and the background. By matching the scene model and training images, we separate the target object from the background. As this procedure is performed, the background edge points in the scene model decrease, and only the foreground edge points, i.e., those of the target object, finally remain.
4
Scene Modeling
A scene model consists of a 3D model and 2D models. The 3D model is composed of the 3D points reconstructed from edge points in images. The 2D models are used to recognize images captured from various viewpoints based on an appearance-based approach. 4.1
3D Model
The 3D model is built using a structure-from-motion technique. First, we find feature correspondences between images using the KLT tracker [14]. Note that we utilize corner points at this stage. Then, we reconstruct the camera motion from the features using the factorization method [17]. We assume the camera internal parameters are known, and find only the external parameters (x, y, z, roll, pitch, yaw). Since the factorization method provides an approximate reconstruction, the camera motion is refined using a non-linear minimization technique [4]. After obtaining the camera motion, we reconstruct edge points, which are extracted from each image using the Canny detector [2]. We find edge-point correspondences between images based on epipolar geometry and patch matching around edge points. Based on the camera motion and the edge point correspondences, 3D edge points are reconstructed by a triangulation method.
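As an illustration of the final step, a standard linear triangulation of one matched edge point from two views might look as follows (a generic sketch, not necessarily the triangulation method used by the author):

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence.

    P1, P2 : 3x4 camera projection matrices (internal * external parameters)
    x1, x2 : (x, y) image coordinates of the matched edge point in each view
    Returns the 3D point in non-homogeneous coordinates.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]               # null space of A (smallest singular value)
    return X[:3] / X[3]
```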
4.2
2D Model
A 2D model consists of an image in the input image sequence, edge points in the image, and the camera pose from which the image was taken. We refer to the image of a 2D model as the model image. The camera pose is estimated at the 3D modeling stage mentioned above. For edge-point matching, each edge point has a SIFT descriptor, which consists of orientation histograms over 4 × 4 sample regions around the edge point [7]. The SIFT descriptor is invariant to small viewpoint changes, and thus we create 2D models for images extracted at every 30 [deg] in the camera angle space in order to recognize images captured from various viewpoints. A scale space analysis is necessary in order to detect edge points invariant to scale change [6]. The obtained scale of each edge point is proportional to the size of the object in the image, and so it is used to determine the size of the region in which the SIFT descriptor is calculated. Since the scale is proportional to the object size in the image, the SIFT descriptor is invariant to object size. Using the scale-invariant descriptors, edge-point matching is performed robustly when the object size in the training image is different from that in the model image.
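A hedged sketch of attaching SIFT descriptors to Canny edge points with OpenCV follows (the fixed keypoint size and Canny thresholds are placeholder values; the paper derives the descriptor region size from the scale-space analysis instead):

```python
import cv2
import numpy as np

def edge_point_descriptors(gray, keypoint_size=16):
    """Detect Canny edge points in an 8-bit grayscale image and compute
    a SIFT descriptor at each one.

    keypoint_size is a placeholder; the paper uses the scale obtained from
    the scale-space analysis of each edge point instead of a fixed size.
    """
    edges = cv2.Canny(gray, 50, 150)                 # example thresholds
    ys, xs = np.nonzero(edges)
    keypoints = [cv2.KeyPoint(float(x), float(y), keypoint_size)
                 for x, y in zip(xs, ys)]
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(gray, keypoints)
    return keypoints, descriptors
```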
4.3
Correspondence between 3D Model and 2D Models
We find correspondences between edge points in the 2D models and edge points in the 3D model. As mentioned above, we create 2D models at every 30 [deg] in the camera angle space. For each 2D model, we reproject the 3D model onto the model image using the camera pose estimated at the 3D modeling stage. Then, we create a pair of a 3D edge point and the 2D edge point which is the closest to the reprojection of the 3D edge point onto the model image.
5 3D Object Segmentation
5.1 Procedure
We create an object model from a scene model by eliminating the background using training images which have different backgrounds. The procedure for one training image is as follows; the following sections explain each step in detail.
(1) 2D matching between the scene model and the training image. In the manner described in Section 4.2, we detect edge points from the training image. Then, we perform 2D recognition by matching edge points between the 2D models and the training image using the SIFT descriptors. The large cluster of matched edge points roughly represents the target object.
(2) 3D matching between the scene model and the training image. We estimate the camera pose from which the training image was captured by minimizing the reprojection errors of the 3D model onto the training image.
(3) Calculate the scores of edge points. We calculate the selection score of each edge point in the scene model based on the SIFT matching score and the distance between the edge point in the training image and the reprojection of the 3D edge point.
Repeating this process for each training image, the scene model is refined into the object model of the target object. The object model is composed of the edge points whose selection score is larger than a threshold.
5.2 2D Matching between the Scene Model and a Training Image
We perform 2D recognition to find the target object roughly in the training image by matching edge points between the training image and the 2D models of the scene model. The SIFT descriptor of each edge point is utilized for nearest neighbor indexing [1]. This indexing scheme efficiently retrieves the edge points in the 2D models which match edge points in the training image. Then, we count the number of matched edge points in each 2D model, and choose the 2D models having a large number of matched edge points as candidate matches. These candidate 2D models contain false correspondences, many of which are caused by edge points in the background coincidentally matching those in the training image. We eliminate them using a geometric constraint: the matched edge points in the training image must be located consistently with the shape of the target object. For this purpose, we employ a pose clustering technique in terms of a similarity transform. For each pair of edge points matched between the training image and the selected 2D model, we calculate the similarity transform parameters and vote in the parameter space. Then, the 2D model which has the largest cluster in the voting space is selected as the best candidate. We select the edge-point pairs consistent with the largest cluster. The selected 2D model and the consistent edge-point pairs are the output of this stage.
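The following sketch illustrates the pose-clustering idea with a coarse Hough vote over similarity-transform parameters. It is not the author's implementation: the use of per-match scale and orientation attributes to derive the transform, the bin sizes, and the function name are all assumptions.

import numpy as np
from collections import defaultdict

def vote_similarity_clusters(matches, scale_bin=0.5,
                             angle_bin=np.radians(30), trans_bin=32.0):
    """Coarse Hough voting over similarity-transform parameters.

    `matches` is a list of (model_pt, train_pt) pairs, where each point is
    (x, y, scale, orientation).  Each match casts one vote for the
    (log-scale, rotation, tx, ty) it implies; the matches consistent with
    the largest cluster are returned.
    """
    bins = defaultdict(list)
    for i, ((xm, ym, sm, om), (xt, yt, st, ot)) in enumerate(matches):
        s = st / sm                       # scale ratio model -> training image
        theta = (ot - om) % (2 * np.pi)   # rotation difference
        c, sn = np.cos(theta), np.sin(theta)
        # translation implied by mapping the model point onto the training point
        tx = xt - s * (c * xm - sn * ym)
        ty = yt - s * (sn * xm + c * ym)
        key = (int(np.floor(np.log2(s) / scale_bin)),
               int(np.floor(theta / angle_bin)),
               int(np.floor(tx / trans_bin)),
               int(np.floor(ty / trans_bin)))
        bins[key].append(i)
    best = max(bins.values(), key=len)
    return [matches[i] for i in best]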
5.3 3D Matching between the Scene Model and a Training Image
We cannot obtain all of the correct edge-point matches in 2D matching, since the object shapes in the 2D models and the training images differ due to viewpoint changes. To increase the number of edge-point matches, we perform precise matching using the 3D model. In this process, the camera pose from which the training image was captured is estimated by reprojecting the 3D scene model onto the training image. First, we find the correspondences of edge points between the 3D model and the training image. This is easily done since we already have the correspondences of edge points between the 3D model and the 2D model, as mentioned in Section 4.3, and we also have the correspondences of edge points between the 2D model and the training image from 2D matching. Then, we calculate the camera pose so that the average reprojection error of the 3D edge points onto the training image is minimized. This is a non-linear minimization problem, and it is solved using a gradient descent method. The initial value given to the method is the
camera pose of the selected 2D model. We employ RANSAC [3] to eliminate false correspondences.
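A minimal sketch of the pose refinement step is given below, using SciPy's least-squares solver in place of the plain gradient descent mentioned above; the axis-angle pose parameterization and function names are assumptions, and the RANSAC outlier rejection is not shown.

import numpy as np
import cv2
from scipy.optimize import least_squares

def refine_camera_pose(K, pts3d, pts2d, rvec0, tvec0):
    """Refine a camera pose by minimizing edge-point reprojection error.

    pts3d : (M, 3) matched 3D edge points; pts2d : (M, 2) image edge points.
    rvec0, tvec0 : initial pose (e.g. the pose of the selected 2D model).
    """
    def residuals(params):
        rvec, tvec = params[:3], params[3:]
        proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
        return (proj.reshape(-1, 2) - pts2d).ravel()

    x0 = np.hstack([rvec0.ravel(), tvec0.ravel()])
    result = least_squares(residuals, x0, method="lm")
    return result.x[:3], result.x[3:]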
5.4 Selection Score of Edge Points
Based on the estimated camera pose, we find the correspondences of edge points between the scene model and the training image. Since a single training image is not enough to complete the object model, we employ a scoring scheme for selecting the edge points of the target object.
Let E be the set of 2D edge points detected from training image I, and P be the set of 3D edge points in the scene model. We denote the camera pose estimated for I by x = (x, y, z, roll, pitch, yaw). Let r_j be the reprojection of p_j ∈ P onto I under x, and q_j be the edge point in the selected 2D model that corresponds with p_j; q_j is obtained as described in Section 4.3. We can refine P by eliminating background edge points using E if training image I contains the target object. There are two factors for checking the correspondence between e_i ∈ E and p_j ∈ P.
– The locations of e_i and r_j. Because of errors in the camera pose x, r_j will not be reprojected exactly onto the true location. The distance between e_i and r_j is one factor of the matching score.
– The SIFT descriptors of e_i and q_j. The distance between the SIFT descriptors of e_i and q_j is used to eliminate false correspondences.
We define a score indicating that the 3D point p_j belongs to the target object:

g(p_j, E) = \begin{cases} 1, & \text{if } d_1(e_i, r_j) \le th_1 \wedge d_2(e_i, q_j) \ge th_2 \text{ for some } e_i \in E \\ 0, & \text{otherwise} \end{cases}   (1)

Here, d_1(e_i, r_j) is the Euclidean distance between e_i and r_j, and d_2(e_i, q_j) is the normalized correlation between the SIFT descriptors of e_i and q_j. th_1 and th_2 are thresholds, which are determined empirically; in our implementation, th_1 = 2 [pixel] and th_2 = 0.8.
A single training image is not enough for accurate segmentation, since part of the target object might be invisible in some training images if that part is occluded or outside of the frame. Moreover, a certain number of false matches are coincidentally generated by background clutter even when we evaluate the score function g using the SIFT descriptors. To cope with these problems, we integrate the scores over multiple training images. We define the selection score G by accumulating g over a set of training images {I_n} (n = 1 to N), where E_n is the set of edge points detected in I_n:

G(p_j) = \sum_{n=1}^{N} g(p_j, E_n).   (2)

The 3D object model is composed of the 3D edge points such that G(p_j) ≥ th_3.
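The scoring of Eqs. (1) and (2) can be sketched as follows. This is an illustrative sketch only: it approximates the existential check over e_i in Eq. (1) by the nearest edge point, and it assumes L2-normalized SIFT descriptors so that a dot product serves as the normalized correlation.

import numpy as np
from scipy.spatial import cKDTree

def score_model_points(reproj, desc3d, train_pts, train_desc,
                       th1=2.0, th2=0.8):
    """Per-image score g(p_j, E) of Eq. (1) for every 3D edge point.

    reproj     : (P, 2) reprojections r_j of the 3D edge points onto the image.
    desc3d     : (P, D) SIFT descriptors q_j associated with the 3D points.
    train_pts  : (M, 2) edge points e_i detected in the training image.
    train_desc : (M, D) their SIFT descriptors (L2-normalized).
    """
    tree = cKDTree(train_pts)
    dist, idx = tree.query(reproj)                   # nearest e_i to each r_j
    corr = np.sum(desc3d * train_desc[idx], axis=1)  # normalized correlation
    return ((dist <= th1) & (corr >= th2)).astype(int)

def select_object_points(score_lists, th3=5):
    """Accumulate g over training images (Eq. (2)) and threshold with th3."""
    G = np.sum(score_lists, axis=0)
    return G >= th3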
Finally, we reproject the selected 3D edge points onto each model image, and find the 2D edge points which are matched with the 3D edge points. The 2D models of the target object are created using these 2D edge points. The generated object model consists of a 3D model and 2D models similarly to the scene model, and can be utilized for the recognition of the target object.
6 Experiments
6.1 Scene Modeling
Fig. 2 shows an example of a 3D scene model. Fig. 2 (a) shows a subset of the input images; 60 images were captured by a human. The image size is 320 × 240. Fig. 2 (b) shows the scene model generated from the image sequence. The target object is a desk, but the scene model also contains objects such as boxes, stationery, and other pieces of furniture. This scene model is composed of about 14,000 edge points. As shown in the figures, edge points can represent the detailed shape of the object. The drawback of the edge-point reconstruction is that it can generate a lot of noise due to false matches, because edge points are not as discriminative as corner points. This noise is reduced significantly in the object segmentation process.
6.2 Object Segmentation
Fig. 3 shows the process of object segmentation. The target object is a desk; 10 desks of this type are in our laboratory, so we can easily obtain images of them with different backgrounds. The top row of the figure shows the matching results between a scene model and a training image. Fig. 3 (b) shows the edge points (gray dots) in the model image which are matched by 2D matching. In Fig. 3 (d), the 3D scene model (white dots) is reprojected onto the training image. Since the 3D matching is based on a perspective transform, the target object of the 3D scene model is matched well with the training image. On the other hand, the other objects around the desk in the scene model are not matched with the training image.
Fig. 2. 3D scene model reconstructed from an image sequence
Fig. 3. Experimental result. The target object is a desk. Top: Matching of a scene model with a training image. Bottom: Segmentation result.
Fig. 4. Segmentation result. The target object is a mobile robot.
Fig. 5. Segmentation result. The target object is a small box.
Fig. 3 (e) shows the scene model again, which is depicted in Fig. 2. Fig. 3 (f) shows the 3D object model separated using one training image. As can be seen, the background edge points are not completely eliminated and some edge points are lost when only one training image is used. Fig. 3 (g) shows the 3D object model separated using 10 training images with threshold th_3 = 5 for G(p_j)
Fig. 6. Left: the relationships between th3 and the number of the true edge points. Middle: the relationships between th3 and the number of the false edge points. Right: recall-precision curves for edge point matching.
in Eq. (2). In this case, the background edge points are reduced and the target object is segmented clearly. Fig. 4 and Fig. 5 show the 3D models of a mobile robot and a small box, respectively. Fig. 6 shows the accuracy of the proposed method. The left graph depicts the relationship between th_3 and the number of true edge points in the obtained 3D model. For each object, we segmented 3D models using 10 training images with th_3 = 0 to 10. The middle graph depicts the relationship between th_3 and the number of false edge points in the 3D model. The right graph shows the recall-precision curves. We define precision and recall for edge-point segmentation as precision = N_t / N_e and recall = N_t / N_g, where N_t is the number of true positives in the 3D model, N_e is the number of all edge points in the 3D model, and N_g is the number of all edge points in the ground truth model. The ground truth model is created by manually segmenting the scene model. The graphs indicate that the mobile robot is easy to segment. This is probably because the background is relatively simple in those images.
7 Conclusions
This paper has presented a method of 3D object modeling and segmentation from images. We create a scene model from images using a structure-from-motion technique. The scene model is composed of edge points to represent detailed 3D object shape. A SIFT descriptor is attached to each edge point for robust object recognition. The target object is segmented using training images with different backgrounds. In experiments, we found that target objects were segmented well using a small number of training images. Future work includes the integration of object discovery and the proposed segmentation method in order for a robot to create object models automatically. The combination of active vision and object modeling would also be an important issue.
References
1. Beis, J.S., Lowe, D.G.: Shape Indexing Using Approximate Nearest-Neighbour Search in High-Dimensional Spaces. In: Proc. of CVPR (1997)
2. Canny, J.: A Computational Approach to Edge Detection. IEEE Trans. PAMI 8(6), 679–698 (1986)
3. Fischler, M., Bolles, R.: Random Sample Consensus: a Paradigm for Model Fitting with Application to Image Analysis and Automated Cartography. Communications ACM 24, 381–395 (1981)
4. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
5. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: Proc. of CVPR 2006 (2006)
6. Lindeberg, T.: Feature Detection with Automatic Scale Selection. Int. J. of Computer Vision 30(2), 79–116 (1998)
7. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. Int. J. of Computer Vision 60(2), 91–110 (2004)
8. Marszalek, M., Schmid, C.: Spatial Weighting for Bag-of-Features. In: Proc. of CVPR 2006 (2006)
9. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350. Springer, Heidelberg (2002)
10. Mikolajczyk, K., Zisserman, A., Schmid, C.: Shape recognition with edge-based features. In: Proc. of BMVC 2003 (2003)
11. Parikh, D., Chen, T.: Unsupervised Identification of Multiple Objects of Interest from Multiple Images: dISCOVER. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part II. LNCS, vol. 4844. Springer, Heidelberg (2007)
12. Rothganger, F., Lazebnik, S., Schmid, C., Ponce, J.: 3D Object Modeling and Recognition Using Affine-Invariant Patches and Multi-View Spatial Constraints. In: Proc. of CVPR 2003 (2003)
13. Russell, B.C., Efros, A.A., Sivic, J., Freeman, W.T., Zisserman, A.: Using Multiple Segmentations to Discover Objects and their Extent in Image Collections. In: Proc. of CVPR 2006 (2006)
14. Shi, J., Tomasi, C.: Good Features to Track. In: Proc. of CVPR 1994, pp. 593–600 (1994)
15. Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., Freeman, W.T.: Discovering objects and their location in images. In: Proc. of ICCV 2005 (2005)
16. Skrypnyk, I., Lowe, D.G.: Scene Modelling, Recognition and Tracking with Invariant Image Features. In: Proc. of ISMAR 2004 (2004)
17. Tomasi, C., Kanade, T.: Shape and Motion from Image Streams under Orthography: A Factorization Approach. Int. J. of Computer Vision 9(2), 137–154 (1992)
18. Tomono, M.: 3-D Object Map Building Using Dense Object Models with SIFT-based Recognition Features. In: Proc. of IROS 2006 (2006)
Cumulus Cloud Synthesis with Similarity Solution and Particle/Voxel Modeling
Bei Wang, Jingliang Peng, and C.-C. Jay Kuo
Department of Electrical Engineering, University of Southern California
[email protected],
[email protected],
[email protected]
Abstract. A realistic yet simple 3D cloud synthesis method is examined in this work. Our synthesis framework consists of two main components. First, we introduce a similarity approach to physics-based cumulus cloud simulation to reduce the computational complexity, together with a particle system for cloud modeling. Second, we adopt a voxel-based scheme to model the water phase transition effect and the cloud fractal structure. It is demonstrated by computer simulation that the proposed synthesis algorithm generates realistic 3D cumulus clouds with high computational efficiency and enables flexible cloud shape control and incorporation of the wind effect . . .
1 Introduction
Clouds are among the most common atmospheric phenomena in our daily life. To generate vivid outdoor scenes, realistic cloud synthesis and animation is often a key component in many 3D applications such as flight simulation, gaming and 3D movie production. Realistic cloud synthesis, albeit important, is a challenging task due to the infinite variations in cloud shape. Physics-based simulation gives the most natural-looking appearance of synthesized clouds. However, physics-based simulation demands a high computational complexity to solve the Navier-Stokes partial differential equations (PDEs). In contrast with the physics-based approach, another approach of lower complexity known as the procedural method has been studied as well. Nevertheless, the procedural approach has less flexibility in controlling cloud shape through the underlying dynamics. We adopt the physics-based simulation of cumulus clouds in this work and aim to lower its complexity. The goals and contributions of our current research are summarized as follows: 1) Computational efficiency: by adopting the similarity solution approach, we are able to capture the general characteristics of clouds with a set of constant parameters without solving PDEs explicitly; 2) Realistic visual appearance: the proposed method is able to create the general shape and amorphous appearance of a cloud as well as fine details such as fractal substructures around boundary regions; 3) Flexible user control: by modifying the setting of the constant parameters which describe the general characteristics of the cloud, users can control the cloud shape flexibly. Technically, the above features are accomplished by the adoption of the similarity simulation approach as well as two powerful graphics modeling tools;
namely, particles and voxels. The particle system provides a natural model for thermal-based cloud representation. In the first stage, each particle is used to simulate the movement of a thermal. Then in the second stage, a grid structure within a bounding box is utilized to model the accumulation and condensation of water vapor particles. The rest of this paper is organized as follows. Previous work is reviewed in section 2. Three major components of the proposed cloud synthesizer are described in sections 3-5. Finally, experimental results and conclusion are given in section 6 and section 7.
2 Review of Previous Work
Research on cloud synthesis, including simulation, modeling and rendering, has a history of more than two decades. In the early years, methods of low complexity prevailed in the field due to the limitation of computing resources. With the increase in computing power in recent years, methods of high complexity have begun to thrive. Generally speaking, previous cloud synthesis methods can be classified into two main categories: visual-based and physics-based methods. Visual-based methods generate clouds using the amorphous cloud property rather than the physical process of cloud formation. Examples include procedural modeling [1], fractal modeling [2], qualitative simulation [3], and texture sprite modeling [4]. These methods have the attractive advantages of low computation and easy implementation. They also provide the ability to change the cloud shape through parameter adjustment. However, modification of the cloud shape is often achieved by trial-and-error, which limits their extension to dynamics. In contrast, physics-based methods [5] incorporate fluid dynamics in cloud synthesis. They usually solve a set of PDEs in the simulation process (i.e., the Navier-Stokes equations), so their computational complexity is very high. The computational cost was significantly reduced by Stam [6], who solved the Navier-Stokes equations with simplified fluid dynamics. Thereafter, his work was applied to cloud simulation by Harris [7], who considered an important physical process in cloud formation (phase transition); in addition, Harris implemented Stam's stable fluid solver on the GPU to achieve real-time simulation. Similarly, Miyazaki [8] used the Coupled Map Lattice (CML) to generate various types of clouds. In Harris's and Miyazaki's work, cloud simulation depends on the initial water vapor condition and a set of parameters such as viscosity. However, they do not provide a straightforward solution to cloud shape control, which is often a desirable feature. Compared with previous physics-based methods, our proposed cumulus synthesis method does not solve PDEs explicitly, but adopts an analytic solution, with a set of parameters, that is consistent with the Navier-Stokes equations; our earlier cloud simulation work on decoupled modeling [9] was based on the same approach. We also take the water phase transition into account. By choosing the parameters properly, we can generate different clouds efficiently.
3 Similarity Solution to Turbulent Thermal
3.1 Motivation and Justification
Clouds can be classified according to appearance as cumulus, cirrus, and stratus. From the formation viewpoint, clouds can be categorized into convection-dominant and non-convection-dominant clouds based on the major physical process in cloud development. The cumulus cloud is the type whose major characteristic is the convective process. The convective process is driven by buoyancy, an upward force acting on an air parcel in response to the density difference between the air parcel and the surrounding air [10]. Initially, an air parcel gets heated and lifted by buoyancy from the ground. It continues to rise as long as it is warmer than the surrounding air. Eventually, the air parcel reaches the equilibrium level (EL), at which the temperature of the air parcel equals that of the surrounding air. Since the air parcel carries a certain amount of momentum, it continues to ascend and surpasses the EL. When the air parcel gets cooler than the surrounding air, it sinks toward the EL. Subsequently, the air parcel oscillates vertically around the EL, damping with time. While oscillating, an air parcel also spreads laterally. With the accumulation and condensation of air parcels, a cumulus cloud forms. Note that cumulus clouds usually form in a stable stratified environment that leads to fair weather. Based on laboratory experiments, the dynamic process of an air parcel in a stable stratified environment can be well described by the turbulent thermal model, where a thermal is "a discrete buoyant element in which the buoyancy is confined to a limited volume of fluid" [10], and buoyancy is the driving force for a thermal's movement. In cloud simulation, the atmosphere can be viewed as a simple fluid, and for simple fluid flows, under certain assumptions on the thermal, we are able to characterize the fluid's behavior through a similarity solution without solving the governing equations explicitly. This applies to the case of our interest. Dimensional analysis is useful in deriving and representing the similarity solution. In fact, similarity theory and dimensional analysis have been used for the investigation of simple convective thermals and plumes for more than half a century, e.g., [11,12,13]. The derivation of the similarity solution for a thermal is briefly reviewed below. For more details, please refer to [10].
3.2 Derivation of Similarity Solution
Normally, cumulus clouds form in a stable atmospheric environment, where the atmosphere is stratified with its density a function of altitude. To derive the similarity solution for a turbulent thermal in a stable stratified fluid, we make the following assumptions [10]: 1) the radial profiles of the mean vertical velocity and the mean buoyancy are similar at all times; 2) the mean turbulent inflow velocity is proportional to the mean vertical velocity; 3) the density perturbation in the thermal is small compared with the mean density (Boussinesq approximation).
The governing equations for the momentum, heat and mass of the turbulent thermal are given, respectively, as

\frac{d\mathbf{u}}{dt} = -\frac{1}{\rho_0}\nabla p + B\hat{k} + \nu \nabla^2 \mathbf{u}; \quad \frac{dB}{dt} = \kappa \nabla^2 B; \quad \nabla \cdot \mathbf{u} = 0   (1)

Here, ρ_0 is the density, u is the velocity vector, ν is the kinematic viscosity coefficient, κ is the heat diffusion coefficient and B is the buoyancy. Following the first assumption, we can link the radial profiles of the mean vertical velocity and the mean buoyancy. Under the second assumption, we have u = −αw, where u, w and α are the mean turbulent radial velocity, the mean vertical velocity and a constant proportional to the fractional entrainment of mass, respectively. By integrating Eqs. (1) over the volume of the thermal, a new set of governing equations for the momentum, the heat and the mass continuity of the turbulent thermal is derived as

\frac{d}{dt}\Big(\frac{4}{3}\pi R^3 w\Big) = \frac{4}{3}\pi R^3 \alpha B; \quad \frac{d}{dt}\Big(\frac{4}{3}\pi R^3 B\Big) = -\frac{4}{3}\pi R^3 \alpha w N^2; \quad \frac{d}{dt}\Big(\frac{4}{3}\pi R^3\Big) = 4\pi R^2 \alpha w   (2)

Another equation needed in the integration is dz/dt = w. In Eq. (2), a new parameter N is introduced: the buoyancy frequency, i.e., the frequency at which an infinitesimal sample of fluid oscillates if it is displaced vertically. It is used as a measure of the atmosphere's stratification. The new set of variables in Eqs. (2) used to describe the thermal's attributes includes the thermal radius R, the thermal buoyancy B, the mean thermal vertical velocity w, and the thermal vertical height z. They are functions of the independent variable, time t. By applying dimensional analysis to the new set of governing Eqs. (2), we obtain dimensionless solutions R′, w′, B′ and z′, where the prime denotes that they differ from R, w, B and z by scaling constants. The curves of the dimensionless solution are shown in Figure 1. We see from the figure that an individual thermal overshoots, the first time, the level at which the buoyancy vanishes (B′ = 0), and decelerates to zero velocity afterwards. Then, the thermal oscillates around its EL, as illustrated by the wave-shaped curves, and the oscillations are damped over time. Meanwhile, the thermal continues to expand with increasing radius. The curves of R′, w′, B′, and z′ describe the behavior of the convective process accurately, matching experimental data well [12,13].
Fig. 1. Dimensionless radius (R′), height (z′), buoyancy (B′), and mean vertical velocity (w′) of a thermal obtained from a stable stratified fluid, plotted against dimensionless time t
The dimensionless solution depicts the kernel of a turbulent thermal's movement and serves as the basis for our cloud simulation. By proper scaling, we can obtain the actual dimensional values of R and z as

z = \frac{1}{4}\left(\frac{3}{\pi}\right)^{1/4} \alpha^{1/4} F_0^{1/4} N^{-1/2} z'; \quad R = \left(\left(\frac{3}{\pi}\right)^{3/4} \alpha^{3/4} F_0^{3/4} N^{-3/2} R'\right)^{1/3}   (3)

where R′ and z′ are the dimensionless solutions, and N (the atmosphere stratification), F_0 (the heat source) and α (the entrainment-of-mass coefficient) are a set of fluid parameters. For given values of N, F_0 and α, we can obtain the corresponding functions R and z that depict the variation of the thermal's radius and height with time, as shown in Eqs. (3).
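For illustration, the sketch below integrates the thermal system obtained by expanding Eqs. (2) together with dz/dt = w, which reproduces the damped oscillation of Figure 1 in dimensional form. It is a sketch under stated assumptions, not the authors' implementation; in particular, the initial radius, velocity and buoyancy stand in for the heat-source parameter F_0 of Eq. (3).

import numpy as np
from scipy.integrate import solve_ivp

def thermal_trajectory(alpha=0.1, N=0.5, R0=50.0, w0=1.0, B0=0.05, t_end=300.0):
    """Integrate the thermal equations derived from Eqs. (2) plus dz/dt = w.

    Expanding the derivatives in Eqs. (2) and using dR/dt = alpha*w gives the
    equivalent system below.  The initial values (R0, w0, B0) are assumptions
    standing in for the heat source F0.
    """
    def rhs(t, y):
        R, w, B, z = y
        dR = alpha * w
        dw = alpha * B - 3.0 * alpha * w * w / R
        dB = -alpha * w * N**2 - 3.0 * alpha * w * B / R
        dz = w
        return [dR, dw, dB, dz]

    sol = solve_ivp(rhs, (0.0, t_end), [R0, w0, B0, 0.0], max_step=0.1)
    return sol.t, sol.y  # rows: R(t), w(t), B(t), z(t)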
4 Thermal Movement Modeling by Particles
4.1 Parameters in Particle Systems
We choose a particle system to model thermals because particle attributes bear a lot of similarity to those of a thermal. Here, we use an individual particle to model a thermal over its lifetime. As it moves, it follows the motion specified by the similarity solution. The attributes of a particle include its velocity, radius, lifetime, etc. Another attribute of interest is the density of particles, which will be used in the phase transition process. At each time step, water vapor particles are generated at the ground. The number of water vapor particles and the interval at which vapor particles are released can be flexibly selected by users. Even though the similarity solution governs the water vapor particles in the model, a realistic cloud may have an amorphous shape, which indicates that other factors give rise to shape differences. They are explained below.
4.2 Effect of Water Vapor Life Time
Due to the randomness of the initial momentum, different particles exhibit various oscillating patterns. We assign each particle a condensation time, which is the lifetime of the water vapor particle. When the condensation time is reached, the particle "dies" and is condensed into a voxel. Even though the amount of heat that each water vapor particle gets from the same ground area is similar, it still varies from particle to particle, which results in different heights by Eq. (3). As a consequence of the water vapor transition process, the higher the altitude, the less likely the water vapor is to remain. Thus, water vapor particles with higher momentum tend to reach their condensation level faster, i.e., they have a shorter condensation time. In contrast, water vapor particles heated by the mean F_0 at the ground tend to oscillate more around the EL. During the oscillation, they move laterally to form the cloud anvil. Since the condensation time is related to the heat source, we can use the heat source to control the condensation time.
Fig. 2. Comparison of cloud shapes controlled by different initial water vapor distributions
Fig. 3. Comparison of cloud shapes (a) without and (b) with the wind effect during its formation
4.3 Cloud Shape Control
The general cloud shape is related to the water vapor source on the ground. Depending on the ground vegetation, such as barren land or grass, water vapor is generated differently. According to atmospheric measurements of water content concentration, the horizontal cumulus concentration without the wind effect is Gaussian distributed [10], so a Gaussian-shaped vapor source is typically chosen on the ground. The model controls the water vapor momentum as well as the emission frequency. We can specify the mean value of the water vapor source as a function of time and space. The actual water source can be perturbed by Perlin noise to generate the effect of randomness. Figure 2 shows a pair of cloud formation results with different initial water vapor distributions. In Figure 2(a), quite a few particles have a short condensation time, so the cloud has an obvious top. For comparison, if fewer particles have a high heat flux, the resultant cloud does not have an obvious top, as shown in Figure 2(b).
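A minimal sketch of such a vapor source is shown below; it samples particle release positions from a Gaussian and perturbs the per-particle heat flux with simple random noise as a stand-in for the Perlin-noise perturbation described above. Function names and parameters are assumptions.

import numpy as np

def emit_vapor_particles(n_particles, center=(0.0, 0.0), sigma=100.0,
                         mean_heat_flux=1.0, heat_jitter=0.3, rng=None):
    """Sample ground positions and heat fluxes for newly released vapor particles.

    Positions follow a Gaussian-shaped source centered on `center`; the per-
    particle heat flux F0 is the mean value perturbed by simple random noise
    (a stand-in for the Perlin-noise perturbation described in the paper).
    """
    rng = np.random.default_rng() if rng is None else rng
    xy = rng.normal(loc=center, scale=sigma, size=(n_particles, 2))
    f0 = mean_heat_flux * (1.0 + heat_jitter * (rng.random(n_particles) - 0.5))
    return xy, f0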
4.4 Wind Effect
When wind blows on a cloud, the effect is more obvious at the top (rather than the bottom) of the cloud [10]. Thus, we add one more attribute to a particle, i.e., the vertical distance from the approximate cloud base, which is determined by the difference between the initial heat flux of the current particle and the minimum heat flux of the water vapor particles in our model. The degree of horizontal movement due to the wind can be controlled by this distance. In Figure 3, we show a pair of clouds without and with the wind effect using this model.
Fig. 4. Comparison of clouds (a) with and (b) without the fractal structure
5 Phase Transition Modeling by Voxel Dynamics
Besides the thermal's movement, another important factor in cloud formation is the phase transition. Water vapor particles oscillate until condensation happens in the cloud formation process. In this stage, users can flexibly choose the grid size of the bounding box in the voxel model. The size is decided by the initial parameter range; the higher the resolution, the finer the details we can observe. The water vapor/droplet density in a voxel is updated at each time step. When the water phase transition occurs, the released latent heat can become a new source of water vapor generation, which leads to the cloud's fractal substructure.
5.1 Water Phase Transition
Water vapor accumulates over time. When it reaches a certain amount, the so-called saturation point, the water phase changes and condensation occurs. This saturation point is related to the environmental pressure and temperature. The maximum water vapor density used in the model is determined by [7]

w_{max}(T, p) = \frac{380.2}{p} \exp\!\left(\frac{17.67\,T}{T + 243.5}\right)   (4)
where T is the temperature and p is the pressure. When the density of water vapor particles exceeds the maximum amount given by Eq. (4), the water vapor transitions into visible water droplets. Using Eq. (4), one can create a saturation density look-up table off-line, and compare the water vapor density in each voxel with the table to decide whether the water phase transition occurs at each time step. Besides the saturation point, there is another important factor that affects the water droplet density; namely, the cloud condensation nucleus (CCN), also known as the cloud seed. When no CCNs are present, the water vapor can be supercooled below 0 °C before droplets spontaneously form. To take this factor into account, we assign a continuous random number to each voxel at the initialization stage to indicate its distinctive condensation capability.
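The saturation test of Eq. (4) can be sketched as follows; the unit conventions and the way the per-voxel CCN random number enters the test are assumptions of this sketch, not details taken from the paper.

import numpy as np

def saturation_density(T, p):
    """Maximum water vapor density of Eq. (4), w_max(T, p), in the model's units."""
    return (380.2 / p) * np.exp(17.67 * T / (T + 243.5))

def build_saturation_table(T_range, p_range):
    """Off-line lookup table of w_max over a grid of (T, p) values."""
    T, P = np.meshgrid(T_range, p_range, indexing="ij")
    return saturation_density(T, P)

def condenses(vapor_density, T, p, ccn_factor=1.0):
    """Phase-transition test for one voxel.

    `ccn_factor` is the per-voxel random number modeling cloud condensation
    nuclei; values below 1 make condensation easier (an assumed convention).
    """
    return vapor_density > ccn_factor * saturation_density(T, p)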
5.2 Fractal Substructure and Entrainment
A procedural cloud yields a good-looking cloud image with different levels of noise since its structure is fractal-like. It is assumed by many researchers that
the fractal-like cloud appearance is due to the fractal nature of turbulence. Furthermore, the water vapor is not isolated in the atmosphere. It actually attracts ambient mass and mixes with the environmental air. The environmental air then becomes part of the cloud; this is called entrainment. It is observed experimentally that entrainment happens mostly at the cloud boundary, where there is more turbulence. This offers a physical explanation for the birth of new thermals due to latent heat release at the cloud boundary. In our model, the latent heat released from the water phase change provides a local heat source for buoyancy, and the released water vapor particle is governed by the similarity solution with the heat source parameter taken from the phase change. Compared with the vapor particles released from the ground, however, vapor particles released due to latent heat have a short lifetime. The substructure is generated only at the cloud surface, for reasons of both atmospheric science and computational efficiency. We compare a pair of clouds with and without the fractal structure in Figure 4, while other parameters are kept the same. It is obvious that the cloud with the fractal structure looks more realistic.
5.3 Efficient Update of Water Phase Change
In our model, the computation consists of two parts. The first part is the update of the particles in the particle system at every time step. The second is the update of the water phase change in the voxel stage. If we divide the bounding box into N^3 voxels, the complexity of the water phase change computation is O(N^3), as we need to traverse all voxels to check whether the water vapor/droplet state has changed. We can improve the update procedure by only examining voxels in which new water particles have died since the last time step. By updating a Boolean check-in table for the water phase change at every time step, our computation time improves by 10–20%, depending on the voxel resolution. Finally, a two-pass illumination and rendering method is used in our work, which is similar to the work of Harris [14].
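A sketch of the sparse update with a Boolean check-in table is given below; the array layout and function names are assumptions.

import numpy as np

def update_phase_change(vapor, droplet, saturation, dirty):
    """Update water phase only in voxels flagged in the Boolean table `dirty`.

    vapor, droplet, saturation, dirty are equally shaped 3D arrays; a voxel is
    flagged if a particle died in it since the last time step, so only those
    voxels are tested instead of traversing the full O(N^3) grid.
    """
    for i, j, k in np.argwhere(dirty):
        if vapor[i, j, k] > saturation[i, j, k]:
            excess = vapor[i, j, k] - saturation[i, j, k]
            droplet[i, j, k] += excess        # condense the excess vapor
            vapor[i, j, k] = saturation[i, j, k]
    dirty[:] = False                          # reset the check-in table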
6 Experimental Results
We present a pair of cloud results with different buoyancy frequency values in Figure 5. The buoyancy frequency values in Figures 5(a) and (b) are equal to 0.5 and 0.01, respectively. By definition, the buoyancy frequency reflects the temperature difference per unit vertical step. A larger value of N means a bigger temperature difference. Then, more mixing happens during the water vapor's lifting, so the height a particle can reach is lower. Since the buoyancy frequency in Figure 5(a) is larger than that in Figure 5(b), the vertical extent of the resultant cumulus cloud becomes smaller while all other parameters are the same. We show a pair of clouds with different entrainment coefficient values in Figure 6. The mass entrainment values in Figures 6(a) and (b) are set to 0.5 and 0.03, respectively. The mass entrainment coefficient represents the ability to interact with the environmental air. For a large α value, there is more mixing
Fig. 5. Clouds with different buoyancy frequency values: (a) N = 0.5 and (b) N = 0.01
Fig. 6. Clouds with different entrainment coefficients: (a) α = 0.5 and (b) α = 0.03
Fig. 7. A scene of multiple clouds
with the surrounding air, which in turn limits the height a particle can reach. The α value in Figure 6(a) is larger than that in Figure 6(b). Thus, the vertical extent of the cloud in Figure 6(a) is smaller. Its oscillation height is smaller and its ability to hold the water vapor is higher, which means the saturation point for the water phase transition is higher than that in Figure 6(b). As a result, the cloud in Figure 6(a) is not as fluffy as that in Figure 6(b). Furthermore, the grid resolution in the water phase transition stage is increased to 80 × 80 × 64 in Figure 6. As compared with Figure 5, there are more particles to render and more details to
observe in Figure 6. The cloud synthesis time for the result shown in Figure 6 is 27 sec. This time can be further reduced to around 24 sec (about an 11% improvement) using the more efficient water phase update presented in Section 5.3. Finally, a scene containing several synthesized clouds is shown in Figure 7.
7 Conclusion and Future Work
A novel cloud synthesis process was proposed in this work. The synthesis framework uses a similarity solution to simulate the physical behavior of turbulent thermals. Since the parameters in the similarity approach have physical meanings, users can control the cloud shape and properties flexibly by manipulating these parameters. Furthermore, we adopted a two-stage modeling scheme for the cloud formation process: the particle model is used in the first stage and the voxel model in the second stage. The water vapor movement is modeled by particles, while the water phase change is handled by the voxel model. We provided several examples to demonstrate the user's freedom in controlling the cloud shape, appearance and visual properties by choosing the corresponding parameters.
References
1. Ebert, D.S., Musgrave, F.K., Peachey, D., Perlin, K., Worley, S.: Texturing & Modeling: A Procedural Approach, 3rd edn. Morgan Kaufmann, San Francisco (2002)
2. Gardner, G.: Visual simulation of clouds. In: ACM SIGGRAPH (1985)
3. Neyret, F.: Qualitative simulation of convective cloud formation and evolution. In: Eurographics Workshop on Animation and Simulation, pp. 113–124 (1997)
4. Wang, N.: Realistic and fast cloud rendering. In: ACM SIGGRAPH (2003)
5. Stam, J., Fiume, E.: Turbulent wind fields for gaseous phenomena. In: ACM SIGGRAPH, pp. 369–376 (1993)
6. Stam, J.: Stable fluids. In: ACM SIGGRAPH (1999)
7. Harris, M.J., Baxter III, W.V., Scheuermann, T., Lastra, A.: Simulation of cloud dynamics on graphics hardware. In: EUROGRAPHICS (2003)
8. Miyazaki, R., Yoshida, S., Dobashi, Y., Nishita, T.: A method for modeling clouds based on atmospheric fluid dynamics. IEEE Computer Graphics and Applications (2001)
9. Wang, B., Peng, J.L., Kwak, Y., Kuo, C.C.: Efficient and realistic cumulus cloud simulation based on similarity approach. In: ISVC (2007)
10. Emanuel, K.: Atmospheric Convection, 1st edn. Oxford University Press, Oxford (1994)
11. Batchelor, G.: Heat convection and buoyant effects in fluid. Quart. J. Roy. Meteor. Soc. 80 (1954)
12. Morton, B.: Buoyant plumes in a moist atmosphere. J. Fluid Mech. 2 (1957)
13. Turner, J.: The starting plume in neutral surroundings. J. Fluid Mech. 13 (1962)
14. Harris, M., Lastra, A.: Real-time cloud rendering. In: Chalmers, A., Rhyne, T.M. (eds.) EG 2001 Proceedings, vol. 20(3), pp. 76–84. Blackwell Publishing, Malden (2001)
An Efficient Wavelet-Based Framework for Articulated Human Motion Compression
Chao-Hua Lee and Joan Lasenby
Department of Engineering, University of Cambridge
Abstract. We propose a novel framework for compressing articulated human motions using multiresolution wavelet techniques. Given a global error tolerance, the number of wavelet coefficients required to represent the motion is minimized. An adaptive error approximation metric is designed so that the optimization process can be accelerated by dynamic programming. The performance is then further improved by choosing wavelet coefficients non-linearly. To handle the footskate artifacts on the contacts, a contact stabilization algorithm which incorporates an Inverse Kinematics solver is adopted. Our framework requires far fewer computations, and achieves better performance in compression ratio compared to the best existing methods.
1 Introduction
One of the most complex and difficult tasks in computer animation is to generate life-like motions for human figures. The signals in the channels tend to be very non-linear, which makes designing computer-aided tools for animators very difficult. The process of generating human-like animation by hand is therefore very expensive and sometimes not even possible. An alternative solution to this problem is so-called performance animation, where we acquire the motions from real actors. This process is also known as motion capture. A whole new set of challenges arises with motion capture technology, such as dealing with the huge amount of data in real time, editing and reusing the recorded data to synthesize new motions, and transmitting motion capture data over bandwidth-limited channels. In this paper, we propose a novel framework for finding a more compact representation of articulated human motion data to ease the problems stated above. The framework is based on wavelet decomposition techniques, and we model the problem in such a way that dynamic programming can be used to optimize it efficiently. A non-linear refinement algorithm is then used to further enhance the performance. Footskate artifacts on the contacts are then removed by a contact stabilization algorithm. Our framework achieves a high compression ratio without noticeable visual artifacts in the reconstructed motions.
2 Related Work
For the purpose of compressing pure 3D human motion, the MPEG-4 standard comprises a standard compression pipeline for efficient compression of the Body
Animation Parameters (BAP) [1]. However, it is computationally costly, produces unacceptable distortion, and does not provide efficient compression since it fails to exploit the structural information of the human body. [2] propose a sparse-indexing technique to compress BAP data, which results in a better compression ratio and less computational complexity than existing MPEG-4 methods. A high compression ratio and fast decompression are reported in the framework of [3], where a motion is approximated using Bezier curves and clustered PCA. [4] develops a framework based on wavelet decomposition, where more wavelet coefficients are used to represent the most significant joints in a given motion. Footskate artifacts on the contacts are observed in very low bit rate cases, and a modified framework with a contact stabilization algorithm is then developed to enhance the performance in [5]. [6] takes a similar approach of using wavelet techniques to reduce the size of motion capture clips. However, their approach suffers from an expensive optimization process for longer sequences as the search space increases dramatically. The same problem is observed in a recent work based on Bezier approximation [7]. Building on [4,5], we here propose a new approach to provide a more accurate error approximation, which improves robustness and efficiency.
3 Proposed Methods
A captured performance can be represented by either a series of rotation angles of the joints, or a series of joint positions. Since errors on the motion data will be introduced in our approach, it is preferable to work with rotation angles rather than position data as this ensures the bone lengths remain consistent throughout the processed motions. Our aim is to develop a framework using the minimum number of wavelet coefficients of the joint rotation angles to represent the original signals while maintaining a desired visual quality specified by the user. However, this constrained optimization problem becomes extremely expensive to solve as the lengths of the signals increase. Moreover, the search space of the wavelet coefficients is in angle space and the global error to evaluate the performance is computed in position space. An expensive transformation between these two spaces is required every time a coefficient selection is made in order to evaluate the induced distortion. To simplify the optimization problem, our idea is to first roughly allocate the coefficient budget to each joint instead of doing piecewise optimization over all the coefficients. Higher coefficient budgets should be given to highly active joints thus suggesting a multiresolution wavelet decomposition as the tool to analyze the joint activities. We first look for a solution to the optimization problem on a level basis. The idea is to replace the original signal in each channel with one of the wavelet approximations which requires fewer samples to reconstruct, and then to analyze the distortion introduced. Once the solution is available, we will assign each channel the number of coefficients of the approximation as the coefficient budget.
3.1 Multiresolution Wavelet Decomposition
For typical motion capture data, each joint consists of 1 to 6 channels (degrees of freedom), depending on the motion capture system setup. The data in each channel is a one-dimensional signal. The wavelet decomposition can be used in a de-noising process: it filters out the high-frequency oscillations and preserves the basic shape of the curve. The number of samples required to reconstruct this approximation is considerably less than the number of samples in the original signal. Suppose the length of each decomposition filter is equal to 2N and the length of the signal is n; then the approximation and detail coefficients are of length

\left\lfloor \frac{n-1}{2} \right\rfloor + N   (1)
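As a quick check of Eq. (1), the sketch below computes per-level approximation lengths with PyWavelets (an assumed library choice); for the Coiflet4 wavelet used later, the filter length is 24, i.e. N = 12, and the result agrees with PyWavelets' own coefficient-length rule for zero- or symmetric-padding modes.

import pywt

def approx_coeff_count(n_samples, wavelet_name="coif4", level=1):
    """Number of coefficients needed for the level-`level` approximation.

    At each level the coefficient length follows Eq. (1):
    floor((n - 1) / 2) + N, where 2N is the decomposition filter length.
    """
    wavelet = pywt.Wavelet(wavelet_name)
    n = n_samples
    for _ in range(level):
        n = (n - 1) // 2 + wavelet.dec_len // 2
    return n

# Example: a 316-frame channel decomposed once with coif4 (filter length 24, N = 12)
print(approx_coeff_count(316, "coif4", level=1))  # floor(315/2) + 12 = 169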
3.2 Problem Definition
Suppose there are m joints and p channels in a given motion M. We wish to find an optimal set of decomposition levels D = {d_k | k = 1, ..., p} for all the channels such that the total number of frames N = \sum_{k=1}^{p} n_k(d_k) required to reconstruct the motion is minimized while the overall global error E_g = \sum_{j=1}^{m} E_j satisfies

E_g \le E_c   (2)

where E_j is the positional error of the jth joint, E_c is a global error constraint, d_k is the decomposition level for the kth channel, and n_k(d_k) is the number of frames required to reconstruct the level-d_k approximation of the kth channel. Although selecting an optimal combination of wavelet decomposition levels is a simpler problem than piecewise selection of the wavelet coefficients, the optimization process is still very expensive and slow due to the large number of frames in motion capture data. For an example motion of 10 seconds captured at 120 frames per second, it usually takes hours to find the optimal solution on an up-to-date machine. By modeling this problem as a discrete-discrete dynamic programming problem, we expect that the optimal solution can be found efficiently. However, there are some modifications we need to make to our problem before we can cast it in the dynamic programming framework. In typical dynamic programming problems, choosing a choice variable introduces a fixed cost to the system. However, the situation in our problem is different. The choice variable is the wavelet decomposition level decision d_k (for the kth channel), and the cost is the error introduced in its child joints by making this decision. Since the joints are connected in a hierarchical skeleton structure, the errors on the joints are dependent on the parent joints. Furthermore, the error on a channel of a joint is in angle space, while the global error that is used to evaluate the performance is the sum of errors in position space. The transformation is computationally costly. We therefore wish to design an error metric to approximate the actual global positional error with only the knowledge of the angular errors, such that it both identifies the significant joints in a motion and leaves the errors in this metric independent of each other.
3.3 Adaptive Error Approximation Metric
The relationship between the rotation angle errors and the global error can be described as a function F,

E_g = F(e_1(d_1), ..., e_p(d_p))   (3)

where e_k(d_k) represents the angular error of the kth channel caused by choosing the d_k-th level wavelet approximation. [4] proposes a linear model to approximate F which works well in most cases but suffers from inaccuracies when the individual errors are large. The method is later modified to address this issue in [5]; this approach incorporates bone lengths, which improves the approximation accuracy. However, the performance still deteriorates when the motions contain rapid movements. The approach proposed here designs a weighting function that gives weights W = {w_k | k = 1, ..., p} to all the channels such that

\sum_{k=1}^{p} w_k \cdot e_k \approx F(e_1(d_1), ..., e_p(d_p)) = E_g   (4)
Figure 1 shows a simplified example of 3 connected joints. There are three joints J_0, J_1 and J_2 in this example. J_0 is the root joint that has J_1 and J_2 as its children. Due to the rotation error introduced in the wavelet decomposition process, J_1 and J_2 move to J_1′ and J_2′ after we reconstruct the rotation angles of their parents using the approximations.
Fig. 1. The error approximation metric
From this figure we can see that the position error on the kth joint, E_k, is the magnitude of the distance between J_k and J_k′. Note that this position error is introduced by the rotation angle error on its parent joint, e_{k−1}. The total error in this example is

E_g = \sum_{k=1}^{2} E_k   (5)
Fig. 2. The weighted error table generated from the adaptive error approximation metric (channel ID versus wavelet decomposition level); warmer colors represent higher errors
As mentioned, the position error E_1 is introduced by the rotation angle error of its parent joint J_0, denoted by e_0. We use the arc a_1 as the approximation for E_1; similarly, the arc a_2 is used as the approximation for E_2. The total error now becomes:

E_g = E_1 + E_2 \approx a_1 + a_2 = dist(J_0, J_1) \cdot e_0 + dist(J_0, J_2) \cdot (e_0 + e_1)   (6)

The actual global positional error is now approximated in the form of Eq. (4), where w_0 = dist(J_0, J_1) + dist(J_0, J_2) and w_1 = dist(J_0, J_2) are the weights assigned to J_0 and J_1, respectively. In this way, the weight of each channel in a full-body motion can be computed. To make the approximations more accurate for handling different motions, we adaptively update the weights for each frame, since the distances between joints change frame by frame. This approach is later shown to outperform previous methods. By performing multiresolution wavelet decomposition on the channels of all the joints in an example motion (motion 01 of subject 07) from the CMU mocap database [8], and computing the error approximation metric based on the method described above, we obtain the weighted error table shown in Figure 2. There are 6 channels for the root joint and 1 to 3 channels for each of the other 30 joints, which results in a total of 62 channels. The highest possible wavelet decomposition level is 8 since there are 316 frames in this example. It can be seen that some channels produce much higher errors than others at higher wavelet decomposition levels, which means that these channels are more active in this example motion. We will use this weighted error table as an important guideline for making decisions. It is also essential to point out that the error approximations (or cost functions in optimization problems) in this table are independent of each other. We can then model our problem as a dynamic programming problem.
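One simplified reading of this metric is sketched below: the weight of a joint's rotation channels is taken as the summed distance from that joint to every descendant it moves, recomputed each frame. This is an illustrative approximation of the arc-length argument of Eq. (6), not the authors' exact implementation; the data structures are assumptions.

import numpy as np

def channel_weights(joint_positions, children):
    """Per-frame weights for the adaptive error approximation metric.

    joint_positions : dict joint -> (3,) world position for the current frame.
    children        : dict joint -> list of child joints (skeleton hierarchy).
    The weight of a joint's rotation channels is the summed distance from that
    joint to every descendant it moves (a simplified reading of Eq. (6)).
    """
    def descendants(j):
        out = []
        for c in children.get(j, []):
            out.append(c)
            out.extend(descendants(c))
        return out

    weights = {}
    for j, pos in joint_positions.items():
        weights[j] = sum(np.linalg.norm(joint_positions[d] - pos)
                         for d in descendants(j))
    return weights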
3.4 Dynamic Programming
Our problem is a discrete-discrete dynamic programming problem since the global error constraint (the state variable) and the decisions relating to decomposition levels (the control variables) are discrete and finite. The inputs to our system are:
– a set of p channels, C = {C_1, ..., C_p},
– a gain function G_k(·) for each channel, k = 1, ..., p,
– a cost function E_k(·) for each channel, k = 1, ..., p,
– a global error constraint E_c.
For each channel C_k, there is a gain function G_k(d_k) and a cost function E_k(d_k). The gain function G_k(d_k) represents the savings in the number of frames needed to reconstruct the signal by choosing the d_k-th wavelet decomposition level. The cost function E_k(d_k) represents the positional error introduced by this decision, and will be approximated by the weighted angular error described in the previous section. The solution we are looking for is a set of decisions for all the channels, D = {d_k | k = 1, ..., p}, such that
– the total gain G_g = \sum_{k=1}^{p} G_k(d_k) is maximized,
– the total cost E_g = \sum_{k=1}^{p} E_k(d_k) \le E_c.
3.5 Implementing Dynamic Programming
Before proceeding, some notation is introduced for convenience:
– P[k, e], the subproblem that contains the first k channels with global error constraint e,
– G[k, e], the maximal gain (number of frames saved in our case) of the problem P[k, e].
We first calculate a gain table that contains all the gains G[k, e], where 0 ≤ e ≤ E_c. Each gain G[k, e] represents the optimal solution to the subproblem: what would the gain be if the global error constraint were just e and we were making decisions among the first k channels only? Assume that for some given values of k and e we already have a correct solution to the subproblem P[k − 1, e] stored in G[k − 1, e]. Suppose that for each joint we can decompose the signal up to level j. The pseudo-code to calculate the value of G[k, e] for given values of k and e is given in Algorithm 1.
Algorithm 1. Pseudo-code for generating the gain table
if (no budget to decompose the kth channel) then
    G[k, e] = G[k − 1, e]   // best result for the previous k − 1 channels
else
    G[k, e] = max( Gk(1) + G[k − 1, e − Ek(1)],
                   Gk(2) + G[k − 1, e − Ek(2)],
                   ...,
                   Gk(j) + G[k − 1, e − Ek(j)],
                   G[k − 1, e] )
end if
We create an initial gain table which consists of p rows, with columns representing increasing values of the error constraint e from 0 up to the global error constraint E_c of the overall problem. We know that the values in the first row and first column are all equal to 0. With the pseudo-code described above, we can then compute the entire gain table with maximal gains for each subproblem in bottom-up fashion. The value in the lower-right corner is the maximal gain, and hence the optimal solution to our overall problem P[p, E_c]. We can then find the decisions D = {d_k | k = 1, ..., p} that have been made by tracing "backwards" through the values in the table, starting at G[p, E_c]. At each point of the trace, the values tell us what decision (choice of wavelet decomposition level) was given to the current channel.
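The sketch below implements Algorithm 1 together with the backward trace. It assumes the per-channel costs have already been quantized to integer error units; the discretization step and the function signature are assumptions of the sketch, not part of the paper.

import numpy as np

def solve_level_allocation(gains, costs, E_c):
    """Dynamic-programming sketch for the level-selection problem.

    gains[k][d] : frames saved by choosing decomposition level d (1..j) for
                  channel k; gains[k][0] = 0 means "keep the original signal".
    costs[k][d] : weighted error of that choice, pre-quantized to integers.
    E_c         : integer global error budget.
    Returns (max_gain, decisions) with one level decision per channel.
    """
    p = len(gains)
    G = np.zeros((p + 1, E_c + 1), dtype=int)
    choice = np.zeros((p + 1, E_c + 1), dtype=int)

    for k in range(1, p + 1):
        for e in range(E_c + 1):
            best, best_d = G[k - 1, e], 0            # d = 0: no decomposition
            for d in range(1, len(gains[k - 1])):
                c = costs[k - 1][d]
                if c <= e and gains[k - 1][d] + G[k - 1, e - c] > best:
                    best = gains[k - 1][d] + G[k - 1, e - c]
                    best_d = d
            G[k, e], choice[k, e] = best, best_d

    # Trace back the decisions from the lower-right corner of the table.
    decisions, e = [], E_c
    for k in range(p, 0, -1):
        d = choice[k, e]
        decisions.append(d)
        if d:
            e -= costs[k - 1][d]
    return G[p, E_c], decisions[::-1]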
3.6 Nonlinear Refinement
Suppose our algorithm has made the decision d_k for the kth channel, so that n_k(d_k) is the minimum number of frames we need to store, which is also the budget for this channel. We know that n_k(d_k + 1) < n_k(d_k), as the number of frames is reduced when we proceed to the next decomposition level. This means that if we choose the approximation at the next level, we will have some extra budget. The basic idea of nonlinear reconstruction is to use this extra coefficient budget to pick up the n_k(d_k) − n_k(d_k + 1) largest coefficients among all the detail coefficients. Suppose M_k(d_k) represents the reconstructed motion for the kth channel using the level-d_k approximation, and let M′_k(d_k + 1) denote the motion reconstructed using the nonlinear method based on the next level. By checking the errors produced by M_k(d_k) and M′_k(d_k + 1), we know which one is the better approximation. Note that the numbers of frames needed to reconstruct these two signals are exactly the same. The nonlinearly reconstructed motion based on the next level, M′_k(d_k + 2), is then computed and compared with M_k(d_k) and M′_k(d_k + 1). This process is repeated until all the levels are checked. We want to find

min\left[ E(M_k(d_k)),\; \{ E(M′_k(d_k + n)) \mid n = 1, ..., j − d_k \} \right]   (7)

where E(A) denotes the operator that computes the error between A and the original signal. Whichever signal gives the lowest error is our best pick.
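A sketch of the nonlinear refinement for a single channel is given below, again using PyWavelets as an assumed library; it keeps the coarser approximation plus the largest detail coefficients that fit in the freed-up budget and zeroes the rest before reconstruction.

import numpy as np
import pywt

def nonlinear_reconstruction(signal, base_level, extra_level, wavelet="coif4"):
    """Sketch of the nonlinear refinement for one channel.

    Decompose to `extra_level` (> base_level), keep the coarser approximation,
    and spend the frame budget freed up relative to `base_level` on the
    largest detail coefficients; everything else is zeroed before waverec.
    """
    coeffs_base = pywt.wavedec(signal, wavelet, level=base_level)
    coeffs = pywt.wavedec(signal, wavelet, level=extra_level)
    budget = len(coeffs_base[0]) - len(coeffs[0])   # n_k(d_k) - n_k(d_k + m)

    details = np.concatenate(coeffs[1:])
    keep = np.argsort(np.abs(details))[::-1][:max(budget, 0)]
    mask = np.zeros_like(details, dtype=bool)
    mask[keep] = True

    # Scatter the kept coefficients back into the per-level structure.
    out, start = [coeffs[0]], 0
    for c in coeffs[1:]:
        m = mask[start:start + len(c)]
        out.append(np.where(m, c, 0.0))
        start += len(c)
    return pywt.waverec(out, wavelet)[: len(signal)]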
4 Footskate Artifact Removal
A major drawback of having distortions on rotation angles in a motion is the footskate artifact. Any “sliding” artifacts on the contacts can easily be picked
up by the audience and cause undesired perceptual degradation. Numerous approaches incorporating inverse kinematics have been developed in order to address this problem [9][10][11]. However, applying IK separately to each frame often produces discontinuities and requires further motion smoothing or blending. There are also cases when changes of the bone lengths have to be made to satisfy the positional constraints. Although these methods are capable of removing footskate artifacts, the modified motions may be undesirable for compression purposes where "closeness" to the original motion is important. For the reasons stated above, we encode the entire trajectories of the contacts by thresholding the derivatives of the curves, and use them as the positional constraint in the IK solver. In our experiments, this curve simplification method works very well since these curves tend to have many flat segments. Figure 3 shows an example result using 6 IK iterations.
Fig. 3. Comparison of the compressed motion and IK correction
5
Results
We test our compression algorithm on the entire CMU motion capture database [8]. The skeleton setup has 31 joints, 7 end effectors and 62 degrees of freedom. Various kinds of motions are included in our experiment to demonstrate the flexibility of the algorithm. Coiflet4 wavelet is chosen as it provides smooth reconstruction and good resolution in time domain. 5.1
Distortion Metric
A proper error metric is needed in order to evaluate the rate-distortion performance of a lossy compression framework. However, designing a metric to determine the perceptual quality of a motion is a difficult task. Since it is widely agreed that an audience is more sensitive to positional differences than angular differences, a common approach is to use the positional errors on the joints as the distortion measurement instead of the angular errors. A simple RMS error metric may not fully represent the visual degradation in all situations, but serves
An Efficient Wavelet-Based Framework
83
well as an indication of the “closeness” between the compressed motions and the original motions. Let the positions of p joints in the original motion at frame t be represented by a set of vectors x1 (t), ..., xp (t), and the positions of the corresponding joints in the compressed motion be x1 (t), ..., xp (t). The RMS error metric is defined as: p n 1 x = (xj (t) − xj (t))2 (8) np j=1 t=1 where n is the total number of frames in the motion. This metric measures the average positional difference of a joint between the original motion and the compressed motion. In [6] a weighted RMS error metric is proposed to solve the optimization problem: n 1 p lj x = (xj (t) − xj (t))2 (9) n j=1 t=1 l p where lj is the length of the jth bone and l = j=1 lj is the total length of all the bones. This metric assumes that the positional errors on longer bones have more impact on overall perceptual quality. However, the positional error of a joint is already the consequent result of the accumulated influence from its parent joints. It would make more sense to give higher weights to the angular distortions as opposed to the positional distortions for longer bones. We therefore choose the standard distortion metric described in equation 8 to evaluate the performance of our algorithm, and use equation 9 for a direct performance comparison. 5.2
Rate-Distortion Performance
Figure 4 shows the rate-distortion comparison of different compression schemes. The horizontal axis represents the file size compression ratio while the vertical axis represents the average error of a joint in centimeters. The first baseline method we compare to is the standard wavelet compression scheme. Without any knowledge of the skeleton structure, it performs poorly as expected. Then we look at the optimized wavelet method (BWC) mentioned in [6], which is a very recent work and has reported high performance. On average, our method achieves more than two times higher compression ratio compared to BWC. We also include the results using gzip as the entropy coder. Note that the results from BWC uses only 15 selected motions from the CMU mocap database while our results are obtained from processing the entire CMU mocap database. The performance of motion compression in general varies quite dramatically on different motions. We believe our results are therefore more accurate by performing more experiments. The distortion metric from equation 9 is adopted here for a direct performance comparison.
84
C.-H. Lee and J. Lasenby 1.4
Average Distortion εx (cm)
1.2
1
0.8
0.6
0.4 Standard Wavelet BWC gzip BWC Our Method gzip Our Method
0.2
0
0
20
40
60 80 100 Compression Ratio
120
140
160
Fig. 4. Average rate-distortion performance comparison of different methods 2.5
Average Distortion εx (cm)
2
1.5
1
0.5 Our Method gzip Our Method 0
0
200
400
600 800 Compression Ratio
1000
1200
1400
Fig. 5. Average rate-distortion performance on motions performed by subject 31
In order to explore the variability of our performance over different motions, the results of 21 motions performed by subject 31 from the CMU mocap database are presented in figure 5. In this example, each curve represents the rate-distortion performance of a motion. It can be observed that our algorithm significantly favors some motions over others. The best reports a 1000:1 compression ratio when the average error for each joint is 1.5 centimeters. There are two major factors that affect the compression ratio: the length of the motion and the complexity of the motion. In general higher compression ratios are achieved for longer motions because of greater savings in lower level reconstructions. Smoother and simpler movements in the motion also help the performance as fewer wavelet coefficients are needed to approximate the motion.
An Efficient Wavelet-Based Framework
5.3
85
Execution Time
One of the most noticeable advantages of our method over other methods is that it requires far fewer computations to solve the optimization problem. The adaptive error approximation metric allows us to avoid the expensive transformation between angular space and positional space during the optimization process. Moreover, error approximations are made to be independent so the optimization problem can be solved very efficiently by dynamic programming. The implementation is done in MATLAB and the simulations are run on a 3GHz Pentium 4 machine. Table 1 shows the execution time on various motion files.
Table 1. MATLAB execution time per frame for motions (CMU database) with different lengths Motion file Length (frames) Execution time/frame (ms) 07-03 415 1.033 13-01 2,313 1.615 31-21 4,086 1.833 31-09 14,654 1.977
6
Conclusion
We have proposed a low computation, flexible, and highly efficient framework based on multiresolution wavelet decomposition for compressing articulated human motion data. The framework involves four main stages: 1. generating a weighted error table based on an adaptive error approximation metric which significantly reduces the complexity of the optimization problem, 2. solving the resulting optimization problem using dynamic programming, 3. nonlinear wavelet coefficient reconstruction, 4. footskate artifact removal Our method shows a more than two times higher compression ratio compared to existing methods and also has far lower computation cost which results in much shorter processing times. The decompression is very fast and can be done in real-time. One of the future research goals is to develop a better footskate removal algorithm suitable for compression purposes which does not require extra information from the original motion to be stored. In addition we would like to explore the influence of using different wavelets on different motion data. Certain wavelets may produce better compression results for particular motions. A sequential compression scheme for human motion also remains a possible direction for future research.
86
C.-H. Lee and J. Lasenby
References 1. Capin, T.K., Petajan, E., Ostermann, J.: Efficient modeling of virtual humans in mpeg-4. In: Proceedings of ICME, vol. 2, pp. 1103–1106 (2000) 2. Chattopadhyay, S., Bhandarkar, S.M., Li, K.: Efficient compression and delivery of stored motion data for avatar animation in resource constrained devices. In: Proceedings of ACM Virtual Reality Software and Technology, pp. 235–243 (2005) 3. Arikan, O.: Compression of motion capture databases. In: Proceedings of SIGGRAPH, vol. 25, pp. 890–897 (2006) 4. Lee, C.H., Lasenby, J.: 3d human motion compression. In: International Conference on Computer Graphics and Interactive Techniques archive ACM SIGGRAPH 2006 Research posters (2006) 5. Lee, C.H., Lasenby, J.: A compact representation for articulated human motion. In: International Conference on Computer Graphics and Interactive Techniques archive ACM SIGGRAPH 2007 Research posters (2007) 6. Beaudoin, P., Poulin, P., van de Panne, M.: Adapting wavelet compression to human motion capture clips. In: Graphics Interface 2007, pp. 313–318 (2007) 7. Lin, Y., McCool, M.D.: Nonuniform segment-based compression of motion capture data. In: International Symposium on Visual Computer 2007 (2007) 8. CMU graphics lab motion capture database (2008), http://mocap.cs.cmu.edu 9. Charles Rose, M.C., Bodenheimer, B.: Verbs and adverbs: Multidimensional motion interpolation. IEEE Computer Graphics and Applications (1998) 10. Lee, J., Shin, S.Y.: A hierarchical approach to interactive motion editing for humanlike figures. In: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 39–48 (1999) 11. Kovar, L., Schreiner, J., Gleicher, M.: Footskate cleanup for motion capture editing. In: Proceedings of the 2002 ACM Symposium on Computer Animation, pp. 97–104 (2002)
On Smooth Bicubic Surfaces from Quad Meshes Jianhua Fan and J¨ org Peters Dept CISE, University of Florida
Abstract. Determining the least m such that one m×m bi-cubic macropatch per quadrilateral offers enough degrees of freedom to construct a smooth surface by local operations regardless of the vertex valences is of fundamental interest; and it is of interest for computer graphics due to the impending ability of GPUs to adaptively evaluate polynomial patches at animation speeds. We constructively show that m = 3 suffices, show that m = 2 is unlikely to always allow for a localized construction if each macro-patch is internally parametrically C 1 and that a single patch per quad is incompatible with a localized construction. We do not specify the GPU implementation.
1
Introduction
Quad(rilateral) meshes are used in computer graphics and CAD because they capture symmetries of natural and man-made objects. Smooth surfaces of degree bi-3 can be generated by applying subdivision to the quad mesh [CC78] or, alternatively, by joining a finite number of polynomial pieces [Pet00]. When quads form a checkerboard arrangement, we can interpret 4 × 4 grids of vertices as B-spline control points of a bi-cubic tensor product patch. Then we call the central quad ordinary and are guaranteed that adjacent ordinary quad patches join C 2 . The essential challenge comes from covering extraordinary quads, i.e. quads that have one or more vertices of valence n = 4 as illustrated in Fig. 1, left. While this can be addressed by recursive subdivision schemes, in many scenarios, for example GPU acceleration, localized parallel constructions of a finite number of patches are preferable [NYM+ 08]. Here localized, parallel means that each construction step is parallel for all quads or vertices and only needs to access a fixed, small neighborhood of the quad or vertex. Due to the size limitations, this paper does not discuss GPU specifics, but addresses the fundamental lower bound question: how to convert each extraordinary quad into a macro-patch, consisting of m × m bi-cubic pieces, so as that a general quad mesh is converted into a smooth surface. Prompted by the impending ability of GPUs to tessellate and adaptively evaluate finitely patched polynomial surface at animation speeds, there have recently been a number of publications close to this problem. Loop and Schaefer[LS08] propose bi-cubic C 0 surfaces with surrogate tangent patches to convey the impression of smoothness via lighting. Myles et al. [MYP08] perturb a bi-cubic G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 87–96, 2008. c Springer-Verlag Berlin Heidelberg 2008
88
J. Fan and J. Peters p4 p3
p2 k g10
p0
p1
g00 0
p
2n0 −1
p2n
v0k,μ k g11
v1k,μ
uk,μ 0
bk,μν 00 w0k,μ
v2k,μ
uk,μ uk,μ 1 2 k,μν b20 w1k,μ w2k,μ
v3k,μ
w3k,μ
Fig. 1. left: extraordinary vertex p0 with n0 direct neighbors p2k−1 , k = 1 . . . n0 . midk k dle: limit point g00 , tangent points g10 and ‘twist’ coefficients g11 . right: BB differences.
base patch near non-4-valent vertices by coefficients of a (5,5) patch to obtain a smooth surface. PCCM [Pet00] generates smooth bi-cubic surfaces but requires up to two steps of Catmull-Clark subdivision to separate non-4-valent vertices. This proves that m = 4 suffices in principle. But bi-cubic PCCM can have poor shape for certain higher-order saddles (e.g. the 6-valent monkey saddle Fig. 5, row 3) as discussed in [Pet01]. Below we specify an algorithm that constructs smoothly connected 3 × 3 C 1 macro-patches without this shape problem; and discuss why the approach fails when m < 3.
2
Notation, and Why m = 1 Need Not Be Considered
We denote the kth bi-cubic Bernstein B´ezier (BB) patch, k = 1 . . . n0 , surrounding a vertex p0 of valence n0 (Fig. 1) by 3 3 k,μ,ν 3 k,μ,ν i 3−i 3 b (u, v) := bij u (1 − u) v i (1 − v)3−j . (1) i j i=0 j=0 Here μ, ν indicate a piece of the m × m macro-patch (see Fig. 2, left, for m = 3). The BB coefficients (control points) of the tensor-product patch bk,μ,ν are therefore labeled by up to 5 indices when we need to be precise (Fig. 2): bk,μ,ν ∈ R3 , ij
k = 1 . . . n0 ,
μ, ν ∈ {0, . . . , m − 1},
i, j ∈ {0, 1, 2, 3}.
(2)
For the two macro-patches meeting along the kth boundary curve bk,μ0 (u, 0) = bk−1,0μ (0, u), μ = 0, . . . , m−1, we want to enforce unbiased (logically symmetric) G1 constraints ∂2 bk,μ0 (u, 0) + ∂1 bk−1,0μ (0, u) = αki (u)∂1 bk,μ0 (u, 0),
i = 0 . . . , m − 1,
αki
(3)
where each is a rational, univariate scalar function and ∂ means differentiation with respect to the th argument. If αki = 0, the constraints enforce (parametric) C 1 continuity. The polynomial equalities (3) hold for the kth curve exactly when all m × n0 polynomial coefficients are equal. The coefficients are differences of the BB control points: vik,μ := bk,μ0 − bk,μ0 i1 i0 ,
wik,μ := bk−1,0μ − bk−1,0μ , 1i 0i
k,μ0 uk,μ := bk,μ0 i i+1,0 − bi0 .
On Smooth Bicubic Surfaces from Quad Meshes
89
bk,02 bk,01 bk,00 00
bk,00 bk,10 bk,20 bk,00 bk,20 30 00 01 1 0 01 1 0 11 1 11 00 00 1 0 0 0 1 0 00 11 00 11 00 0 11 01 1 11 0 0 1 00 11 0 1 00 1 01 00 1 00 11 00 11 0 11 1 00 11 000 11 00 11 00 n 11
bk−1,10
11 00 00 11 00 11 00 11 00 11 11 00 00 11 11 1 00 00 11 00 11 001 11 00 11 0n 00 11 0 00 1 11 0 1
Fig. 2. left: Patches bk,μν and coefficients bk,μν . right: Coefficients shown as ij black disks are determined by subdividing the initialization gij (see (11)), coefficients shown as small squares are determined by (14) and (15), coefficients shown as big yellow squares are determined by C 2 continuity of the boundary curve (16), (17), (18).
The differences need only have a single subscript since we consider curves (in u) and a simpler superscript since ν = 0. For example, if we choose αki (u) := λki (1 − u) + λki+1 u, then (3) formally yields 4m equations when μ = 0, . . . , m: v0k,μ + w0k,μ = 3(v1k,μ + w1k,μ ) 3(v2k,μ + w2k,μ ) v3k,μ + w3k,μ
= = =
λkμ uk,μ 0
(4)
k,μ k 2λkμ uk,μ 1 + λμ+1 u0 k,μ k λkμ uk,μ 2 + 2λμ+1 u1 λkμ+1 uk,μ 2 .
(5) (6) (7)
By definition, (7)μ=i = (4)μ=i+1 , i.e. constraint (7) when substituting μ = i is identical to constraint (4) for μ = i + 1. We need not consider m = 1, i.e. one bi-cubic patch per quad, since the vertex-enclosure constraint [Pet02, p.205] implies, for even n0 > 4 that the normal curvatures and hence the coefficients bk,00 (• Fig. 1,right) of the n0 curves 20 emanating from p0 are related for k = 1, . . . , n0 : the normal component of their alternating sum k (−1)k bk,00 20 must vanish. Since, for a bi-cubic patch, the conk,00 trol point b20 lies in the tangent plane of the kth neighbor vertex (Fig. 1,right), the vertex’s enclosure constraint constrains the neighboring tangent planes with respect to its tangent plane. Therefore, if we fix the degree of the patches to be bi-cubic and allow only one patch per quad then no localized construction is possible. For m > 1, the coefficients bk,00 no longer lie in the tangent plane of the 20 neighbor; so a local construction may be possible. We next give an explicit construction when m = 3.
90
3
J. Fan and J. Peters
Localized Smooth Surface Construction Using a 3 × 3 Macro-patch
We factor the algorithm into four localized stages. First, we define the central point g00 , the tangents g10 − g00 and the face coefficients g11 as an average (see Fig. 1) of – the extraordinary vertex p0 with valence n0 , and 0 – its 1-ring neighbors p1 , p2 , . . . , p2n . In a second stage, we partition the quad into a 3 × 3 arrangement (Fig. 2) and establish its boundary; in the third, we determine the cross-boundary derivatives so that pairs of macro-patches join G1 (Equation (3)) and in the final stage, we determine the interior coefficients. By this construction, a macro-patch joins at least parameterically C 1 with an unpartitioned spline patch (see Fig. 5, row 2, where the second entry displays each polynomial piece in a different color). 1. [Initialization] It is convenient (and shown to be effective to approximate the Catmull-Clark limit surface) to set gij according to [MYP08]. That is to set g00 to the limit of p0 under Catmull-Clark subdivision (red circle in Fig. 1 k middle) and place the g10 (blue circle in Fig. 1 middle) on the Catmull-Clark tangent plane: g00 =
k g00
n0 0 l=1 n p0 + 4p2l−1 + p2l := , n0 (n0 + 5)
k = 1 . . . n0 ,
(8)
σn0 αi p2j−1 + βi p2j , 3(2 + ωn0 ) j=1 n0
k k g10 := g00 + e1 ckn0 + e2 skn0 ,
k g11 :=
ei :=
1 (4p0 + 2(p2k+1 + p2k+3 ) + p2k+2 ), 9
k = 1 . . . n0 .
(9) (10)
where the scalar weights are defined as ckn0 := cos
2πk , n0
ωn0 := 16λn0 − 4, α1 := ωn0 cj−1 , n0
2πk , cn0 := c1n0 , n0 1 := (cn0 + 5 + (cn0 + 9)(cn0 + 1)), σn0 := 16
skn0 := sin λn0
β1 := cj−1 + cjn0 , n0
α2 := ωn0 sj−1 , n0
0.53 1 4λ 0 n
if n0 = 3, if n0 > 3,
β2 := sj−1 + sjn0 . n0
Symmetric construction of the other three corners of the quad yields 4 × 4 coefficients gij that can be interpreted as the BB coefficients of one bi-cubic patch g : [0, 1]2 → R3 in the form (1). 2. [domain partition and boundary] We partition the domain into 3 × 3 pieces (see Fig. 2 left) and let the 3 × 3 macro-patch inherit vertex position and tangents (black disks in Fig. 2 right) by subdividing g: k bk,00 00 := g00 ,
bk,00 10 :=
2 k 1 g + gk , 3 00 3 10
bk,00 11 :=
k k k k 4g00 + 2g10 + 2g01 + g11 . (11) 9
On Smooth Bicubic Surfaces from Quad Meshes
91
Each macro-patch will be parametrically C 1 (and C 2 in the interior). To enforce the G1 constraints (3) between macro-patches, we need to distinguish two cases when choosing αki depending on whether one of the vertices is regular, i.e. has valence 4. Of course, if both valences are 4 then we simply subdivide g and set αki = 0; and if all four corner points have valence 4, no partition is needed in the first place since g is then part of a bi-cubic tensor-product B-spline patch complex and therefore joins C 2 with its spline neighbors and at least C 1 with macro-patches (Fig. 5, row 2, second entry). 2a. [boundary n0 =4 = n1 ] We choose αki as simple as possible, namely linear if n0 =4 = n1 : λk0 := 2 cos(
2π ), n0
λk3 := −2 cos(
αki (u) := λki (1 − u) + λki+1 u, 2π ), n1
λk1 :=
2λk0
+ 3
λk3
, λk2 :=
(12) λk0
+ 3
2λk3
. (13)
The three cubic pieces of the boundary curve have enough free parameters to enforce Equation (5) for μ = 0 and Equation (6) for μ = 2 and (small squares in Fig. 2 right) by setting k,00 bk,00 20 := b10 +
k−1,00 k,00 k k,00 3(bk,00 − 2bk,00 11 + b11 10 ) − λ1 (b10 − b00 ) , 2λk0
(14)
k,20 bk,20 10 := b20 +
k,20 k,20 k,02 k,20 λk2 (bk,20 30 − b20 ) − 3(b21 + b12 − 2b20 ) k 2λ3
(15)
and joining the three curve segments C 2 (cf. large squares in Fig. 2 right) for μ = 0, 1 :
k,μ+1,0 bk,μ0 := (bk,μ0 )/2, 30 20 + b10 4 1 2 k,20 k,00 k,20 bk,10 10 := b20 − b20 + b10 − 3 3 3 2 2 4 k,20 k,00 k,20 bk,10 20 := b20 − b20 + b10 − 3 3 3
(16) 2 k,00 b , 3 10 1 k,00 b . 3 10
(17) (18)
2b. [boundary n0 = 4 = n1 ] If n0 = 4 = n1 , Lemma 2 in the Appendix shows k that we cannot choose all αi to be linear. (If λk1 = 0 in Lemma 2 then the 2π dependence appears at the next 4-valent crossing bk,10 00 .) We set λ := 2 cos( n0 ) and αk0 (u) := λk (1 − u) +
λk u, 2
αk1 (u) :=
λk (1 − u)2 , 2
αk2 (u) = 0.
(19)
Then bk,00 is defined by (5) and bk,20 by subdividing the cubic boundary of g: 20 10 2 k,00 1 k,00 4 k,20 4 k,20 bk,20 10 := − b00 + b10 + b20 − b30 . 9 3 3 9
(20)
92
J. Fan and J. Peters
From the remaining six G1 constraints across the macro-patch boundary, k k,0 3(v2k,0 + w2k,0 ) = λk uk,0 2 + λ u1 , 3 k,10 3(v3k,0 + w3k,0 ) = λk (bk,10 10 − b00 ), 2 3 k,10 9(v1k,1 + w1k,1 ) = λk (bk,10 30 − b20 ), 2 the two listed as (22) are linked to the remaining the macro-patches be internally C 1 :
for μ = 0, 1 : v2k,μ
+
3(v2k,1 + w2k,1 ) = 0
(21)
3(v3k,1 + w3k,1 ) = 0
(22)
9(v1k,2 + w1k,2 ) = 0.
(23)
four by the requirement that
μ+1,20 bk,μ0 = (bk,μ0 )/2 30 20 + b10
v1k,μ+1
=
2v3k,μ ,
w2k,μ
+
w1k,μ+1
(24) =
2w3k,μ .
(25)
Thus, by adding 3 times (21) to (23) and subtracting 6 times (22) and observing (25), we eliminate the left hand sides and obtain one constraint purely in the boundary coefficients multiplied by λk = 0. A second constraint arises since αk1 (u) being quadratic implies that the middle segment bk,10 (u, 0) is quadratic, i.e. its third derivative is zero: k,10 k,10 k,10 bk,10 30 − 3b20 + 3b10 − b00 = 0.
(26)
Both constraints are enforced by setting 41 k,00 4 4 k,00 36 k,00 9 4 k,00 bk,10 b20 + bk,20 bk,10 b20 + bk,20 10 := 10 − b10 , 20 := 10 − b10 . 25 25 5 25 25 5 Together with (24), this fixes the macro-patch boundary (Fig. 2, right). 3. [First interior layer, G1 constraints] Enforcing the remaining four G1 constraints in terms of the red coefficients in Fig. 3 is straightforward and our symmetric solution is written out below. 3a. [n0 =4 = n1 ] k,μ k,μ k k λkμ uk,μ 2λkμ uk,μ 2 + 2λμ+1 u1 1 + λμ+1 u0 k,μ0 , h2μ := b10 + 6 6 1 ˜k,00 ˜k−1,00 1 ˜k−1,00 ˜k,00 k,μ0 k−1,0μ μ=0, 1 : b21 :=h1μ + (b21 − b12 ), b12 := h1μ + (b12 − b21 ) 2 2 1 ˜k,10 ˜k−1,01 1 μ=1, 2 : bk,μ0 ), bk−1,0μ := h2μ+ (˜bk−1,01 − ˜bk,10 11 := h2μ+ (b11 − b11 11 11 ). 2 2 11
h1μ := bk,μ0 20 +
k k,0 λk0 uk,0 λk0 uk,1 2 + λ0 u1 2 , h2 : = bk,10 + 10 6 12 1 ˜k,00 ˜k−1,00 1 bk,00 ), bk−1,0j : = h1 + (˜bk−1,00 − ˜bk,00 21 := h1+ (b21 − b12 12 21 ) 2 2 12 1 ˜k,10 ˜k−1,01 1 bk,10 ), bk−1,01 : = h2 + (˜bk−1,01 − ˜bk,10 11 := h2 + (b11 − b11 11 11 ) 2 2 11 1 ˜k,10 ˜k−1,01 1 ˜k−1,01 ˜k,10 k,10 bk,10 ), bk−1,01 : = bk,10 − b21 ) 21 := b20 + (b21 − b12 12 20 + (b12 2 2 1 ˜k−1,02 ˜k,20 k,20 1 ˜k,20 ˜k−1,02 bk,20 ), bk−1,02 : = bk,20 − b11 ) 11 :=b10 + (b11 − b11 11 10 + (b11 2 2
3b. [n0 = 4 = n1 ] h1 := bk,00 20 +
On Smooth Bicubic Surfaces from Quad Meshes
93
˜k−1,00 , ˜bk,10 and ˜bk−1,01 are obtained by subdividing where the coefficients ˜bk,00 21 ,b12 11 11 the cubic Hermite interpolate to the transversal derivatives at the endpoints into three parts: ˜bk,00 : = − 4 bk,00 + 4 bk,00 + 1 bk,20 − 2 bk,20 , 21 9 01 3 11 3 21 9 31 ˜bk−1,00 : = − 4 bk−1,00 + 4 bk−1,00 + 1 bk−1,02 − 2 bk−1,02 , 12 9 10 3 11 3 12 9 13 ˜bk,10 : = − 20 bk,00 + 4 bk,00 + bk,20 − 16 bk,20 , 11 21 27 01 3 11 27 31 ˜bk−1,01 : = − 20 bk−1,00 + 4 bk−1,00 + bk−1,02 − 16 bk−1,02 . 11 12 27 10 3 11 27 13 ˜k−1,01 ,˜bk,20 and ˜bk−1,02 are defined analogously. The coefficients ˜bk,10 21 ,b12 11 11
5 4
20 4 − 27 3
1 2
3
1
1 2
1 − 12
3
k,10 k,10 k,20 Fig. 3. left: Once the red BB-coefficients bk,00 of the first interior layer 21 , b11 , b21 , b11 1 are set, the green coefficients are C averages of their two red neighbor points. Blue, pink and black coefficients are inner points computed by the rules on the right: (top) by subdivision and (bottom) so that the pieces join C 2 : the coefficient indicated by the large ◦ is a linear combination, with weights displayed, of the coefficients shown as •.
4. [macro-patch Interior] At the center (four black disks in Fig. 3, left) the coefficients are computed according to 3,right-top, bk,11 11 :=
k,01 k,21 1 k,21 k,12 1 k,12 4 k,01 20 k,10 4 k,10 (− 20 27 b01 + 3 b11 +b21 + 2 b31 )+(− 27 b10 + 3 b11 +b12 + 2 b13 ) , 2
and symmetrically for the other three corners. Coefficients marked in blue and pink are defined by the rules of Fig. 3 right-bottom and the remaining coefficients on the internal boundaries are the C 1 average of their neighbors, e.g. bk,11 := 10 k,10 1 2 (bk,11 + b )/2 so that the patches join C everywhere and C with the central 11 21 subpatch. This completes the local construction of C 1 3×3 macro-patches, one per input quad and so that neighbor macro-patches join G1 . Before we show examples, we discuss why we did not choose m = 2.
94
4
J. Fan and J. Peters
Can m = 2 Provide a Construction?
We show that an analogous construction is not possible for a 2 × 2 macro-patch. Since the degree of ∂2 bk,μ0 (u, 0) and ∂1 bk−1,0μ (0, u) is 3, choosing αkμ to be quadratic implies that ∂1 bk,μ0 (u, 0) must be linear, i.e. each boundary curve segment is piecewise quadratic. If both n0 and n1 are even and not 4, then the vertex enclosure constraint (see Section 2) implies that the shared endpoint of the two quadratic segments is determined independently from both sides – so a local construction is not possible just as in the case m = 1. We therefore choose αkμ (u) := λkμ (1 − u) + λkμ+1 u, μ = 0, 1 with the unbiased choice (we do not prefer one sector over another) λk0 := 2 cos
2π , n0
λk2 := −2 cos
2π . n1
(27)
As in Section 3 (11), we enforce (4)μ=0 and (7)μ=1 of the eight G1 continuity constraints by initializing position and tangents (black filled circles in Fig. 4 left) by subdividing g: k bk,00 00 := g00 ,
bk,00 10 :=
k k g00 + g10 , 2
bk,00 11 :=
k k k k g00 + g10 + g01 + g11 . 4
(28)
Lemma 1. If each macro-patch is parametrically C 1 , αk0 and αk1 are linear with λk0 and λk2 unbiased then the G1 constraints can only be enforced for all local k,00 k,00 initialization of bk,00 if n0 = n1 . 00 , b10 , b11 Proof. Due to the internal C 1 constraints, adding (6)μ=0 and (5)μ=1 and subtracting six times (7)μ=0 yields 3(v2k,0 + w2k,0 )+ 3(v1k,1 + w1k,1 )− 6(v3k,0 + w3k,0 ) = 0 and therefore the right hands sides satisfy k k,0 k k,1 k k,1 k k,0 λk0 uk,0 2 + 2λ1 u1 + 2λ1 u1 + λ2 u0 = 6λ1 u2 .
(29)
That is, for an internally C 1 macro-patch, G1 constraints across the macropatch’s boundary imply a constraint exclusively in terms of uk,μ i , i.e. derivatives along the boundary! Since initialization fixes the local position, tangent and twist coefficients at each vertex, (5)μ=0 determines bk,00 and (6)μ=1 determines 20 k,00 k,00 k,10 1 bk,10 ; and C continuity determines b := (b + b 10 30 20 10 )/2. Thus all vectors of (29) are fixed and the remaining single free scalar λk1 cannot always enforce k,1 k (29). But if n0 = n1 then λk0 = −λk2 and uk,0 2 = u0 ; and λ1 = 0 solves (29).
bk,01
bk,11
bk,00 bk,10 k,00 k,00 k,10 b b b 0011 00 11 00 11 0 1 20 1 10 1 00 11 00 0 0 00 11 00 0 1 00 11 00 11 0 011 00 1 00 1 11 0 00 11 0011 0 1 1 1 n0 11 k−1,01 n k−1,00 b b Fig. 4. left: Coefficients initialized according to (28). right: Indexing.
On Smooth Bicubic Surfaces from Quad Meshes
95
Fig. 5. 3 × 3 macro-patch construction. left: Quad mesh and surface; middle(top four): Gauss curvature distribution on the surface, right: Highlight lines.
96
5
J. Fan and J. Peters
Conclusion
Curvature distribution and highlight lines on the models of Fig. 5 illustrate the geometric soundness of the m = 3 macro-patch construction. Choosing αk1 and hence the middle boundary curve segment to be quadratic, avoids the PCCM shape problem which is due to αk0 and hence the first segment being quadratic. Conversely, Section 4 suggests that there is no obvious construction for m = 2; whether a more complex Ansatz can yield a localized construction for m = 2 remains the subject of research.
References [CC78] [LS08] [MYP08] [NYM+ 08] [Pet00] [Pet01] [Pet02]
Catmull, E., Clark, J.: Recursively generated B-spline surfaces on arbitrary topological meshes. Computer Aided Design 10, 350–355 (1978) Loop, C., Schaefer, S.: Approximating Catmull-Clark subdivision surfaces with bicubic patches. ACM Trans. Graph. 27(1), 1–11 (2008) Myles, A., Yeo, Y., Peters, J.: GPU conversion of quad meshes to smooth surfaces. In: Manocha, D., et al. (eds.) ACM SPM, pp. 321–326 (2008) Ni, T., Yeo, Y., Myles, A., Goel, V., Peters, J.: GPU smoothing of quad meshes. In: Spagnuolo, M., et al. (eds.) IEEE SMI, pp. 3–10 (2008) Peters, J.: Patching Catmull-Clark meshes. In: Akeley, K. (ed.) ACM Siggraph, pp. 255–258 (2000) Peters, J.: Modifications of PCCM. TR 001, Dept CISE, U Fl (2001) Peters, J.: Geometric continuity. In: Handbook of Computer Aided Geometric Design, pp. 193–229. Elsevier, Amsterdam (2002)
Appendix: Linear Parameterization and Valence 4 Lemma 2. If k = 4 curves meet at a vertex without singularity and αk0 i is linear with λk1 := cos n2πk , for fixed scalar > 0 and valence nk , then the G1 constraints (3) can only be enforced if nk = nk+2 for k = 0, 1. k Proof. For k = 1, 2, 3, 4, λk0 = 0 and therefore αk0 i := λ1 u. Equation (5)μ=0 then simplifies to k−1,00 k k,0 3(bk,00 ) = 6bk,00 11 + b11 10 + λ1 u0 , k = 1 . . . 4.
Since
4
k k,00 k=1 (−1) (b11
0=
4 k=1
+ bk−1,00 ) = 0 and 11
4
k k,00 k=1 (−1) b00
k,00 k k,0 (−1)k 6(bk,00 10 − b00 ) + λ1 u0 =
4
= 0,
(−1)k (6 + λk1 )uk,0 0
(30)
k=1
implying λk1 = λk+2 for k = 0, 2 since uk,0 = −uk+2,0 . Therefore nk = nk+2 1 0 0 must hold.
Simple Steps for Simply Stepping Chun-Chih Wu, Jose Medina, and Victor B. Zordan University of California, Riverside {ccwu,medinaj,vbz}@cs.ucr.edu
Abstract. We introduce a general method for animating controlled stepping motion for use in combining motion capture sequences. Our stepping algorithm is characterized by two simple models which idealize the movement of the stepping foot and the projected center of mass based on observations from a database of step motions. We draw a parallel between stepping and point-to-point reaching to motivate our foot model and employ an inverted pendulum model common in robotics for the center of mass. Our system computes path and speed profiles from each model and then adapts an interpolation to follow the synthesized trajectories in the final motion. We show that our animations can be enriched through the use of step examples, but also that we can synthesize stepping to create transitions between existing segments without the need for a motion example. We demonstrate that our system can generate precise, realistic stepping for a number of scenarios.
1
Introduction
Motion capture playback, blending, and modification have become the standard suite of motion synthesis tools used for animating video games and increasing number of feature animations. A standard practice associated with such applications is the generation of ‘transitions’ which combine two motion sequences together to create one longer sequence. Transitions also play important role in increasing responsiveness of character in interactive applications. However, the quality of the resulting motion depends on the choice of method(s) for creating the transition. Tell-tale artifacts of a poor transition in general include unnatural foot sliding and physically implausible motion. Groups of researchers have addressed these issues by explicitly removing so-called ‘foot skate’ [1,2] and by enforcing various physical characteristics during motion transition [3,4]. In video games especially, foot skate often appears when a character transitions from standing in one configuration to another. Unless the feet are perfectly aligned, naive transition techniques will induce unnatural foot sliding as the character goes from the initial to the ending stance. A difficult problem in this scenario is to create a transition which accounts for the differences in the placements of the feet while also accounting for the movement of the body in a realistic manner. We observe that a human actor would tackle similar scenarios by shifting the weight of the body and taking a step to re-position the feet. We G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 97–106, 2008. c Springer-Verlag Berlin Heidelberg 2008
98
C.-C. Wu, J. Medina, and V.B. Zordan
propose that a similar mechanism for stepping is necessary to generate a plausible transition. Based on this insight, a new problem arises during the special conditions of transitions where a character begins and ends standing. This paper introduces a general method for synthesizing stepping actions in humanoid characters. While the technique is showcased in conjunction with a motion database of examples which enrich the final motion, the power of the approach comes from our models which drive the character using idealized, parametric trajectories for the stepping foot and projected center of mass. We show that these trajectories can be simple mathematical functions built empirically from observations of example stepping movements and parameterized to be controlled by key features of the desired action. The result is a stepping system that allows a character to ‘transition’ from one double-stance pose to another automatically by stepping. The simplicity of the approach lends itself to being adopted easily and immediately by game developers and technical animators alike. To show the generalness of the results, we demonstrate example animations for a variety of scenarios.
2
Related Work
Editing motion capture data has been the focus of an immense number of publications in the past two decades, too many to mention here individually. However, we focus our discussion of the related work on the domain of motion transitions and previous research related to balanced stepping. Since the introduction of automatic data re-use using motion graphs [5,6,7], the concept of generating transitions has become increasingly popular. Some contributions highlight ideal lengths for transitions [8], use of multiple frames to generate high quality transitions [9] and ways to generate transitions with constraints [10]. Transitions are typically performed by blending the root and joint angles over time. Cleaning up transitions is usually done with an algorithm for correcting foot motion [1,2] and inverse kinematics (IK). Tools for correcting balance ensure that motion transitions remain physically plausible by controlling the center of mass (COM) or zero moment point [11,12]. Closely related research includes balance maintenance for stepping activities [13,14]. However, these projects are focused on choosing how to step in response to a disturbance as opposed to our goal which is to generate a specific step based on a desired foot placement. In this manner, we are more aligned with researchers interested in driving animation with footprints, as in [15]. Researchers in robotics have also proposed techniques for automatically generating stepping motion for use in control of humanoid robots. Similar challenges in this area include choosing step location and maintaining balance. Various numerical values have been introduced to define balance (for a summary see [16]) and many balance control papers have appeared. One big difference between our goals and those of roboticists is that we care more about the simplicity of the solution than on precise control as long as our technique does not sacrifice visual quality. One simplifying model is to treat the dynamics as an inverted
Simple Steps for Simply Stepping
99
pendulum [17] and to control the robot to perform stepping based on a point mass and massless leg. We use such a conceptual model to dictate the motion of the COM for a stepping action. Another group of researchers introduce the concept of the “capture point” which is the step placement (point) which yields a single step to recover from a perturbation [18]. We feel a similar model could be used in conjunction with our technique to choose the position of the foot and the timing, which are the two required inputs to our system.
3
Stepping Algorithm
The algorithm we employ has two particular foci. First, we control the stepping foot, both its path and its speed along that path. And second, we control the COM, again both position and velocity. We choose to control the COM in order to make the visible weight shift that corresponds to stepping actions in humans. We propose that these two factors alone encapsulate many of the physical attributes of a single step. While we also include motion examples of stepping to enrich the upper-body movement of the final motion, our hypothesis is that by moving the stepping foot and COM realistically we can generate believable stepping simply. Our technique incorporates these two components into an animation transition using optimization. In particular, we employ a frame-by-frame optimizer which takes a frame from starting blend as input and produces a modified posture that enforces the desired stepping foot and COM trajectories. We expect the choice of timing and position for the final stepping foot be provided as input to our system. Along with the character’s current motion, we extract an example motion from our database and use the sequence to compute the initial paths for the foot and COM. We break the description of our algorithm into two phases, preprocessing and step generation. In the preprocessing stage, we determine the necessary inputs to the stepping algorithm, specifically: 1. 2. 3. 4.
Input (from the user) the final stepping foot position and step duration Identify the closest example from the step database Adjust the ending pose from example using IK to place foot Extract the COM ending place from the (adjusted) end pose
Steps 3 and 4 are used solely to determine the final position of the COM based on the motion sequence selected. Alternatively, we can force the system to transition to a specific motion sequence. In this case, Steps 2 and 3 can be skipped. Motion generation follows a straightforward sequence: 1. 2. 3. 4.
Compute the stepping foot path, Pf , and speed profile, Vf Compute the COM path, Pc and speed profile, Vc Blend to example with support foot as root Modify blend with optimizer to meet COM/foot trajectories
100
C.-C. Wu, J. Medina, and V.B. Zordan
The starting blend from Step 3 is treated as the input to the optimizer. To keep the support foot from moving during transition, the blend is performed by treating the chosen support foot as the fixed root of the branching chain for the body. All other parts of the body move by smoothly interpolating the included joint angles. More details about each part of our algorithm are described in the following sections.
4
Stepping Foot Control
To define the stepping foot motion appropriate for the desired step/transition, we determine the foot’s path and its speed along that path. We assume that the path and speed profile are related by the distance covered from start to finish. That is, the total path displacement must equal the integral of the function chosen for the speed. We also assume that distance covered is monotonically increasing along the path. We follow a similar set of definitions and assumptions for COM control.
Fig. 1. Foot speed, Vf . The black, dashed curve is the idealized normal speed fit using a Gaussian centered at 0.5. The rest are normalized sample profiles taken from various examples in our step database.
We model stepping as if it is a point-to-point reach. Upon inspection of our database of examples, we found remarkable uniformity - nearly-linear, point-to-point paths for each stepping foot. There has been in-depth investigation performed on hand point-to-point movement for reaching tasks (see [19,20]) and, in this body of work, it is commonly accepted that the hand traverses an approximately straight-line path with a bell-shaped speed profile. For our foot model, we adopt a similar estimate for the foot trajectory, Pf , by forcing the foot to traverse the line segment formed by its starting and ending position while using a normal Gaussian to serve as the bell-shaped speed curve. That is Vf = ae
−(x−0.5)2 2w2
(1)
defines the speed along Pf . This idealized speed curve is plotted in comparison to several speed profiles taken from our database in Figure 1. When the recorded
Simple Steps for Simply Stepping
101
curves are normalized in both time and amplitude, they show remarkable similarity, independent of the stepping direction, the length of the step, or the duration. We adjusted the shape (width) of our normalized Gaussian shown by manually setting the constant, w, to be 0.08. This value is used for all our results using stepping examples. To align the speed profile with the path, we must control the area under the curve to be equal to the distance from the start to the end of the footstep. We automatically tune the Gaussian by scaling amplitude, a, after integrating the curve for the normalized amplitude shown in the figure. Note this integration need only be done once and can then be scaled by a to match the specific (known) distance covered in the to-be-synthesized motion.
5
Center of Mass Control
As with the foot, to control the COM we will define a simplified model which captures the features of the human examples. Again our path and speed are both idealized from observations about the stepping examples recorded in our database. The COM path follows a parabola-like trajectory starting and ending at known points and moving toward and away from the pivot foot. For Pc , we found empirically that a simple quadratic Bezier curve which uses the start, end, and pivot as control points reasonably maps out the path of the COM found in examples in our database. Comparisons appear in the results section. For the speed, we observe in the recorded motions consistent trends in the trajectory of the COM velocity broken down into three phases. 1) Push off. In this phase, before the foot is lifted, the COM begins to accelerate at a fairly constant rate toward the support foot. 2) Free fall. The second stage has the foot off the ground and we see a trajectory that mimics an unactuated inverted pendulum with the center of mass accelerating uniformly away from the support foot (now out of static balance.) 3) Landing. The stepping foot reaches the ground and the motion induced in the second stage is dissipated with a slow changing acceleration back toward the (original) support foot. What we infer from these observations is that three stages with constant acceleration reasonably describe the observed velocity profiles. Note, these phenomena are described in the coordinate frame oriented toward the support foot. An idealized inverted pendulum which pivots about the support foot models both our path and speed observations reasonably. Based on a pendulum model, our choice of path trajectory, Pc , is sensible since an idealized pendulum moves its body on a ballistic, quadratic path. To fit the velocity characteristics for the COM, we could approximate the effects of the stepping leg as applying a uniform ‘push-off’ and ‘landing’ force before and after the step. (The minimum jerk theory for reaching partially supports this proposition [20].) An idealized constant force would yield a constant acceleration for push-off and landing. Constant acceleration is also reasonable for the middle phase when the body feels only the effects of gravity. Thus, for the model of our COM speed profile, Vc , we choose the piecewise linear function shown in Figure 2. We derive
102
C.-C. Wu, J. Medina, and V.B. Zordan
Fig. 2. COM speed, Vc . Ideal and actual speed profiles in the direction of the support foot. That is, only motion toward and away from the pivot foot contribute to the data plotted, plus signs are derived from a real example. The timing information, t0 − t3 , which delimit the stages (push off, free fall, and landing, respectively) can be estimated from the motion example by detecting when the stepping foot leaves and reaches the ground. Based on the pendulum model, m2 is set to gsin(θ) where θ is the lean angle between vertical plane and support leg and g is gravity. Areas, A0 and A1 , link D, the displacement of the COM derived from Pc , to the slopes m1 and m3 .
the terms of the velocity segments shown from known (or approximate) values for timing, t0 − t3 , and the COM displacement, D, which is extracted from the Bezier curve, Pc .
6
Synthesis for Stepping
To generate the starting blend given the stepping action parameters, we propose a simple, but effective interpolation synthesis technique. The problem here is to concatenate the motion the character is currently following with the stepping motion in the example. To be successful, the transition should not introduce any unwanted artifacts. The most straightforward solution is to align the transitionto motion globally to the character’s current position and facing direction and to blend the root position and orientation as well as the joint angles. However, in general, this approach introduces undesirable foot sliding. Instead, we align the support foot of the before and after motion and use this as the fixed root for the blend. The system then performs the blend by interpolating over the errors for
Simple Steps for Simply Stepping
103
the root orientation and the joint angles across the transition sequence. Note we do allow the support foot to rotate across the transition. This rotation is usually small if the facing direction of the two motions are closely aligned and acts to pivot the foot if there is a larger discrepancy. We show that such rotations appear natural looking in our results. In our implementation, our system interpolates by ‘slerp’-ing quaternions, with a simple ease-in/ease-out (EIEO) time-based weighting across the transition. Once we have the starting blend, we modify it to uphold the stepping foot and COM trajectories determined for the transition. We accomplish this goal on a frame-by-frame basis, first applying IK to place the stepping foot at the desired position and then using an optimization to reach the desired COM. The optimizer works by moving the pelvis position in the horizontal plane and using an IK sub-routine [21] to generate adjustments for each leg which enforce the proper foot placement. A similar approach is described here [22]. Further, our solver constrains the height of the pelvis to maintain a set maximum length for each leg, lowering the pelvis automatically to avoid stretching either leg beyond its limit. The beauty of this optimization routine is that it chooses only the placement of the pelvis in the horizontal plane, but shifts the entire body and subsequently controls the projected COM. Our implementation uses Numerical Recipes’ BFGS routine [23] which employs a quasi-Newton, gradient-based search to determine the pelvis location. The initial location is taken from the starting blend. The performance index is simply the error between the desired and current projected COM. The position of the desired COM is extracted from Pc by moving along the Bezier curve until the path displacement satisfies Vc . Likewise, the stepping foot location is set each frame to follow Pf while also satisfying the rate determined from Vf .
7
Implementation and Results
Our final implementation includes additional details that need to be described. Our stepping database was recorded by systematically creating a series of normal steps taken in various directions. We include twenty stepping examples which begin and end with double-stance. We use our system in two modes: starting from a known stance and transitioning to a modification of one of the step examples; or by combining two existing clips, i.e. without using a step example. When generating a step animation that includes an example, we select the example in the database which is closest based on the (relative) desired step location. For results without using examples, we found that adjusting the width of Equation 1 is sometimes necessary to avoid abrupt movement of the stepping foot. The running time of our system is amply fast to be used at interactive rates. Results. We show two types of results in the accompanying video to demonstrate the range of animations that are possible using the technique. First, we include examples which use our stepping database. We show an animation of a series of steps using left and right feet alternately to create a careful navigation (see Figure 3.) We compare the quality of a second synthesized series with a
104
C.-C. Wu, J. Medina, and V.B. Zordan
Fig. 3. Careful stepping. Navigating an extreme environment by precisely placing steps shows off a series of four steps completely synthesized by our system.
Fig. 4. Animation of a series of steps pivoting on the left foot. The red X marks the consecutive end locations of the steps synthesized, the X values were taken from the motion capture sequence shown in the lefthand plot of Figure 5.
Fig. 5. Comparisons for stepping. Cartesian plots for the foot and COM paths shown in red and green lines respectively over three consecutive steps. On the left is a contiguous motion capture example held out of the database but used as target input for foot placement and timing. In the middle is motion resulting from our model (also shown in Figure 4). On the right is the starting blend, as described in the text.
continuous motion sequence of three steps held out of the database (see Figures 4 and 5). Next, we include two animation examples that are generated without the step database. The goal here is to breakdown the contributions of each component of our system and to show off the power of our technique for creating seamless transitions by stepping. In the video, we show a turning task which
Simple Steps for Simply Stepping
105
is derived from simply by rotating a contiguous motion of a “ready-stance” in martial arts. We contrast the optimized result with the starting blend. Next, we modify a series of fighting attacks to control the direction of one kick by changing the stepping motion preceeding the attack.
8
Discussion and Conclusions
We have demonstrated the power of our simple method for generating controlled stepping movement. The underlying assumptions in our system are motivated by motor theorists and are supported by comparisons with motion capture examples of stepping. While the technique is very simple, used in combination with a stepping motion database we can generate rich motion that is comparable to unmodified motion captured stepping. Our approach does include certain limitations. First, the system does not make any modifications to the upper body. While we know the upper body will respond to the movement of the lower body during stepping, we rely on the upper-body response embedded in the stepping example. When we remove this example, the motion of the upper body is computed solely from the interpolants and their blend. There is no guarantee that this will result in realistic motion. Likewise, the pivoting of the support foot is derived solely from the starting blend and we feel it is acceptable but not truly reflective of what we see in the motion database. The piecewise linear model for the velocity of the COM is likely too over-simplified to match human motion tightly, although we found it acceptable for our purposes. And finally, if user inputs a final stepping foot position where the distance is farther than the reaching limitation of single step for current subject, our system currently is unable to automatically generate multiple steps to achieve the final destination since this requires motion planning which is beyond the scope of this paper. Motion generation from a system like the one proposed is useful for directly animating a character. However, we believe it is also potentially valuable for informing a control system when employed in the activation of a physical character or robot. We see this as a promising direction for future work. As is, the system we describe is easy to implement and fast to run, and so we hope it is adopted by game developers and animators who need to take steps toward stepping.
References 1. Kovar, L., Schreiner, J., Gleicher, M.: Footskate cleanup for motion capture editing. In: Symposium on Computer animation, pp. 97–104 (2002) 2. Ikemoto, L., Arikan, O., Forsyth, D.: Knowing when to put your foot down. Interactive 3D graphics and games, 49–53 (2006) 3. Rose, C., Guenter, B., Bodenheimer, B., Cohen, M.F.: Efficient generation of motion transitions using spacetime constraints. In: ACM Siggraph, pp. 147–154 (1996) 4. Shin, H., Kovar, L., Gleicher, M.: Physical touch-up of human motions. In: Proceedings of the 11th Pacific Conference on Computer Graphics and Applications, p. 194 (2003)
106
C.-C. Wu, J. Medina, and V.B. Zordan
5. Arikan, O., Forsyth, D.: Interactive motion generation from examples. In: ACM Siggraph, pp. 483–490 (2002) 6. Kovar, L., Gleicher, M., Pighin, F.: Motion graphs. In: ACM Siggraph, pp. 473–482 (2002) 7. Lee, J., Chai, J., Reitsma, P., Hodgins, J., Pollard, N.: Interactive control of avatars animated with human motion data. In: ACM Siggraph, pp. 491–500 (2002) 8. Wang, J., Bodenheimer, B.: Computing the duration of motion transitions: an empirical approach. In: Symposium on Computer animation, pp. 335–344 (2004) 9. Ikemoto, L., Arikan, O., Forsyth, D.: Quick transitions with cached multi-way blends. Interactive 3D graphics and games, 145–151 (2007) 10. Gleicher, M., Shin, H., Kovar, L., Jepsen, A.: Snap-together motion: assembling run-time animations. Interactive 3D graphics, 181–188 (2003) 11. Boulic, R., Mas, R., Thalmann, D.: Position control of the center of mass for articulated figures in multiple support. In: Eurographics Workshop on Animation and Simulation, pp. 130–143 (1995) 12. Tak, S., Song, O., Ko, H.: Motion Balance Filtering. Computer Graphics Forum 19, 437–446 (2000) 13. Yin, K., Pai, D., van de Panne, M.: Data-driven interactive balancing behaviors. Pacific Graphics (2005) 14. Kudoh, S., Komura, T., Ikeuchi, K.: Stepping motion for a humanlike character to maintain balance against large perturbations. In: Proc. of Intl Conf. on Robotics and Automation (2006) 15. van de Panne, M.: From footprints to animation. Computer Graphics Forum 16, 211–223 (1997) 16. Popovic, M., Goswami, A., Herr, H.: Ground Reference Points in Legged Locomotion: Definitions, Biological Trajectories and Control Implications. The International Journal of Robotics Research 24, 1013 (2005) 17. Kajita, S., Kanehiro, F., Kaneko, K., Yokoi, K., Hirukawa, H.: The 3D linear inverted pendulum mode: a simple modeling for a bipedwalking pattern generation. Intelligent Robots and Systems (2001) 18. Pratt, J., Carff, J., Drakunov, S., Goswami, A.: Capture Point: A Step toward Humanoid Push Recovery. In: Proceedings of the IEEE-RAS/RSJ International Conference on Humanoid Robots (2006) 19. Abend, W., Bizzi, E., Morasso, P.: Human arm trajectory formation. Brain 105, 331–348 (1982) 20. Flash, T., Hogan, N.: The coordination of arm movements: an experimentally confirmed mathematical model. Journal of Neuroscience 5, 1688 (1985) 21. Tolani, D., Goswami, A., Badler, N.I.: Real-time inverse kinematics techniques for anthropomorphic limbs. Graph. Models Image Process. 62, 353–388 (2000) 22. Metoyer, R., Zordan, V.B., Hermens, B., Wu, C.C., Soriano, M.: Psychologically inspired anticipation and dynamic response for impacts to the head and upper body. IEEE Transactions on Visualization and Computer Graphics (TVCG) (2008) 23. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C. Cambridge University Press, New York (1994)
Fairing of Discrete Surfaces with Boundary That Preserves Size and Qualitative Shape ˇ ara, and Martina Mat´yskov´a Jana Kostliv´a, Radim S´ Center for Machine Perception, CTU in Prague, Czech Republic {kostliva,sara}@cmp.felk.cvut.cz, http://cmp.felk.cvut.cz
Abstract. In this paper, we propose a new algorithm for fairing discrete surfaces resulting from stereo-based 3D reconstruction task. Such results are typically too dense, uneven and noisy, which is inconvenient for further processing. Our approach jointly optimises mesh smoothness and regularity. The definition is given on a discrete surface and the solution is found by discrete diffusion of a scalar function. Experiments on synthetic and real data demonstrate that the proposed approach is robust, stable, preserves qualitative shape and is applicable to even moderate-size real surfaces with boundary (0.8M vertices and 1.7M triangles).
1 Introduction In recent years, 3D scene reconstruction from stereo became a feasible problem in computer vision [1,2,3]. Its result is typically a triangulated mesh of the scene surface, which however is often too dense, irregular and corrupted by noise (cf. Fig. 1). Mesh fairing is a general problem that occurs in surface reconstruction, mesh interpolation or blending. We will represent discrete surface by a triangulated mesh, a complex, which is a union of 0-simplices (vertices), 1-simplices (edges), and 2-simplices (triangles). Vertex w is a direct neighbour of vertex v if there exists an edge between them. We assume, each edge is incident with at most two triangles. If it is incident with only a single triangle, we call it a boundary edge. Typically, fairing involves the following tasks: (1) mesh simplification, i.e. redundant triangle removal, (2) mesh smoothing, i.e. (random or discretisation) noise suppression, and (3) mesh regularity improvement. In this paper, we address (2) and (3), (1) has been studied e.g. in [4,5,6,7]. Many approaches focused on (2). They are algorithms minimising surface energy, which is based on surface curvature. Curvature is defined only on continuous surface, however, and thus some approximation is required for a discrete surface. In one group of approaches the energy is defined on a continuous surface [6,8,9,7]. The energy [6,8] or its Laplacian [9,7] is then discretised. The second group of approaches define curvature approximation on a discrete surface directly [9,5,10,11]. The last group of approaches is based on different principles, such as image processing [12,13] or position prediction [14]. Improving mesh regularity (3) has not been investigated so deeply as surface smoothing, since most of approaches assume approximately uniform meshes. It could be based on vertex re-positioning [15] or on changing mesh topology [4,5,6,7], which belongs to mesh simplification. We propose an alternative approach which belongs to the second group. Our curvature definition is simpler, which brings computational simplicity over [9]. We formulate G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 107–118, 2008. c Springer-Verlag Berlin Heidelberg 2008
108
ˇ ara, and M. Mat´yskov´a J. Kostliv´a, R. S´
Fig. 1. Results of 3D scene reconstruction task. Histograms of edge lengths and triangle areas (note x-logarithmic scale) demonstrate a huge disproportion in these measures, the red lines correspond to the size of average edge length Eavg and area of equilateral triangle with this edge. Hence, the mesh is full of very small area triangles which used to be thin and long.
fairing as surface optimisation by two criteria: curvature consistency, and mesh regularity. The curvature consistency term fairs locally high curvature and the irregularity term fairs local variations in edge lengths. Since these criteria affect each other, a joint optimisation is required. We say that the surface is ideally faired, if the curvature in each vertex is equal to average curvature of its neighbours and local irregularity vanishes. Our method tends to preserve qualitative shape (convex/concave/elliptic/hyperbolic). To eliminate surface shrinking during fairing, we employ scalar curvature diffusion (not curvature flow diffusion). The main contributions of this paper are: (1) Diffusion of a scalar function (unlike of a curvature flow), which does not result in surface contraction and under certain conditions it provably preserves qualitative shape [16]. (2) Fairing formulation as a joint maximisation of curvature consistency and mesh regularity. (3) Solving the problem with non-monotonic behaviour of our curvature. Fairing surfaces with boundary involves both curve and surface fairing. Hence, formulation of fairing curves is in Sec. 2, of closed surfaces in Sec. 3 and finally by combining these is hinted in Sec. 4. In Sec. 5, we demonstrate proposed approach performance and discuss its properties, Sec. 6 concludes the paper.
2 Curve Fairing ∂H 2 Let C be C 2 continuous closed curve minimising J = ds, where H = H(s, t) ∂s is the curvature, in which s is the curve parametrisation and t represents time. A necessary condition for the existence of an extreme of J gives Laplace’s equation ΔH = ∂2H ∂s2 = 0. Consequently, function H minimising J is a harmonic function, and it can be ∂2 H ∂H found as a solution of diffusion equation ∂H ∂t = λ ∂s2 , which fulfils limt→∞ ∂t = 0. The λ is a diffusion coefficient, 0 < λ < 1. Since H is a harmonic function, by discretisation of the Gauss’s Harmonic Function ¯ i − Hi , where Hi is (scalar) curvature in vertex i and Theorem we derive ΔH = H 1 ¯ Hi = 2 (Hi−1 + Hi+1 ) is the mean of curvatures of direct neighbours. We discretised
Fairing of Discrete Surfaces
(a)
(b)
109
(c)
Fig. 2. The direction of curvature flow (a, b) and irregularity flow (c) at vertex xi
as Hik+1 − Hik . Together with ΔH = 0 this gives a formula for iterative curvature diffusion at vertex i: ∂H ∂t
¯ ik , 0 < λ < 1. Hik+1 = (1 − λ)Hik + λH
(1)
where k represents the iteration step number. 2.1 Curvature Definition Let C = {x1 , . . . , xn }, xj = [x, y, z], be a given closed discrete curve represented by a vertex sequence (i.e. x0 = xn ). Without loss of generality we assume that the curve is in a three-dimensional space. Let xi−1 , xi+1 be direct neighbours of xi . The curvature flow Hi = [hix , hiy , hiz ] of curve C at vertex xi is defined as ∇ d2i1 + d2i2 Hi = , where (2) d2i1 + d2i2 di1 = di1 = xi − xi−1 , di2 = di2 = xi − xi+1 , (3) 2 2 and ∇ di1 + di2 represents formal gradient with respect to the three coordinates of xi . By substituting (3) to (2) we express curvature flow at vertex xi as: di1 + di2 Hi = 2 · 2 . (4) di1 + d2i2 The direction of curvature flow equals to the direction of vector di1 + di2 . Fig. 2 shows the geometric interpretation of Hi . We denote Hi = Hi (2-norm). In Eq. (1), we use only curvature Hi instead of curvature flow. If we used the flow, curves would contract during the fairing process. The proposed curvature has the following properties: (1) Hi = 0 iff di1 = −di2 ; (2) Hi is invariant to curve rotation and translation; (3) in a regular polygon, Hi is oriented from its centre and Hi = 1r , where r is the radius of a circumscribed circle; (4) Hi is not a monotonic function of the vertex angle φ (see Fig. 2(a,b)). Fig. 3(a) shows the plot of Hi vs φ when di1 = di2 and the vertex xi is moved in direction di1 + di2 ; (5) The curvature does not vanish when xi lies on bi = xi+1 − xi−1 unless xi = xi−1 +xi+1 (cf. Fig. 3(b)). Thus, diffusion of Hi also partially controls mesh regularity. 2 On the other hand, we observed if xi is slightly off the line bi , the diffusion tends to pull xi away from the line. Hence, additional term controlling regularity is required; (6) the curvature diffusion preserves inflection points (Fig. 3(c)), hence the qualitative ¯ k = 0 we curve shape: For symmetric inflection points (solid), it holds Hik = 0. Since H i k+1 k k k ¯ have Hi = 0. If xi is slightly deflected (dashed), Hi < 0, since Hi−1> Hi+1 and thus Hik+1< Hik , until Hik= 0.
110
ˇ ara, and M. Mat´yskov´a J. Kostliv´a, R. S´
(a)
dd
(b)
(c)
Fig. 3. Curvature properties: (a) Curvature Hi in dependence on the size of angle φ, when di1 = di2 . (b) Curvature in dependence on vertex xi position, when di1 and di2 are colinear, di1 +di2 = 1, and xi is moved from xi−1 towards xi+1 . (c) Inflection point is preserved.
2.2 Irregularity Definition The second fairing criterion is irregularity, which is treated similarly to curvature. Curve definition as well as other variables are the same as in Sec. 2.1. The irregularity flow Ui = [uix , uiy , uiz ] of curve C at vertex xi is 1 bi bTi Ui = − Pi ∇ d2i1 + d2i2 , Pi = , bi = xi+1 − xi−1 . (5) 4 bi 2 Hence, the irregularity flow becomes: Ui = −
1 Pi (di1 + di2 ) . 2
(6)
Irregularity flow direction is collinear with bi (Fig. 2(c)), again we denote Ui = Ui , which captures distance difference between vertex xi and its direct neighbours. We setup discrete diffusion with equilibrium at zero: Uik+1 = (1 − λu )Uik , 0 < λu < 1. (7) The proposed irregularity has the following properties: (1) Ui = 0 iff di1 = di2 ; (2) Ui is invariant to curve rotation and translation; (3) Ui points directly to equilibrium xi −xeq i state xeq ; (4) Ui is a monotonic function of vertex position on the i , and Ui = 2 line represented by Ui . 2.3 Algorithm The goal of the algorithm is to iteratively move curve vertices so that both criteria, curvature and irregularity, are in equilibrium. The algorithm works with the criteria independently. Fairing under curvature and irregularity are swapped after a given number of iterations. In each iteration, the algorithm visits all curve vertices in turn. For each vertex, new values, irrespective to the used criterion, are computed. The new curve is described only by new scalar values. The new positions of vertices, which correspond to these values, have to be reconstructed. Due to a non-monotonic behaviour of Hi illustrated in Fig. 3 there may exist up to two solutions for a given curvature Hik+1 . We accept the one which corresponds to the ascending interval, details are found in [16].
Fairing of Discrete Surfaces
111
The algorithm has two termination mechanisms: The first one is defined as the minimal average vertex position move θmov . The second criterion is the maximal number of iterations kmax . Both these thresholds are defined by the user. The algorithm flow is given in Alg. 1. Note that the curve need not be oriented for Alg. 1 to work correctly. Local orientation is facilitated by signs s1 , s2 in Step 3. A detailed algorithm description together with equation derivation can be found in [16]. Algorithm 1. Curve Fairing Set k = 1. 1: Set i = 1. If the current fairing criterion is irregularity, go to step 6. k dk k k i1 +di2 k 2 , Hi = Hi . 2 (dk i1 ) +(di2 ) k+1 k k k k ¯ i , where H ¯ ik = 1 ski Hi−1 Compute Hi = (1 − λ)Hi + λH + sk2 Hi+1 and sk1 , sk2 are 2 k k k signs with which the curvature vectors Hi−1 , Hi+1 are transfered to xi . They are positive, if the curvatures are identically oriented with Hki , and negative otherwise. Update xk+1 ← xki + αki (dki1 + dki2 ), where αki is the solution of quadratic equation, i T k T k (dk ) d dk dk (dk i2 i1 ) di2 P = (dk )i12 +(dki2)2 , Q = di1 : k + dk , cos φ = k dk i1 i2 i2 i1 i1 ·di2 k+1 k+1 k+1 k+1 k 2 k (αi ) (2Hi + 4Hi P ) + αi (2Hi + 4Hi P −2Hik ) + Hik+1− Hik = 0. cos φ cos φ Accepted αki ∈ − 12 − 12 Q−2 , − 12 + 12 Q−2 . Q+2 cos φ Q+2 cos φ
2: Evaluate Hki = 2 · 3:
4:
5: Set i = i + 1. If i ≤ n, go to step 2. Otherwise, go to step 10. k −Pi (dk i1 +di2 ) 6: Evaluate Uki = , Uik = Uki . 2 k+1 k 7: Compute Ui = (1 − λu )Ui . 8: Update xk+1 ← xki + λu Uki . i 9: Set i = i + 1. If i ≤ n, go to step 6. 10: If average vertex move is smaller than a threshold θmov or k > kmax , terminate. 11: Set k = k + 1, go to step 1.
3 Closed Surface Fairing In this section, we formulate the task for closed discrete surfaces. Discretisation is performed in the same way as for curves (Sec. 2), only in higher dimension. The definition for curvature diffusion at current vertex is: ¯ k , 0 < λ < 1, H k+1 = (1 − λ)H k + λH (8) ¯ 1 n Hi is the curvature mean of direct neighbours. k represents iteration, and H= i=1 n 3.1 Curvature Definition Let S be the given discrete closed surface represented by a triangular mesh, V = {x1 , . . . , xm }, xj = [x, y, z] be a set of its vertices. The current vertex is labelled x, its direct neighbours N = (x1 , . . . , xn ), xn+1 = x1 , are indexed by i. The curvature flow H = [hx , hy , hz ] of the surface S at vertex x is defined as: n 3 ∇ i=1 A2i H = · n , (9) 2 2 i=1 Ai
ˇ ara, and M. Mat´yskov´a J. Kostliv´a, R. S´
112
dd
(a)
(b)
dd
(c)
Fig. 4. Curvature definition: (a) Neighbourhood of the vertex x. (b) Curvature with dependence on the height of a pyramid with regular 6-polygon base, bi = 1. (c) Curvature with dependence on vertex x position, all vertices are in the same plane. For clearness, the vertex x was moved only along the line p, again bi = 1.
where (cf. Fig. 4(a)) Ai = 12 hi bi = 12 hi bi is the area of a triangle formed by vertices x,xi and xi+1 , h i nis the height at vertex x, bi = xi − xi+1 is the side opposite n x, and ∇ i=1 A2i = 14 i=1 b2i ∇h2i . Let us define Pi = E −
bi bT i b2i
, hi = Pi (x − xi ), h2i = (x − xi )T Pi (x − xi ), Bi =
b2i Pi = b2i E − bi bTi , where Pi is a projection matrix and thus P2i = Pi and PTi = Pi . By substituting to (9) we get curvature flow at vertex x: H=
n 3 i=1 Bi (x − xi ) · n . 2 (x − xi )T Bi (x − xi ) i=1
(10)
n The direction of curvature flow in vertex x equals to the direction of vector i=1 hi and H = H. As in curves, in Eq. (8) we diffuse only curvature H instead of curvature flow. The curvature definition has the following properties: (1) H is invariant to surface rotation and translation; (2) in a regular polyhedron, H is oriented from its centre and H = 1r , where r is the radius of a circumscribed sphere; (3) H is not a monotonic function of the height of the point x neighbourhood. This is illustrated in Fig. 4(b) showing the plot of H vs height of pyramid with regular 6-polygon base; (4) the curvature does not necessarily vanish when x and its direct neighbours lie on the same plane (cf. Fig. 4(c)); (5) curvature definition preserves inflection points. This property is demonstrated in experiments in Figs. 5, 6. 3.2 Irregularity Definition The second fairing criterion is irregularity, treated similarly to curvature (as for curves). Surface definition as well as other variables are the same as in Sec. 3.1. The irregularity flow U = [ux , uy , uz ] of the surface S at vertex x is U=
n 1 1 Ui , Ui = − Pdi ∇ x − xi−1 2 + x − xi+1 2 , n i=1 4
(11)
Fairing of Discrete Surfaces
113
d dT
where n is the number of direct neighbours of vertex x and Pd i = di i i2 , di = xi+1 − xi−1 . Note that x0 = xn and xn+1 = x1 . Pdi is a projection matrix into direction given by the vector di . Consequently, 1 Pdi ((x − xi−1 ) + (x − xi+1 )) . 2n i=1 n
U=−
(12)
The magnitude of irregularity, U = U, defines the distance of the vertex x from imaginary centre of its direct neighbours. We define discrete diffusion as for curves: U k+1 = (1 − λu ) U k , 0 < λ < 1.
(13)
The properties of the irregularity definition are as follows: (1) U is invariant to surface rotation and translation; (2) U points directly to equilibrium state; (3) irregularity is a monotonic function of vertex position on the line given by U. 3.3 Algorithm The principle of fairing surfaces is the same as for curves (Sec. 2.3). The only small difference is in the sign computation: Explicit global surface orientation is not possible if we want to fair non-orientable surfaces (e.g. M¨ obius strip, Klein bottle). Hence, we will orient the surface only locally, which is always possible and guarantees correct determination of signs si in Step 3 of Alg. 2. The sign will be positive, if the curvature vectors are on the same side of the surface (given by the orientation). will used to simplify algorithm description: B = nThe following nsubstitution be n k k T k T −1 ˆ k = Bxk − B , b = B x , c = i i i i=1 i=1 n n i=1 (xi ) Bi xi , d1 = cn − b B b, x 2 1 b, P = n i=1 Pd i , p1 = n i=1 Pd i xi−1 , p2 = n i=1 Pd i xi+1 . The algorithm flow is given in Alg. 2. A detailed description with equation derivation can be found in [16].
4 Fairing of Surfaces with Boundary In our approach we are able to cope also with surfaces with boundary. Vertices of such surfaces cannot be treated equally, however. Surface vertices are either boundary vertices (which belong to at least one boundary edge) or inner surface vertices. For each boundary, the fairing algorithm for curves Alg. 1 is applied (Sec. 2.3). For inner surface vertices, the fairing algorithm for surfaces Alg. 2 is used (Sec. 3.3) with a single differ¯ k , only curvatures of direct neighbours which are not boundary ence: when evaluating H vertices are used. Note that inner surface does not affect the boundary, while boundary affects the inner surface. Stopping criteria controls the same properties as for individual curve or surface fairing.
5 Experiments We tested the algorithm stability and robustness on synthetic data, to be comparable with other fairing algorithms on standard datasets1 . To the data, we have added 1
By visual comparison with published results.
ˇ ara, and M. Mat´yskov´a J. Kostliv´a, R. S´
114
Algorithm 2. Surface Fairing Set k = 1. 1: Set i = 1. If the current fairing criterion is irregularity, go to step 6. k 2: Evaluate Hk = 32 · (ˆxk )T Bxˆ−1 xˆk +d , H k = Hk . ¯ k , where H ¯ k = 1 n ski · Hik and ski are signs with 3: Compute H k+1 = (1 − λ)H k + λH i=1 n which the curvature vectors Hki are transfered to xk . They are positive, if the curvatures are on the same side of the surface as Hk , negative otherwise. (ˆ xk )T B−1 x ˆk 4: Update xk+1← xk+αk B−1x ˆk, αk is the solution of quadratic equation, P = (ˆ : xk)TB−1x ˆk+d
5: 6: 7: 8: 9: 10: 11:
k k+1 (αk )2 (H k+1 · P ) + αk (2H k+1 · P − − H k = 0. H ) + H k d d Accepted αi ∈ (−1 − xˆk )T B−1 xˆk , −1 + xˆk )T B−1 xˆk ). Set i = i + 1. If i ≤ n, go to step 2. Otherwise, go to step 10. Evaluate Uk = 12 · (−Pxk + p1 + p2 ), U k = Uk . Compute U k+1 = (1 − λu )U k . Update xk+1 ← xk + λu (−Pxk + p1 + p2 ). Set i = i + 1. If i ≤ n, go to step 6. Otherwise, go to step 10. If average vertex move is smaller than a threshold θmov or k > kmax , terminate. Set k = k + 1, go to step 1.
noised input
150 iterations
3000 iterations
Fig. 5. Saddle fairing, added noise with σ = 2Eavg . Plot shows decreasing criteria with increasing iteration, qualitative shape preserved despite extreme noise and many iterations.
(a)
original
(b) input,σ1
= Eavg
(c)
faired
(d) noise σ= 12 σ1 added
(e)
faired again
Fig. 6. Vase fairing: to original solid, extreme noise was added which has been faired out. Then again we added noise and fairing ended in very similar shape to that after the first fairing.
random noise with μ = 0 and σ given in each test. The noise was generated independently for each vertex and coordinate. Finally, on real datasets obtained from 3D scene reconstruction, we show our algorithm usefulness for this kind of application and demonstrate its ability to cope with datasets of up to 1.7M triangles. Parameters were set to: λ = 0.1, λu = 0.01 for all tests, iterations are given in each experiment.
Fairing of Discrete Surfaces
(a)
original
(b)
noised input
Fig. 7. Bunny: noise σ = 15 Eavg
original
Fig. 8. Dragon: noise σ = 15 Eavg
(c)
10 iterations
(d)
20 iterations
(e)
115
1000 iterations
(original mesh courtesy of Stanford University Computer Graphics Laboratory)
noised input
50 iterations
(original mesh courtesy of Stanford University Computer Graphics Laboratory)
Synthetic Data. The first experiment, Fig. 5, studies algorithm robustness to even extreme noise (triangles even intersect each other). The plot on the right shows decreasing of average criteria Havg , Uavg demonstrating surface improvement: After 150 iterations, it is already well-faired, more iterations do not improve neither the shape nor the criteria, but still the qualitative shape is preserved. The second experiment, Fig. 6, demonstrates both robustness and stability. To an original object, noise with σ = Eavg (the average edge length) was added, which has been faired out after 100 iterations. Then, noise with σ = 12 Eavg was added again, which has been faired after 50 iterations. The final shape is very similar to the one before noise adding demonstrating the stability. We see the qualitative shape (convex/concave/hyperbolic/elliptic) has been preserved. Standard Datasets. We have selected widely-used datasets: Bunny (34.8k vertices), Dragon (437.6k vertices) and Horse (48.5k vertices). The experiment with Bunny, Fig. 7, shows the whole fairing process: to the original mesh, noise with σ= 15 Eavg was added.
116
ˇ ara, and M. Mat´yskov´a J. Kostliv´a, R. S´
original mesh
noised input
faired result: 70 iterations
Fig. 9. Horse: noise σ = 14 Eavg , best visible electronically (original data courtesy of Cyberware, Inc.)
dd
input model
dd
after 20 iterations
Fig. 10. Face fairing: details demonstrate high improvement in mesh quality
After 10 iterations, most of the noise is already faired out. After 20 iterations it corresponds almost exactly to the original one, after 1000 iterations surface details, such as bumps, disappeared, but still the qualitative shape is preserved even after so many iterations. Running-time of one iteration was about 0.85s on CPU Intel C2 2.83GHz. In comparison to [9,12], we are able to better control the fairing process and thus not to over-smooth the result. The second experiment, Fig. 8, shows our ability to preserve even fine features: To the original mesh, noise of σ = 15 Eavg was added and faired out. The result corresponds to the original model very well. One iteration lasted 2min. Unlike in [14], we were able to keep not just sharp edges, but also finer features, e.g. scales. With more iterations, we are able to smooth these features too. Closeups show improvement in both mesh smoothness and regularity as compared to the original model. The third test, Fig. 9, demonstrates both curvature and irregularity fairing: To the original mesh, bigger noise with σ = 14 Eavg was added. The resulting object has improved curvature (it corresponds to the original one) as well as irregularity, which is better than in the input (note the poor triangulation on the neck and hind leg), and even than in [15] which moreover cannot deal with noise. One iteration took about 1s. Real Data. Finally, performance on real data obtained from our stereo 3D reconstruction pipe-line [3] is shown. The first model, Fig. 10, is of 26194 vertices and 51879
Fairing of Discrete Surfaces
input
faired result
117
faired
dd
c Fig. 11. W.Wilson 3D model fairing, best visible electronically (digitised plaster model APF)
triangles. Clearly, both smoothness and mesh regularity need fairing. Only 20 iterations sufficed to achieve an acceptable result. The last experiment (Fig. 11) demonstrates our ability to cope with moderate-size data (860571 vertices and 1718800 triangles). The overall view is the faired model, the closeups show effectivity of our algorithm.
6 Conclusion We have proposed a new fairing algorithm for triangulated meshes. Two criteria, curvature and irregularity, are defined on discrete surface and jointly optimised by discrete diffusion of a scalar function. Performed experiments proved the method is robust, stable, able to cope with extreme noise, preserves qualitative shape, does not shrink the object and is well applicable to real data of 0.8M vertices and 1.7M triangles. Despite using diffusion, which is in principle a slow process, our algorithm is efficient, since only a few iterations suffice to achieve a faired surface. Fairing may remove some surface structures. This behaviour can be reduced by a vertex weighting scheme which makes individual vertices more or less rigid. This rigidity can be controlled by input image gradient, for instance. It seems parameter tuning is not necessary to achieve good results for individual datasets, as confirmed by our experiments, where we used the same setting, except for the number of iterations (which, in fact, is not a parameter since it is specified by the user depending on the required degree of fairing). Our algorithm is not applicable to fairing only, for instance it can be used for surface blending without any change. Acknowledgements. We thank anonymous reviewers for their helpful comments and suggestions contributing to final paper quality. This work has been supported by Czech Academy of Sciences project 1ET101210406, by Czech Ministry of Education project MSM6840770012 and by STINT Foundation project Dur IG2003-2 062.
118
ˇ ara, and M. Mat´yskov´a J. Kostliv´a, R. S´
References 1. Nist´er, D.: Automatic dense reconstruction from uncalibrated video sequences. PhD thesis, Royal Institute of Technology KTH, Stockholm, Sweden (2001) 2. Strecha, C., Tuytelaars, T., van Gool, L.: Dense matching of multiple wide-baseline views. In: Proceedings International Conference on Computer Vision, pp. 1194–1201 (2003) 3. Kamberov, G., et al.: 3D geometry from uncalibrated images. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Meenakshisundaram, G., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4292, pp. 802–813. Springer, Heidelberg (2006) 4. Duan, Y., Qin, H.: A novel modeling algorithm for shape recovery of unknown topology. In: Proceedings International Conference on Computer Vision (ICCV 2001), pp. 402–411 (2001) 5. Dyn, N., Hormann, K., Kim, S., Levin, D.: Optimizing 3D triangulations using discrete curvature analysis. In: Mathematical Methods for Curves and Surfaces, pp. 135–146 (2001) 6. Kobbelt, L.P.: Discrete fairing and variational subdivision for freeform surface design. The Visual Computer 16, 142–158 (2000) 7. Taubin, G.: A signal processing approach to fair surface design. In: Proceedings of the SIGGRAPH 1995, pp. 351–358 (1995) 8. Schneider, R., Kobbelt, L.: Geometric fairing of irregular meshes for free-form surface design. Computer Aided Geometric Design 18, 359–379 (2001) 9. Desbrun, M., Meyer, M., Schr¨oder, P., Barr, A.H.: Implicit fairing of irregular meshes using diffusion and curvature flow. In: Proceedings of the SIGGRAPH 1999, pp. 317–324 (1999) 10. Surazhsky, T., Magid, E., Soldea, O., Elber, G., Rivlin, E.: A comparison of gaussian and mean curvatures estimation methods on triangular meshes. In: IEEE International Conference on Robotics and Automation 2003 (ICRA 2003), pp. 1021–1026 (2003) 11. Delingette, H.: Simplex meshes: a general representation for 3D shape reconstruction. Research Report 2214, INRIA, Sophia Antipolis (1994) 12. Yagou, H., Ohtake, Y., Belyaev, A.: Mesh smoothing via mean and median filtering applied to face normals. In: Geometric Modeling and Processing, pp. 124–131 (2002) 13. Mashiko, T., Yagou, H., Wei, D., Ding, Y., Wu, G.: 3D triangle mesh smoothing via adaptive MMSE filtering. In: Int. Conf. on Computer and Information Technology, pp. 734–740 (2004) 14. Jones, T.R., Durand, F., Desbrun, M.: Non-iterative, feature-preserving mesh smoothing. In: Proceedings of the SIGGRAPH 2003, pp. 943–949 (2003) 15. Escobar, J.M., Montero, G., Montenegro, R., Rodr´ıguez, E.: An algebraic method for smoothing surface triangulations on a local parametric space. International Journal for Numerical Methods in Engineering 66, 740–760 (2006) ˇ ara, R., Mat´yskov´a, M.: Inflection point preserving fairing of discrete sur16. Kostliv´a, J., S´ faces with boundary. Research Report CTU–CMP–2008–19, Center for Machine Perception, K13133 FEE Czech Technical University, Prague, Czech Republic (2008)
Fast Decimation of Polygonal Models Muhammad Hussain Department of Computer Science, College of Computer and Information Science, King Saud University, KSA
[email protected]
Abstract. A fast greedy algorithm for automatic decimation of polygonal meshes is proposed. Two important components of an automatic decimation algorithm are: the measure of fidelity and the iterative framework for incrementally and locally simplifying a polyhedra. The proposed algorithm employs vertex-based greedy framework for incrementally simplifying a polygonal model. Exploiting the normal field of one-ring neighborhood of a vertex, a new measure of fidelity is proposed that reflects the impotence of the vertices and is used to guide the vertex-based greedy procedure. A vertex causing minimum distortion is selected for removal and it is eliminated by collapsing one of its half-edges that causes minimum geometric distortion in the mesh. The proposed algorithm is about two times faster than QSlim algorithm, which is considered to be the fastest state-of-the-art greedy algorithm that produces reliable approximations; it competes well with QSlim in terms of Hausdorff distance, and preserves visually important features in a better way.
1
Introduction
Polygonal meshes have become defacto standard for representing 3D spatial information. For simplicity, we consider only triangle meshes because any polygon can be decomposed into triangles. Due to the recent advances in scanning technologies and modeling systems like CAD/CAM, it has been possible to capture the fine detail of 3D objects; in addition, the pursuit of realism has motivated the acquisition of polygonal meshes with excessive detail. Highly complex polygonal meshes consisting of millions of triangular faces are now commonplace. They include redundant information that is really not needed by many applications and put forward rendering, storage and transmission problems because the throughput of graphics systems is not increasing as fast as the complexity of polygonal meshes. As such, in their crude form, they are less suitable for interactive graphics applications like virtual reality, video games and other multimedia applications. The solution of this problem is to get rid of redundant information and to generate its LODs (levels of detail) so that an application can employ suitable LODs in accordance with its needs and the limitations of the available resources for rendering, storage, or transmission. Simplification algorithms are at the heart of G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 119–128, 2008. c Springer-Verlag Berlin Heidelberg 2008
120
M. Hussain
the process of removing redundancy and generating LODs. Because of the gravity of this problem, a large number of researchers have been attracted towards this problem during the last decade and a large number of simplification algorithms exist in the literature, which broadly focus on geometric simplification, topology simplification, and view-dependent simplification. We concentrate only on geometric simplification because it has wide scope, and is a common denominator for other forms of simplification and our proposed algorithm also falls in this category. Most of the existing algorithms are based on error metrics that use some form of distance measure. Normal field is very important characteristic of a 3D object and plays fundamental role in its appearance. There are only a few simplification techniques that exploit normal field to control the simplification process. We employ normal field in a novel way for proposing a decimation algorithm, which we refer to as FSIMP in our onward discussion. This algorithm has better trade-off between quality and speed. FSIMP exploits normal field of the local neighborhood of a vertex to define a measure of fidelity that is employed to select greedily the vertices (instead of half-edges) of the mesh for removal. A vertex is selected for removal based on how much deviation it causes; it is removed and the resulting hole is filled so as to collapse one of the halfedges of the vertex that causes minimum geometric distortion; this geometric distortion is also measured using normal field. The remainder of this paper is organized as follows. Section 1.1 gives an overview of the most related algorithms. In Section 2, we describe measure of geometric fidelity, which is used to select vertices greedily for removal and also the one that is employed to determine the optimal half-edge corresponding to a vertex. Section 3 presents greedy decimation algorithm. In Section 4, results of FSIMP are presented and compared with the state-of-the-art algorithms like QSlim [5]. Section 5 concludes the paper. 1.1
Related Work
In this section we overview some of the most related algorithms. A large number of simplification algorithms exist in the literature. On one extreme of the spectrum of geometric simplification algorithms there lie the algorithms like those proposed in [7],[3] that create very nice approximations but they involve long running time, and on the other extreme there exist algorithms that are efficient in terms of memory consumption, and execution time but do not create good quality LODs [15]. The algorithms based on QEM (Quadric Error Metric) [5], [13] and those based on normal field like FMLOD [9] fill the middle part of the spectrum and have good trade-off between quality and speed. Memoryless simplification [13] and FMLOD have very low memory overhead as compared to QSlim; FMLOD is a bit slower in running time than QSlim but preserves visually important features in a better way. FMLOD employs normal field of the local neighborhood of a vertex to define a measure that is used to select halfedges for collapse. Other algorithms that employ normal field were proposed in [2], [16]. Broadsky and Watson [2] use normal field to partition the polygonal
Fast Decimation of Polygonal Models
121
mesh; their algorithm is very efficient in running time but does not create good quality LODs. Ramsey et al. [16] employ normal field and a threshold to select edges for collapse, but their treatment is very crude. Some of the recent proposals have presented in [14], [18]. The simplification algorithm proposed in [14] uses area-based measure of distortion and vertex removal. Though this algorithm generates good quality approximations, its time and space complexity is very high because first it performs convexity test for each vertex to decide whether it is suitable candidate or not for removal, then using area-based measure of distortion, optimal triangulation of the expected hole generated by vertex removal is determined and the corresponding amount of distortion is computed, lastly it greedily chooses a vertex, removes it; a lot of information is stored for each vertex. Recent simplification algorithms proposed by Tang et al. [18] use edge-collapse and error metrics based on surface and volume moments, which generalize the error metrics proposed in [1], [14]. The time and space complexity of these algorithms is very high, despite their contribution is not so significant. For thorough survey of decimation algorithms, please consult [4], [11], [12].
2
Measure of Geometric Fidelity
In this section, we present the details of the measures of geometric fidelity that are employed for greedily selecting vertices for removal and to decide which halfedge of a vertex is to be collapsed so that the resulted geometric distortion is minimum. A vertex v is flat if all faces in Nv (one-ring neighborhood of v) are coplanar. Intuition dictates that a vertex is less important if it is flat, and its importance increases with increasing its degree of departure from being flat. The normal field variation of two triangles ti and ti+1 across the common edge ei is defined as follows (see Figure 1(a)): 2 nti − nti+1 (s)2 ds N F D(ei ) = nti (s) − nti+1 ds + 1 2 Δti
=
1 2 Δti+1
2 1 (Δti + Δti+1 ) nti − nti+1 2
where Δti and nti are, respectively, the area and the unit normal vector of the 2 triangular face ti . Since nti is the unit normal and nti − nti+1 = (nti −nti+1 )· 2 (nti −nti+1 ), so nti − nti+1 = 2(1−nti ·nti+1 ), which is computationally more efficient and results in significant improvement in execution time. In view of this, N F D(ei ) cab be expressed as follows: N F D(ei ) = (Δti + Δti+1 )(1 − nti · nti+1 ), and so exploiting the normal field of Nv , the cost of the vertex v is defined as follows: Cost(v) = N F D(ei ) (1) ei
122
M. Hussain
Fig. 1. (a) The cost of each vertex v is calculated considering the deviation of normal field across each edge ei . (b) Vertex vo can be removed by collapsing any of the halfedges eoi , shown in red. It is removed by collapsing the half-edge - optimal half-edgethat causes minimum distortion in the normal field.
where summation is over all edges incident on the vertex v. In addition, we also consider how much deviation the local normal field of Nv undergoes relative to the original normal field of the mesh during the process of iterative simplification. Every time a vertex v is decimated, two triangles which are incident on v are removed and each remaining face incident on v is transformed. In this way, the local neighborhoods of the vertices continue to change during the iterative process. This transformation in the local neighborhoods, causes distortion in the local normal fields of the vertices. The distortion in the normal field of each face ti ∈ Nv relative to its normal field in the original mesh is measured as follows: 2 N F D(ti ) = nti (s) − ntio ds Δti 2
= Δti nti − ntio , or replacing nti − ntio 2 by (1 − nti · ntio ) N F D(ti ) = Δti (1 − nti · ntio ) where ntio is the original normal field of the face ti . Incorporating this measure into (1), the cost of vertex v is given by Cost(v) = N F D(ei ) + N F D(ti ). (2) ei
ti
It is obvious that if a vertex v is currently flat then Cost(v) = 0; Cost(v) assumes values greater than zero depending on how much the vertex v departs from being flat. This vertex cost is used to guide the greedy process. After the vertex v with minimum cost is chosen for decimation, it is removed by collapsing one of its half-edges (see Figure 1(b)) that causes the minimum distortion in Nv after its elimination- we call this half-edge as optimal half-edge. To select the optimal half-edge, we again use a measure that minimizes the normal field distortion. The collapse of the half-edge est (vs , vt ) removes two
Fast Decimation of Polygonal Models
123
faces from Nv and updates the remaining faces in Nv to have vt instead of vs .The normal field deviation caused by each updated face ti is defined as nct (s) − npt 2 ds ndev(ti ) = i i Δct
i
2 = Δcti ncti − npti . or alternatively ndev(ti ) = Δcti (1 − ncti · npti ). where Δcti , npti and ncti are, respectively, the area, the unit normal vectors before and after the half-edge collapse. This leads to the following measure for deciding whether the half-edge is optimal half-edge: N Dev(est ) = ndev(ti ). (3) ti
This measure causes to fill the hole that is generated by the removal of the vertex v by collapsing the optimal half-edge.
3
Decimation Algorithm
The proposed algorithm is based on the well-known greedy design paradigm, which has been widely used for simplification of polygonal models. The greedy rule is to remove the vertex that causes the least possible distortion. Though the proposed algorithm can be implemented using MCA [6] and RS-Heap [10] frameworks easily and will result in even better trade-off between speed and quality, we implement it using greedy paradigm because the available implementation of QSlim [17]is based on greedy framework; it provides fair basis for comparison. Exploiting normal-field-variation based cost specified by (2), vertices are ordered using vertex heap V H. The minimum cost vertex v is chosen for removal, its optimal half-edge evu = (v, u) is determined using (3) and v is removed by collapsing the optimal half-edge evu ; two faces incident on the edge evu = {v, u} are also eliminated. Then the next vertex with minimum cost is chosen and the iterative process continues until the required count of faces is met. The input to the algorithm is the original mesh M consisting of the set of vertices V , and the set of triangular faces F , which specify the geometry and topology, respectively, of M . The algorithm uses two very simple data structures F ace and V ertex to represent vertex and face objects, respectively, and two lists V L and F L for keeping vertex and face data, respectively. An instance of F ace object is representative of a triangular face and stores pointers to its three vertices, its original normal and its current normal. An instance of V ertex object stands for a vertex v and includes its geometric position ps (x, y, z), its cost Cost(v), the list of pointers to faces incident to this vertex adj f aces, and heap backlink. V ertex and F ace objects corresponding to the vertices and faces of M are created and stored in V L and F L.
124
M. Hussain
The pseudo code of the algorithm is as follows. FSIMP(Mesh M(V, F)) Input: Original triangular mesh M = (V, F ) and the target number of faces numf Output: An LOD with the given budget of faces For each vertex v ∈ V Create V ertex object and put in V L EndFor For each face f ∈ F Create F ace object and put in F L Add pointer of f to adj f aces corresponding to each of its vertices EndFor For each vertex v ∈ V Compute Cost(v) using (2) Push v into vertex heap V H EndFor While size of V F is greater than numf Pop v from V H Determine the optimal half-edge evu of v using (3) Collapse the optimal half-edge evu Update each left-over face in Nv by replacing v with u Recompute the cost of each vi ∈ Nv and update V H Remove v from V L EndWhile
4
Experimental Results and Evaluation
In this section, we present the results of our experiments performed on a number of polygonal meshes to evaluate the overall performance of FSIMP algorithm, the implementation of the proposed algorithm. For evaluation, we choose benchmark models of varying complexities: mechpart, propeller, oil-pump and Satva. For the sake of comparison, QSlim algorithm is chosen because it has better trade-off between speed and the quality of the generated LODs [12]. Though the other two state-of-the-art best algorithms - MS (memoryless simplification) [13] and FMLOD [9] - have less memory overhead than QSlim, they do have better tradeoff between speed and quality.
Fast Decimation of Polygonal Models
125
Fig. 2. (Left) Original mechpart model, #F:4998, #V:1900. Its decimated versions with 400 Faces generated by (Middle) FSIMP, and (Right) QSlim.
Fig. 3. (Top) Original oilpump model, #F:1140048, #V: 570018. Simplified versions of the model consisting of 10,000 faces created by (middle) FSIMP, and (bottom) QSlim.
4.1
Execution Time
Table 1 lists the execution times of the two algorithms to reduce the models of varying complexities to one face on a system equipped with Intel Centrino Duo 2.1GHz CPU and 2GB of main memory. It is obvious that the execution speed of FSIMP is about two times of that of QSlim. 4.2
Objective Comparison
For objective comparison of the quality of approximations created by FSIMP with those generated by QSlim, we use the well-known Hausdorff distance measure and compute it using version 2.5 of I.E.I-CNR Metro Tool, which is available in the public domain and has been widely used for this purpose. Five low level approximations of two benchmark models (mechpart and oil pump) are created using the both simplification algorithms, and the symmetric Hausdorff distance for each is computed using Metro tool. Figure 4 shows the plots of these
126
M. Hussain
Table 1. Execution time of each algorithm in seconds to decimate the triangle mesh to one face. Mesh size is the nubmfer of triangle faces in the mesh.
Mechpart Vaseloin Propeller Oil-pump Satva
#Faces
#Vertices
FSIMP
QSlim
4,998 400,000 456,362 1,140,048 3,631,628
1,900 200002 228195 570,018 1,815,770
0.047 3.172 3.906 10.573 38.547
0.078 5.625 6.594 20.828 65.922
Number of faces
1000
FSIMP QSlim
80 0 60 0 40 0 20 0
F S IM P Q S lim
15000
Number of faces
Model
12000 9000 6000 3000
0
0.05
0.1 Hausdorff Distance
0.15
0
0.002
0.004
0.006 0.008 Hausdorff Distance
0.01
0.012
0.014
Fig. 4. Plots of Hausdorff error for mechpart (left) and oil-pump (right) models
Fig. 5. (Left) Original propeller model, #F:456,362, #V:228195. Its simplified versions with 1,000 Faces created by (Middle) FSIMP and (Right) QSlim.
symmetric Hausdorff distances. A critical look at these plots reveals that FSIMP performs better than QSlim. 4.3
Subjective Comparison
For visual comparison, observe models shown in Figures 2, 5, and 3 generated by FSIMP and QSlim. It is quite apparent that decimated models generated by FSIMP are not only visually comparable with those by QSlim but in some cases preserve the salient features in a better way; observe, for example, the holes in
Fast Decimation of Polygonal Models
127
Fig. 6. (Left) Original Satva model, #F:3,6316,628, #V: 1,815,770. (right) Simplified version of the model consisting of 18,000 faces created by FSIMP.
mechpart model, the sharp edges in propeller model, these visually important features have been preserved in a better way in the simplified models generated by FSIMP. The simplified version of oil-pump model, consisting of more than one million faces, shows that FSIMP emphasizes the visually important sharp features and removes unnecessary low resolution detail. FSIMP can simplify huge models efficiently and faithfully, see for example, Figure 6.
5
Conclusion
A greedy algorithm for automatic simplification of polyhedra is proposed. The algorithm selects vertices greedily based on their cost value. Once a vertex is selected for decimation, it is removed by collapsing one of its outgoing optimal half-edge that causes minimum normal field deviation. The cost of each vertex is calculated by considering the normal field variation over its one-ring neighborhood and the normal field deviation relative to the original mesh. The proposed algorithm has better trade-off between speed and the quality of the generated approximations as compared to QSlim. Its memory overhead like MS and FMLOD is less than that of QSlim. This algorithm can be useful for real time applications.
128
M. Hussain
References 1. Alliez, P., Laurent, N., Sanson, H., Schmitt, F.: Mesh Approximation Using a Volume-based Metric. In: Proc. Pacific Graphics, pp. 292–301 (1999) 2. Brodsky, D., Watson, B.: Model simplification through refinement. In: Proceedings of Graphics Interface, pp. 221–228 (2000) 3. Ciampalini, A., Cignoni, P., Montani, C., Scopigno, R.: Multiresolution Decimation Based on Global Error. The Visual Computer 13, 228–246 (1997) 4. Cignoni, P., Montani, C., Scopigno, R.: A comparison of mesh simplification algorithms. Computer & Graphics 22(1), 37–54 (1998) 5. Garland, M., Heckbert, P.S.: Surface Simplification using Quadric Error Metric. In: Proc. SIGGRAPH 1997, pp. 209–216 (August 1997) 6. Wu, J., Kobbelt, L.: Fast mesh decimation by multiple-choice techniques. In: Proc. Vision, Modeling, and Visualization 2002, pp. 241–248 (2002) 7. Hoppe, H.: Progressive Meshes. In: Proc. SIGGRAPH 1996, pp. 99–108 (August 1996) 8. Hussain, M., Okada, Y., Niijima, K.: Fast, Simple, Feature-preserving and Memory Efficient Simplification of Triangle Meshes. International Journal of Image and Graphics 3(4), 1–18 (2003) 9. Hussain, M., Okada, Y.: LOD Modelling of Polygonal Models. Machine Graphics and Vision 14(3), 325–343 (2005) 10. Chen, H.-K., Fahn, C.-S., Tsai, J.J., Chen, R.-M., Lin, M.-B.: Generating highquality discrete LOD meshes for 3D computer games in linear time. Multimedia Systems 11(5), 480–494 (2006) 11. Luebke, D.: A Survey of Polygonal Simplification Algorithms. Technical Report TR97-045, Department of Computer Science, University of North Carolina (1997) 12. Oliver, M.K., H´elio, P.: A Comparative Evaluation of Metrics for Fast Mesh Simplification. Computer Graphics Forum 25(2), 197–210 (2006) 13. Lindstrom, P., Turk, G.: Fast and Memory Efficient Polygonal Simplification. In: Proc. IEEE Visualization 1998, pp. 279–286, 544 (October 1998) 14. Park, I., Shirani, S., Capson, D.W.: Mesh Simplification Using an Area based Distortion Measure. Journal of Mathematical Modelling and Algorithms 5, 309– 329 (2006) 15. Melax, S.A.: Simple Fast and Efficient Polygon Reduction Algorithm. Game Developer, 44–49 (November 1998) 16. Ramsey, S.D., Bertram, M., Hansen, C.: Simplification of Arbitrary Polyhedral Meshes. In: Proceedings of Computer Graphics and Imaging, vol. 4661, pp. 221–228 (2003) 17. QSlim, http://graphics.cs.uiuc.edu/∼ garland/software/qslim.html 18. Tang, H., Shu, H.Z., Dillenseger, J.L., Bao, X.D., Luo, L.M.: Moment-based Metrics for Mesh Simplification. Computers & Graphics 31(5), 710–718 (2007)
Visualizing Argument Structure Peter Sbarski1 , Tim van Gelder2 , Kim Marriott1 , Daniel Prager2, and Andy Bulka2 1 Clayton School of Information Technology, Monash University, Clayton, Victoria 3800, Australia {Peter.Sbarski,Kim.Marriott}@infotech.monash.edu.au 2 Austhink Software, Level 9, 255 Bourke St, Melbourne, Victoria 3000, Australia {tvg,dap,andy}@austhink.com
Abstract. Constructing arguments and understanding them is not easy. Visualization of argument structure has been shown to help understanding and improve critical thinking. We describe a visualization tool for understanding arguments. It utilizes a novel hi-tree based representation of the argument’s structure and provides focus based interaction techniques for visualization. We give efficient algorithms for computing these layouts.
1
Introduction
An argument is a structure of claims in inferential or evidential relationships to each other that support and/or refute the main proposition, called the conclusion [1,2]. The ability to create arguments, understand their logical structure and analyse their strengths and weaknesses is a key component of critical thinking. Skill and care is required to correctly identify the underlying propositions of an argument and the relationships between the different propositions, such as rebuttal and support. For this reason diagrammatic representations of arguments, called argument maps, have been suggested that more clearly display the evidential relationships between the propositions which make up the argument [3]. Although the earliest modern argument maps can be traced to J.H. Wigmore, who used it for complex evidential structures in legal cases [1], argument mapping is still not common. One reason for this is that revising maps sketched out with a pen on paper isn’t very practical [3]. Since the late nineties, however, specialised software for argument mapping has been developed. A recent study showed that argument mapping helped understanding arguments and enhanced critical thinking. The study also showed that the benefits were greater with computer based argument mapping [3]. However, current computer tools for argument mapping are quite unsophisticated. They provide, at best, poor automatic layout and do not utilize interactive techniques developed for other computer-based information visualization applications. In this paper we present a new computer tool for argument mapping that addresses these deficiencies. Three main technical contributions of the paper are: G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 129–138, 2008. c Springer-Verlag Berlin Heidelberg 2008
130
P. Sbarski et al.
Fig. 1. Example argument map: debating a possible interest rate increase. In fact, in October 2008, the Reserve Bank of Australia did increase interest rates during an election campaign and the incumbent political party lost office.
• A new visual representation for arguments. The key novelty in the representation is to allow propositions on the same level to be grouped into “compound nodes” which denote logical conjunction and which support multi-premise reasoning. An example is shown in Figure 1. (Described in Section 2) • Real-world arguments are often large and complex. Our tool provides focusbased contextual layouts for visualizing large argument maps. (Described in Section 3) • The final contribution of the paper are algorithms for laying out argument maps in standard and contextual layout styles. The algorithms are non-trivial extensions to algorithms for laying out standard layered trees. (In Section 4). Despite its potential usefulness, there has been relatively little work on computer-based visualization of arguments. In this section we critique the five most advanced tools that we are aware of are—Reason!Able [1], Athena Standard [4], Compendium [5], Araucaria [6], and Argunet [7]. The first difference between the tools is the basic visual representation for the argument structure. Reason!Able and Araucaria use a layered tree representation which forces the logical structure of the argument to be visible, while Athena Standard, Compendium and Argunet allow more free-form representation of arguments as networks with directed edges. Rationale uses a novel layered hitree representation which we believe better displays the logical structure of the diagram and handles multi-premise reasoning. It is fair to say that none of the existing tools provide good automatic layout. This is one of their biggest deficiencies. The two tools Reason!Able and Araucaria based on layered trees provide fully automatic layout which the user cannot control. Reason!Able uses a standard Sugiyama-based layered graph layout algorithm to repeatedly re-layout the graph after user interaction. This lead to narrow layouts but means that the user cannot control the order of a nodes’ children and successive layouts did not necessarily preserve the order of nodes in each layer, which is quite confusing for the user. Tree layout in Araucaria
Visualizing Argument Structure
131
appears to use a non-standard tree layout algorithm which creates quite uncompact layout and does not center nodes above their children Construction of arguments is difficult and requires the user to first create and load a text file with the text for the argument. It is difficult to re-arrange nodes on the canvas and the order of children in the node depends upon the construction order. The network based tools—Athena Standard, Compendium and Argunet – behave more like standard graphical editors in which the primary technique for layout is explicit user positioning of elements. Compendium and Argunet provide some simple placement tools which the user can explicitly invoke to re-layout parts of the diagram. However, these lead to overlapping nodes and generally do not seem well-integrated into the tools. Sophisticated aesthetically pleasing automatic layout is one of the great strengths of our tool and relies on the algorithms described in this paper. While fully automatic, ordering of children is totally under the user’s control and the user has the ability to control the compactness of the layout. The other main advantage of our tool over previous argument mapping tools is more powerful techniques for interactive exploration of larger argument maps. Like previous tools it provides an overview+detail view of the map, allows the user to pan+zoom the detail view, and to collapse and expand sub-arguments. But, in addition, it provides focus-based tools for emphasizing the structure of the argument around a focal node.
2
Representing Argument Structure
The first question we must answer is how to represent an argument visually. A fairly natural approach (previously used in the tools Reason!Able and Araucaria) is to represent argument maps by a tree in which nodes represent propositions, with propositions supporting and refuting their parent propositions. The root of the tree is the conclusion of the argument and the depth in the tree indicates the level of detail in the proof. A simple tree, however, is not the best structure for representing arguments. One issue is that it does not adequately represent multi-premise arguments. Typically a single proposition does not provide evidence for or against other propositions by itself, rather it does so only in conjunction with other propositions. For instance, if we have that proposition P is supported by P1 and P2 and by P3 and P4 then representing this as a tree with root P and the Pi ’s as the root’s children loses the information that it is the conjunction of the propositions that support P not the individual propositions. The second issue is that the reason for inferring a proposition from other propositions is–itself–a component of the argument and a well-designed representation should allow propositions supporting and refuting the inference itself. Thus, in our example, the reason “P1 and P2 imply P ” is also part of the argument and may need support. A straightforward approach is to modify the layered tree representation so as to use a bi-partite tree in which propositions alternate with inference rules. However, this is not a particularly compact representation and does not visually
132
P. Sbarski et al.
distinguish between propositions and rules of inference. Instead, we have chosen to use a representation for the bi-partite tree in which the child relationship is represented using containment then links on alternate levels. We call the resulting tree-like structure a hi-tree because of its similarity to a hi-graph [8]. We have not found examples of hi-trees before, although for our application they provide an elegant visual representation. An example of our hi-tree representation is shown in Figure 1. The root of the tree is the conclusion whose truth is to be determined by the argument. Propositions can have reasons supporting them. Usually these reasons are represented by a compound node that contains the propositions which are the basis for the reason, but they may also be a “basic” reason such as a quote or common belief. Notice in the example how words such as “oppose” and “support” indicate the evidential relationship of the reason to its parent proposition. This evidential relationship is also indicated using color: green for support, red for oppose and orange for rebuttals. The author may also indicate how strongly they believe a particular reason: this is shown visually by the thickness of the link between the reason and its parent proposition. Also note how the compound node represents the inference rule and can itself have supporting or refuting arguments as shown in Figure 1 where there is an argument opposing the inference that the high current CPI rate implies that the current inflation rate needs to be reduced. In our tool the standard layout for a hi-tree argument map is as a kind of layered tree. This seems like a natural and intuitive visual representation for argument maps that clearly displays their hierarchical structure since the main proposition is readily apparent at the top of the screen, with the layer indicating the depth of the argument. We considered the use of other layout conventions for trees including h-v trees, cone trees and radial trees but felt that none of these showed the structure of the argument as clearly as a layered tree representation. The standard layered drawing convention for ordered trees dictates that [9,10]: LT1. LT2. LT3. LT4. LT5. LT6. LT7.
The y-coordinate of each node corresponds to its level, Nodes on the same level are separated by a minimum gap, Each node is horizontally centered between its children, the drawing is symmetrical with respect to reflection, A subtree is drawn the same way regardless of its position in the tree, The edges do not cross, and The trees are drawn compactly.
We now discuss how we have modified this drawing convention to hi-trees. To make the discussion more precise we must formalize what a hi-tree argument map is. Like any ordered tree it consists of nodes, N = {1, ..., n}, and a function c : N → N ∗ which maps each node to the ordered sequence of nodes which are its children. No node, i, can be the child of more than one node: the node of which it is a child is called its parent and denoted by p(i). There is one distinguished node, the root node r, which has no parent. It is a bipartite tree: Nodes are partitioned into proposition nodes NP representing a single proposition, and compound nodes NC which represent reasons.
Visualizing Argument Structure
133
The root is required to be a proposition node. The children of a proposition node must be compound nodes and the children of a compound node must be proposition nodes. Every compound node has a single distinguished child that represents the proposition that the inference used in the reason is valid. This is the last child of the node. We say that the children of a compound node are its components. A basic reason, such as a quote, is modelled by a compound node which has one child with no children. While the underlying bi-partite tree of the argument map is important, so is the visual tree which models the visual representation of the hi-tree and captures the requirement that compound nodes and their components occur on the same visual level. If i is a node its visual children, vc(i), are, if i is a compound node, the children of its components and if i is a proposition its visual children are simply its children. The visual parent of a node i, vp(i), is simply the compound node of which i is the visual child. We define the visual level vl(i) of node i inductively as follows: • If i = r then vl(r) is 1; • If p(i) ∈ NP then vl(i) = 1 + vl(p(i)); • If p(i) ∈ NC then vl(i) = vl(p(i)) In the layered drawing convention for hi-trees requirement LT1 is modified to the requirement that the y-coordinate of each node corresponds to its visual level. It is natural to add a requirement to the hi-tree drawing convention that: HT7. Each compound node is drawn as compactly as possible with its component nodes separated only by a minimum gap. However, requirement LT3 that each node is placed midway between its children conflicts with this requirement. The problem is that forcing component nodes to be placed next to each other means that if each component is placed centrally between its children then the sub-trees may overlap. The problem and two possible solutions are shown in Figure 2. One solution would be to weaken requirement HT7 and allow components of a compound node to be separated by more than a minimum gap. Another possible solution is to weaken requirement LT3 and only require that component nodes are placed as close as possible to the midpoint of their children. This is the solution we have chosen. Thus, we define the standard layered drawing convention for hi-trees to be: HT1. The y-coordinate of each node corresponds to its visual level, HT2. Compound nodes on the same visual level are separated by a minimum gap1 , HT3. The root is midway between its children and each compound node is placed so that each of its components is, as far as possible, midway between its children, HT4. The drawing is symmetrical with respect to reflection, HT5. The edges do not cross, 1
Which in our tool is inversely proportional to how closely the nodes are related.
134
P. Sbarski et al.
P1
P2
(a)
P3
P1
P2
P3
(b)
P1
P2
P3
(c)
Fig. 2. Requirement LT3 that each node is placed midway between its children may conflict with requirement HT7 that components of a compound node be placed tightly togther. As shown in(a), the conflict between requirements LT3 and HT7 can lead to overlapping nodes. One solution, shown in (b), is to weaken requirement HT7. However this leads to larger compound nodes which may exaggerate their importance and can lead to clumping of components. In the example clumping suggests that P2 and P3 are somehow more connected or similar to each other than to P1 which is misleading. Another solution, shown in (c), is to weaken requirement LT3–this is the solution we have chosen.
HT6. The trees are drawn compactly, and HT7. Compound nodes are drawn as compactly as possible with component nodes separated only by a minimum gap. The argument map in Figure 1 illustrates this drawing convention.
3
Handling Large Argument Maps
Real-world argument maps can be quite large with up to a hundred nodes. While this may not sound very large, typical argument maps of this size will not fit on to a single screen or print on a single page using legible fonts because of the large amount of text in the nodes. Thus, a key requirement for any argument mapping tool are techniques to handle larger maps. Interactive techniques have been extensively studied for visualization of large networks and hierarchical structures and proven very effective, e.g. [11,12,13]. It is therefore natural to consider their use in argument mapping. Our tool provides two standard visualization techniques to allow the user to focus on the parts of the map that are of most interest. First, the argument mapping tool provides an overview and detailed view of the argument map. The detailed view can be zoomed and panned. Second, the tool allows the user to control the level of detail in the argument map by collapsing/expanding the subargument under a compound or proposition node. These techniques have also been used in some earlier argument mapping tools. More interestingly, the tool provides a novel focus-node based visualization for argument maps. It allows the user to select a node and choose a “contextual layout” tool which modifies the layout so as to better show the argument structure around that node. Contextual layout tools modify the drawing convention to require that a context of nodes around the focus node are laid out as nicely as possible so as to emphasize their relationship with the focus node, and nodes not in the context are moved slightly away form the contextual nodes and
Visualizing Argument Structure
135
Fig. 3. Screenshot of the tool displaying an example of “ancestors” contextual layout applied to the Interest Rate argument map from Figure 1
optionally faded. Focus nodes can be propositional nodes or compound nodes. While contextual layout bears some similarities to focus based navigation of large networks [13] the details appear to be novel. Contextual layout has a number of variations which differ in the choice of context: “show ancestors”, “show parent”, “show family” and so on. As an example, in “show ancestors” layout, when a node is selected its parent and other ancestors form the context nodes. They are re-positioned directly above it thus showing all ancestors in a direct line. All other nodes are pushed away from this particular “path”. This type of layout helps to understand the flow of logical reasoning from the selected premise or reason to the final claim, see Figure 3. Another example is “show family”. In this the context comprises all nodes immediately related to the focus node, i.e, parent, siblings and children. The layout centers the parent directly above the focus node, siblings are brought in as close as possible to the focus node and the focus node’s children are grouped directly beneath it.
4
Layout Algorithms
In this section we detail the algorithms we have used to provide the standard layered hi-tree layout and various contextual layouts discussed in the last section. 4.1
Layered Tree Layout
Our layout algorithms are based on those developed for drawing trees using the standard layered drawing convention. Wetherell and Shannon gave the first linear time algorithm for drawing of layered binary trees [14] and Reingold and
136
P. Sbarski et al.
Tilford improved this to meet the standard layered drawing convention while still maintaining linear complexity [15]. The Reingold-Tilford algorithm uses a divide and conquer strategy to lay out binary trees. It recursively lays out the subtrees bottom-up and then places them together with a minimum separation between them. The algorithm keeps for each subtree the left contour, i.e. a linked list of the leftmost node for each level in the subtree, and the right contour of the tree. These are traversed rather than the whole subtree to determine how close the subtrees can be placed together. Walker [16] gave a modification of the Reingold-Tilford Algorithm for n-ary tree layout but has quadratic worst case complexity. Buchheim, J¨ unger and Leipert [17] improved Walker’s algorithm to have linear time. In practice, however Walker’s algorithm is very fast and it and the Reingold-Tilford Algorithm are widely used for tree layout. 4.2
Standard Hi-Tree Layout
It is possible to modify the Walker Algorithm (and the other tree layout algorithms) to handle hi-tree layout using the layered drawing convention. The basic algorithm remains the same: recursively lay out the sub-trees bottom-up and then place them together with a minimum separation between the children. Again for efficiency, subtree frontiers and relative positioning of children to their parent and predecessor in the frontier is used. The key question is how a compound node u is positioned relative to its visual children. Assume u has component nodes v1 , ..., vk . By the time u is reached the children of each vi have been laid out and their frontier computed. There are three steps in computing the layout for the tree rooted at u. 1. For each vi , the tree Ti with root vi is computed using the standard Walker Algorithm: the children of vi are placed as close together as possible (spacing out the enclosed sub-trees where necessary) and vi is placed on top of the tree midway between its children. 2. Now the Ti s are placed together as closely as possible so as to compute the minimum gap gi required between vi and vi+1 to avoid Ti and Ti+1 overlapping 3. We arbitrarily place node u at the x-position ux = 0. This also fixes the position ci for the component in u corresponding to vi . We wish to find the k position vix for each vi which minimizes i=1 wi (ci −vix )2 subject to ensuring x that for all i, vix +gi ≤ vi+1 where the weighting factor wi for each component is some fixed value. For standard layout we have found that setting wi to the width of the top layer of Ti works well. We use the procedure optimal layout given in [18] to solve this simple constrained optimization problem. The complexity of the hi-tree layout algorithm remains the same as the complexity of the original Walker Algorithm. The only possible source of additional complexity is the call to optimal layout with the k children of node u. The algorithm has linear complexity in the number of variables passed to it. Since a variable is only passed once, the overall complexity of calling optimal layout is linear.
Visualizing Argument Structure
4.3
137
Contextual Layouts
Handling contextual layout appears more difficult. However, a nice property of the layout algorithm is that it is parametric in the minimum gap allowed between adjacent nodes and in the weight used to enforce placement of a compound node component near the corresponding sub-tree. By appropriately choosing these it is straightforward to extend them to handle contextual layout. 4.4
Algorithm Evaluation
To evaluate the efficiency of our algorithm we laid out 40 random hi-trees, with 50 to 2000 nodes. We measured layout time for standard layout. See figure 4 for the results. Our results verify our experience with our tool that all of the layout algorithms are more than fast enough to handle layout of even very large argument maps with 1000 nodes in less than 1 second.
Fig. 4. Performance results. The horizontal axis gives the number of nodes and the vertical axis gives the timing of the layout in seconds.
All experiments were run on a 1.83 GHz Intel Centrino with 1GB of RAM. The algorithms were implemented using Microsoft Visual C# Compiler version 8.00.50727.42 for Microsoft .NET Framework version 2.0.50727 under the Windows XP Service Pack 2 operating system. We used the original Walker algorithm rather than the linear version: as the results show performance of the original algorithm is very fast.
5
Conclusion
We have described a new application of information visualization argument mapping. Critical analysis of arguments is a vital skill in many professions and a necessity in an increasingly complex world. Despite its importance, there has been
138
P. Sbarski et al.
comparatively little research into visualization tools to help with understanding of complex arguments. We have described and motivated the design of a particular representation of the argument structure based on a new kind of diagram we call a hi-tree. We have then described basic interactions for visualizing argument maps, such as showing context. Finally, we have given novel algorithms for standard and contextual layout of hi-trees. The algorithms and techniques described here have been extensively tested and evaluated. They underpin a commercial argument mapping tool called “Rationale”2 . This program supports authoring and visualization of hi-tree based argument maps as well as a simpler standard tree based representation. The illustrations in this paper are from this tool.
References 1. van Gelder, T.J.: Argument mapping with reason!able. The American Philosophical Association Newsletter on Philosophy and Computers, 85–90 (2002) 2. van Gelder, T.J.: Visualizing Argumentation: software tools for collaborative and educational sense-making. Springer, Heidelberg (2002) 3. Twardy, C.: Argument maps improve critical thinking. Teaching Philosophy (2004) 4. Athena (2002), http://www.athenasoft.org/ 5. Compendium (2007), compendium.open.ac.uk/institute/ 6. Araucaria (2006), http://araucaria.computing.dundee.ac.uk/ 7. Argunet (2008), http://www.argunet.org 8. Harel, D.: On visual formalisms. In: Glasgow, J., Narayanan, N.H., Chandrasekaran, B. (eds.) Diagrammatic Reasoning, pp. 235–271. MIT Press, Cambridge (1995) 9. Br¨ uggermann-Klein, A., Wood, D.: Drawing trees nicely with tex. Electronic Publishing 2(2), 101–115 (1989) 10. Kennedy, A.J.: Drawing trees. Functional Programming 6(3), 527–534 (1996) 11. Plaisant, C., Grosjean, J., Bederson, B.B.: Spacetree: Supporting exploration in large node link tree, design evolution and empirical evaluation. In: Proceedings of the IEEE Symposium on Information Visualization, Washington, DC, USA, p. 57. IEEE Computer Society, Los Alamitos (2002) 12. Kumar, H., Plaisant, C., Shneiderman, B.: Browsing hierarchical data with multilevel dynamic queries and pruning. Technical Report UMCP-CSD CS-TR-3474, College Park, Maryland 20742, U.S.A (1995) 13. Huang, M.L., Eades, P., Cohen, R.F.: Webofdav navigating and visualizing the web on-line with animated context swapping. Comput. Netw. ISDN Syst. 30(1-7), 638–642 (1998) 14. Wetherell, C., Shannon, A.: Tidy drawings of trees. IEEE Transactions on Software Engineering 5(5), 514–520 (1979) 15. Reingold, E.M., Tilford, J.S.: Tidier drawings of trees. IEEE Transactions on Software Engineering 7(2), 223–228 (1981) 16. Walker, J.Q.: A node-positioning algorithm for general trees. Software Practice and Experience 20(7), 685–705 (1990) 17. Christoph Buchheim, M.J., Leipert, S.: Improving Walker’s algorithm to run in linear time. In: Graph Drawing, pp. 344–353 (2002) 18. Marriott, K., Moulder, P., Hope, L., Twardy, C.: Layout of Bayesian networks. In: Australasian Computer Science Conference, vol. 38 (2005) 2
http://rationale.austhink.com/
Visualization of Industrial Structures with Implicit GPU Primitives Rodrigo de Toledo1 and Bruno Levy2 1
Petrobras – CENPES/PDP/GR, Brazil 2 INRIA – ALICE, France
Abstract. We present a method to interactively visualize large industrial models by replacing most triangles with implicit GPU primitives: cylinders, cone and torus slices. After a reverse-engineering process that recovers these primitives from triangle meshes, we encode their implicit parameters in a texture that is sent to the GPU. In rendering time, the implicit primitives are visualized seamlessly with other triangles in the scene. The method was tested on two massive industrial models, achieving better performance and image quality while reducing memory use.
1
Introduction
Large industrial models are mainly composed of multiple sequences of pipes, tubes and other technical networks. Most of these objects are combinations of simple primitives, e.g., cylinders, cones, tori and planes. Each CAD application has its own internal data format and different ways to manipulate the objects. However, the most common way to export data is through a triangular mesh (triangles are currently the lingua franca in all 3D software). Representing simple primitives by tessellated approximations results in an excessive number of triangles for these massive models. Interactive visualization is only possible when applying advanced and complex algorithms such as Far Voxels [1]. In this paper we propose a hybrid approach, using GPU ray-casting primitives [2], in combination with rasterized triangles, for visualizing industrial models. By substituting GPU primitives (cones, cylinders and torus slices) for most of the original triangles, we can achieve several improvements: quality enhancement (smooth silhouettes, per-pixel depth and shading, and continuity between pipe primitives, see Figure 1); faster rendering speed (CPU/GPU transference reduction and balancing between vertex and fragment pipelines); and Memory reduction (we only keep implicit data such as radius, height and position, instead of storing vertices, normals and topology of tessellated models).
2
Related Work
The idea of replacing meshes with other representations for the purpose of interactive visualization has been widely use. Both billboard and nailboard, which is a billboard that includes per-pixel depth information, have been explored in G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 139–150, 2008. c Springer-Verlag Berlin Heidelberg 2008
140
R. de Toledo and B. Levy
(a)
(a)
(b) (c)
(b) (c)
Polygons
GPU primitives
Fig. 1. An industrial piece in PowerPlant (section 13). Left: Original tessellated data. Right: Rendering with implicit GPU primitives. Image quality enhancement: (a) silhouette roundness; (b) precise intersection; (c) continuity between consecutive primitives.
recent GPUs (in the CG tutorial [3] they are called depth sprites). More complex solutions have been proposed, such as billboard clouds [4] and relief texture mapping [5]. In massive model visualization the concept of texture depth meshes was used to obtain interactive rendering [6,7]. Despite being fast, this method presents some visible regions where the mesh stretches into skins to cover missing geometric information. Another issue is storage space (original 482MB PowerPlant uses 10GB of memory). Recently, a method that has achieved good results for massive models visualization was Far Voxels [1]. It is based on the idea of scene partitioning, grouped into a tree by volumetric clusters. Leaf nodes keep the original triangles and inner nodes have a volumetric grid representation. In rendering time, the grid’s voxels are splatted, respecting a maximal screen area. The negative points are: very expensive preprocessing; significant memory use (70MB per million vertices); and an overly complex visualization system (including LOD, occlusion culling and out-of-core data management). In an attempt to accelerate the rendering process, most aforementioned algorithms have as side effects reducing quality and/or increasing memory space. As opposed to them, our solution acheives better performance while enhancing quality and reducing memory. Ray Tracing on GPU. Since the beginning of programmable GPUs, ray tracing is a target application that explores their parallelism [8,9,10]. The z-buffer algorithm is not used to determine visible surfaces because ray tracing already discovers, for each pixel, the closest intersected object given the viewing ray. Purcell et al. [9] broke the ray-tracing algorithm into kernels that access the entire scene data, which is mapped on textures. The idea of encoding the scene in texture was followed by other methods [11,12,13]. All these methods suffer from GPU memory limitation. They require space to represent geometry and hierarchical structures, such as kD-trees or bounding volumes. In a recent work, Gunther et al. [13] succeeded in loading the PowerPlant and rendering it at 3 fps.
Visualization of Industrial Structures
141
We also have encoded geometry in texture. However, in contrast with explicitly representing triangles, we record the parameters of implicit surfaces in texture. Thus, we do not have memory space problems (see Section 6.3). Extended GPU Primitives. The concept of extended GPU primitives was first introduced by Toledo and Levy [2]. They have created a framework to render quadrics on GPU without tessellation. GPU primitives are visualized through a ray-casting algorithm implemented on fragment shaders. The rasterization of a simple proxy triggers the fragment algorithm. To keep GPU primitives compatible with other surfaces, the visibility issue between objects is solved by the z-buffer. Some recent work have concentrated in ray casting cubics and quartics [14,15].
3
Topological Reverse Engineering
There are several reverse engineering algorithms for scanned data, composed by a dense set of points laid down on objects surfaces. However, the data we are dealing with present a different situation, since CAD models are generated without any scanning or capturing step. The main differences are: CAD data have sparser samples; the vertices are regularly positioned (spatially and topologically); and it is partially segmented. Toledo et al. [16] has shown that numerical reverse engineering is not the best choice for CAD triangular meshes. Treating sparse data is a weakness for optimization algorithms [17,18,19]. On the other hand, it is possible to explore the topological information to reconstruct original implicit data. The topological algorithm is exclusively designed to segment tubes composed of cylinders, truncated cones and torus slices (“elbow junctions”). After applying this reverseengineering algorithm on triangular meshes, the result is a set of higher-order primitives (see Figure 2) plus a set of triangles for the unrecovered objects. Two data sets, PowerPlant and P40 Oil Platform, were used for tests (see Table 1). Table 1. Reverse engineering results with PowerPlant and P40 Oil Platform Data set triangles (Δ) cylinders cones tori unrecovered Δ effectiveness PowerPlant 12,742,978 117,863 2,150 82,359 1,251,019 90.18% Oil Platform 27,320,034 215,705 40,001 85,707 2,932,177 89.26%
4
Storing Implicit Parameters on GPU Memory
In our application, we charge GPU memory with the parameters of the implicit shapes in pre-processing. The goal is to avoid sending this information at each frame to increase frame rate (see results in Section 6). The drawback is that primitive parameters are difficult to modify on the fly. Before the first rendering frame, we include information from all primitives in a floating-point texture,
142
R. de Toledo and B. Levy r1
r
r1
ma ax i n is
main axis
B
r2
h O
Į
revolution
h C
O
r2
Fig. 2. Recovered primitives: cylinder, cone section and torus slice. Grouping all implicit information in a floating-point texture (each texel has four floats).
loading it into video memory. Each texel contains four floating-point scalars. We store the following information for each primitive: cone. (8 scalars in 2 texels): origin; main axis scaled by height; two radii. cylinder. (8 scalars in 2 texels): origin; main axis; radius and height. torus. (12 scalars in 3 texels): center; revolution vector; center-to-begin vector; slice angle; two radii. In texture, the implicit information is grouped by primitive type (Figure 2). We have tested both random and sequential accesses and they do not produce any difference in performance (in NVidia 7900 graphics card)1 . In rendering time, we draw the primitives by type to avoid too much shader switching. Recent graphics cards enable the use of floating-point textures that can be read in vertex shaders2. In our case, this strategy is better than using vertex buffers because several vertices share the same parameter values. Using a vertex texture makes it possible to represent this level of indirection. This procedure resulted in doubling the speed of our primitives (see Section 6.2).
5
Rendering Implicit Primitives on GPU
We have implemented some high-performance GPU primitives aiming at visualization of industrial models. We have developed two specific quadrics extending Toledo’s et al. work [2]: cones (Subsection 5.1) and cylinders (Subsection 5.2). Based on [15], we have created a novel torus slice primitive (Subsection 5.3). Each surface uses a different fragment shader that executes its ray casting. To trigger fragment execution we rasterize the faces of a specific proxy for each primitive type: hexahedron for cones; quadrilateral for cylinders; and an adapted polyhedron for tori. To accelerate ray casting, implicit surfaces are locally described in canonical positions. Fragments that do not intersect the surface are discarded, otherwise the fragment shader computes the z-value (or depth). The coordinates of proxy vertices encode (u, v) texture coordinate rather than a 3D positional info. For example, we use glRect(-u,-v,u,v) to call the quadrilateral rasterization for one cylinder. The vertex shader reads implicit information from the texture starting in the position (u, v). 1 2
Horn et al. [11] indicate that, in their case, coherent access was better than random. We use the RECT texture target without filtering.
Visualization of Industrial Structures
143
Fig. 3. The front faces of the cone’s bounding-box are used to trigger the fragment shader responsible for rendering the cone. A billboard is the best solution for a cylinder without caps. For torus slices, we use an adapted bounding polyhedron with 14 vertices. Each vertex has a parameterized coordinate representing base/top, external/internal and partial angle (1 is the complete slice angle, in this case, 90 degrees).
5.1
Cone Sections
The ray casting executed inside the hexahedron uses a local coordinate system. In this system, the base of the cone is zero-centered and it has unit radius and unit total height (distance between the base and the apex). A very simple fragmentshader computes the intersection between a ray and this canonical cone. The fragment shader is also responsible for drawing the cone caps. Note that complete cones (those that include an apex point) are more efficintly rendered by using a pyramid as their bounding-box (see Figure 3). 5.2
Cylinders
For cylinders a quadrilateral billboard is the best choice to trigger their fragment shader. It uses only four vertices and it has a very tight projection enclosing the cylinder, which reduces the discarded fragments. The vertex shader computes the vertex position based on the cylinder’s dimensions, in such a way that the 4 vertices always form a perfect convex hull for the cylinder’s visible body (Figure 4b). The local coordinate system can be deduced directly from the cylinder’s main direction and the computed convex hull axes directions. This is a kind of view-dependent coordinate system, where one of the axes is fixed relatively to the world (the cylinder’s main axis) and the other axes depend on the viewing direction. Finally, the unique per-vertex information required is the relative position in the convex hull (front/back and left/right). This information is implicitly given by negative/positive values combined with (u, v) texture coordinates. View-dependent Coordinate System. Generally, in a cylinder’s local coordinate system, z coincides with the main axis while x and y are in a plane perpendicular to z, observing the right-hand rule. In our View-dependent Coordinate System, we impose one more restriction: x must be perpendicular to the viewing direction v. z=
(C1 − C0 ) , | (C1 − C0 ) |
x=v×z ,
y =z×x
144
R. de Toledo and B. Levy P0 P2 '
C0 z x
(a)
''
y
(b)
P1
C1
''
C
B P3
'
á ''
''
á
O'
(c)
Fig. 4. (a) View-dependent Coordinate System. (b) Billboard in perspective covering the cylinder’s body. (c) We reduce the geometric problem to R2 (O is the observer projected onto the plane defined by the cylinder’s main axis). We can compute the vertices position based on cylinder radius r, distance |O C| and unit vectors x and y.
The vector v is a normalized vector pointing to the observer (as if it was perpendicular to the sheet of paper for the reader in Figure 4a). Note that y and z are aligned when projected to the screen, although perpendicular in R3 . To compute the final position of the vertices P0 , P1 , P2 , P3 , we must consider the viewer position and perspective distortion. As indicated on Figure 4, we can compute these positions as follows: P0 = C0 + y + x , P1 = C0 + y − x P2 = C1 + y + x , P3 = C1 + y − x , where x = x · r · cos α, x = x · (|O C| − r) · tan α, r y = y · r,y = y · r · sin α, and α = arcsin . |O C| Thickness Control is a special feature developed for the GPU cylinders. In the vertex shader we guarantee that vectors x and x have a minimum size of 1 pixel on the screen. This avoids the dashed line aspect of thin and high cylinder rendering, which may occur when the viewer is far away. Cylinders’ Caps. Similarly to cones, it is straightforward to compute the intersection between a ray and the canonical cylinder’s body. However, to render the cylinder’s caps, there are three possible strategies to attach them: – Extending the billboard in directions P2 P0 and P3 P1 (Figure 5a). – Using a partial 3D bounding box, with 2 faces and 6 vertices (Figure 5b). – Using a second billboard showing the cap turned to the viewer (Figure 5d). In the first option, note that P2 P0 and P3 P1 directions are divergent and extending them may include too many fragments that will be discarded at the end. On the other hand, the second and third options have the drawback of increasing the total number of vertices. In our implementation we have chosen the last option, because it is the one that better reduces the number of discarded fragments. The vertex shader of
Visualization of Industrial Structures
(a)
(b)
145
(c)
Fig. 5. Three different ways to render the GPU cylinder with caps: (a) extending the billboard used for the body; (b) using two faces and six vertices of a bounding box; or (c) rendering two separated billboards, one for the body and another one for the caps
this extra GPU primitive uses the same parameters stored on the floating-point texture. Notice that only one extra primitive is used to render both caps, because they are never visible together in the same frame. Finally, in our industrial model visualization, most cylinders do not have caps because they are part of a pipe sequence. Thus, the extra billboard is used only for few ones. 5.3
Torus Slices
In industrial structures, torus shapes appear in junctions, chains and CAD patterns. Actually, in most cases, the objects contain only a slice of a torus. We use a bounding box well adapted to its form rather than a simple hexahedron. This polyhedron was chosen to reduce pixel wasting by better fitting the slice. At the same time, it is possible to surround torus slices from small angles up to 180 degrees without a significant vertex cost. We have implemented a special vertex shader to automatically locate vertices around the torus slice based on its parameterization. Once more, (u, v) texture coordinates are associated with positive/negative values, determining internal/external and top/base position. The third coordinate represents the relative angle in the interval [0, 1]. The vertex shader fetches torus parameters and computes vertex world and local coordinates. We use an iterative algorithm for ray casting tori, the Newton-Raphson root finder presented by Toledo et al. [15]. They have compared several solutions, considering this one the fastest solution for ray casting torus on GPU.
6 6.1
Results Enhancing Image Quality
There are five image quality improvements obtained with our method. Smooth silhouette. The tessellation of curved surfaces impedes the rendering of continuously smooth silhouette, since it is restricted by mesh discretization. The GPU primitives are computed by pixel, therefore the silhouettes are always smooth, even after a huge zoom. See Figure 1(a). Intersections (Per-pixel z-computation). In conventional triangle rasterization, the z-buffer depth of each pixel is the result of interpolation of per-vertex
146
R. de Toledo and B. Levy
(b)
(a)
Fig. 6. (a) PowerPlant and P40 Oil-Platform models used on performance tests. (b) PowerPlant Section 1, about 60,000 primitives recovered from 3,500,000 triangles. The thickness control prevents undesired visible patterns when cylinders are seen from far.
depth. On the other hand, the GPU primitives compute per-pixel depth, which is much more accurate. When two or more GPU primitives intersect, the boundary has a correct shape due to this precise visibility decision (Figure 1(b)). Continuity between pipe primitives. In industrial plant models, the pipes are a long sequence of primitives. With the tessellated solution, there is a risk of cracks appearing between primitives. To avoid them, the tessellation should keep the same resolution and the same alignment for consecutive primitives. In Figure 1(c), a misalignment has caused cracks. This kind of undesirable situation is also a problem in applications using LOD in pipes (see [20]), whereas with GPU primitives continuity is natural since there is no discretization. Per-pixel shading. Our primitives compute shading by pixel (Phong shading), enhancing image quality if compared to default Gouraud shading. Cylinder thickness control. The thickness control adopted in the vertex shader of GPU cylinders avoids dashed rendering when zooming out of a thin and high cylinder. Moreover, when this control is associated with color computation based on the normal average, it significantly reduces the aliasing pattern effect for a group of parallel cylinders (see Figure 6b). 6.2
Performance
We have done two sets of performance measures targeting different comparisons. The first set uses the PowerPlant and compares GPU primitives with tessellated data of the original model. The second set uses P40 model to compare the visualization of GPU primitives with different levels of mesh tessellation. All tests were done on an AMD Athlon 2.41GHz, 2GB memory, with GeForce 7900 GTX512MB. We render the models entirely in frustum view, fulfilling a 1024 × 768 screen. In this scenario, the bottleneck is not in the fragment shader because primitives are not so large on the screen. Important Note. All the rendering tests were executed without any culling technique, which would surely speed up all the frame rates. We have intentionally
Visualization of Industrial Structures
147
Table 2. PowerPlant performance test. In first block, Sections 1, 15, 19 and 20 do not have unrecovered triangle (UT), in other words, triangles were 100% converted into primitives by the reverse engineering. Frames per second (FPS) measured without v-sync. In second block, Section 12 and the entire PowerPlant have some UT. Sec.
1 15 19 20 12 all
FPS Other GPU prim. VBO information isolate group multiple single #triangles #UT 48.5 93.0 51.0 2.9 3,429,528 0 138.5 268.5 177.5 165.5 1,141,240 0 62.5 128.0 73.5 3.8 2,650,680 0 66.5 130.5 66.9 4.1 2,415,976 0 group +UT all only UT #triangles #UT 911 719 524 2935 360,872 38,067 27.8 21 12.9 81 12,742,978 1,251,019
#prim. 57,938 21,839 42,046 41,872 #prim. 6,205 202,372
presented the results in this way to avoid covering up speed results when comparing GPU primitives and exclusively-rasterized polygons. PowerPlant. PowerPlant sections 1, 15, 19 and 20, used in Table 2, were chosen because they were 100% recovered by the reverse engineering algorithm. This way, we can directly compare rasterization and GPU primitive techniques. We have created two situations for GPU primitives: isolated (without using the floating-point texture, but passing their implicit information at each frame) and grouped. The latter is twice as fast as the isolated solution. We also have compared two rasterization strategies based on VBO (Vertex Buffer Object): decomposing the model in multiple VBOs and using a single VBO. Decomposing gives better results for large models. Comparing the fastest solutions for GPU primitives and for triangle rasterization (second and third columns in Table 2), grouped GPU primitives are clearly faster, almost doubling the speed with multiple VBO implementation. A different situation is presented on the two-last lines of Table 2, where some of the original triangles (10%) were not converted to implicit primitives (e.g., walls). We compare hybrid rendering (GPU primitives + UT, second column) with the best VBO rasterization rendering (third column). The hybrid solution is between 40% and 60% faster than rasterization. Oil Platform - P40. In this set of measures we have compared the visualization of GPU primitives with mesh rasterization with different levels of tessellation. The tests were done with 4, 8 and 12-sided meshes (see Figure 7). We were not able to load a 16-sided mesh version because of memory restriction. As can be seen in Figure 7, the GPU primitives achieve a much better image quality than any tessellated solution. The performance of GPU primitives is comparable to the 8-sided solution (see Table 3).
148
R. de Toledo and B. Levy
(4-sided mesh)
(8-sided mesh)
(12-sided mesh)
(GPU primitives)
Fig. 7. Comparing GPU primitives and tessellations for the P40 model
Table 3. Performance results for oil-platform P40. (*) 341,413 primitives: 215,705 cylinders, 40,001 cones and 85,707 tori. Method 4-sided mesh 8-sided mesh 12-sided mesh GPU primitives
FPS Memory Triangles 40fps 283MB 4,475,852 21fps 464MB 10,559,474 13fps 740MB 18,394,158 22fps 89MB (*)
Table 4. Memory space comparison Model Triangles Initial space Conversion Primitives space Final space PP Sec. 1 3,429,528 119MB 100% 1.9MB 1.9MB PPlant 12,742,978 482MB 90.18% 6.5MB 79.3MB P40 27,320,034 1033MB 89.26% 10.25MB 121.19MB
6.3
Memory Use
In Table 4 we summarize memory-space information before and after recovering implicit surfaces. The topological recovery procedure converts 90% of the original data of industrial models (PowerPlant section 1 is an exception because it is only composed by tubes and pipes resulting in 100% of conversion rate). The memoryspace of recovered data is reduced to at most 2% (98% of reduction). This is a consequence of compact implicit representation used for recovered primitives. If we consider the remaining 10% of unrecovered triangles, industrial models such as oil platforms and power plants can be stored in about 15% of their original data size. The size of floating-point textures on recent graphics cards is limited to 4096 × 4096 (16M texels with 4 floating-points), using 256MB. Our biggest example, P40 model, only uses 4% of this space. Note that it is also possible to use multiple textures, which would push further the scene-size limits.
Visualization of Industrial Structures
7
149
Conclusion
The use of GPU primitives for CAD and industrial models is very promising. The benefits are classified in three categories: image quality (e.g., perfect silhouette and per-pixel depth), memory and rendering efficiency. Grouping primitive information in a GPU texture has proven to be a good strategy for performance purposes without any memory space problem. As future work, we suggest the application of well-known acceleration techniques that are usually applied on conventional visualization of massive models: frustum culling, occlusion culling, coherent memory cache, and so on. Since GPU primitives update the Z-buffer, special techniques, which are usually applied to triangles, can also be combined with our primitives. Two interesting future work are shadow maps application and occlusion culling based on queries. Most implementations of GPU ray tracing have adopted triangles as their only primitive. We suggest the use of implicit primitives, which would probably reduce memory use, which is a critical issue for them. We also recommend the implementation of other surfaces found in industrial plant that were not covered in our work (e.g., sheared cylinders and half spheres).
References 1. Gobbetti, E., Marton, F.: Far voxels: a multiresolution framework for interactive rendering of huge complex 3d models on commodity graphics platforms. ACM Trans. Graph. 24, 878–885 (2005) 2. Toledo, R., Levy, B.: Extending the graphic pipeline with new gpu-accelerated primitives. In: 24th International gOcad Meeting, Nancy, France (2004); also presented in Visgraf Seminar 2004, IMPA, Rio de Janeiro, Brazil 3. Fernando, R., Kilgard, M.J.: The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Addison-Wesley Longman Publishing Co., Inc., Boston (2003) 4. D´ecoret, X., Durand, F., Sillion, F.X., Dorsey, J.: Billboard clouds for extreme model simplification. ACM Trans. Graph. 22, 689–696 (2003) 5. Oliveira, M.M., Bishop, G., McAllister, D.: Relief texture mapping. In: Proceedings of ACM SIGGRAPH 2000. Computer Graphics Proceedings, Annual Conference Series, pp. 359–368 (2000) 6. Aliaga, D., Cohen, J., Wilson, A., Baker, E., Zhang, H., Erikson, C., Hoff, K., Hudson, T., Stuerzlinger, W., Bastos, R., Whitton, M., Brooks, F., Manocha, D.: Mmr: an interactive massive model rendering system using geometric and image-based acceleration. In: SI3D 1999: Proceedings of the 1999 symposium on Interactive 3D graphics, pp. 199–206. ACM Press, New York (1999) 7. Wilson, A., Manocha, D.: Simplifying complex environments using incremental textured depth meshes. ACM Trans. Graph. 22, 678–688 (2003) 8. Carr, N.A., Hall, J.D., Hart, J.C.: The ray engine. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, Eurographics Association, pp. 37–46 (2002) 9. Purcell, T.J., Buck, I., Mark, W.R., Hanrahan, P.: Ray tracing on programmable graphics hardware. ACM Transactions on Graphics 21, 703–712 (2002); ISSN 07300301 (Proceedings of ACM SIGGRAPH 2002)
150
R. de Toledo and B. Levy
10. Wald, I., Purcell, T.J., Schmittler, J., Benthin, C., Slusallek, P.: Realtime Ray Tracing and its use for Interactive Global Illumination. In: Eurographics State of the Art Reports (2003) 11. Horn, D.R., Sugerman, J., Houston, M., Hanrahan, P.: Interactive k-d tree gpu raytracing. In: I3D 2007: Proceedings of the 2007 symposium on Interactive 3D graphics and games, pp. 167–174. ACM Press, New York (2007) 12. Popov, S., G¨ unther, J., Seidel, H.P., Slusallek, P.: Stackless kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum 26, 415–424 (2007); (Proceedings of Eurographics) 13. G¨ unther, J., Popov, S., Seidel, H.P., Slusallek, P.: Realtime ray tracing on gpu with bvh-based packet traversal. In: Keller, A., Christensen, P. (eds.) IEEE/Eurographics Symposium on Interactive Ray Tracing, Ulm, Germany. IEEE, Los Alamitos (2007) 14. Loop, C., Blinn, J.: Real-time gpu rendering of piecewise algebraic surfaces. In: SIGGRAPH 2006: ACM SIGGRAPH 2006 Papers, pp. 664–670. ACM Press, New York (2006) 15. de Toledo, R., Levy, B., Paul, J.C.: Iterative methods for visualization of implicit surfaces on gpu. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Paragios, N., Tanveer, S.-M., Ju, T., Liu, Z., Coquillart, S., Cruz-Neira, C., M¨ uller, T., Malzbender, T. (eds.) ISVC 2007, Part I. LNCS, vol. 4841, pp. 598–609. Springer, Heidelberg (2007) 16. de Toledo, R., Levy, B., Paul, J.C.: Reverse engineering for industrial-plant cad models. In: TMCE, Tools and Methods for Competitive Engineering, Izmir, Turkey, pp. 1021–1034 (2008) 17. Thompson, W., Owen, J., de St. Germain, H., Stark, S., Henderson, T.: Featurebased reverse engineering of mechanical parts. IEEE Transactions on Robotics and Automation, 57–66 (1999) 18. Fitzgibbon, A.W., Eggert, D.W., Fisher, R.B.: High-level CAD model acquisition from range images. Computer-aided Design 29, 321–330 (1997) 19. Petitjean, S.: A survey of methods for recovering quadrics in triangle meshes. ACM Comput. Surv. 34, 211–262 (2002) 20. Krus, M., Bourdot, P., Osorio, A., Guisnel, F., Thibault, G.: Adaptive tessellation of connected primitives for interactive walkthroughs in complex industrial virtual environments. In: Virtual Environments 1999. Proceedings of the Eurographics Workshop in Vienna, Austria, pp. 23–32 (1999)
Cartesian vs. Radial – A Comparative Evaluation of Two Visualization Tools Michael Burch, Felix Bott, Fabian Beck, and Stephan Diehl Computer Science Department, University of Trier, Germany {burchm,diehl}@uni-trier.de
Abstract. Many recently developed information visualization techniques are radial variants of originally Cartesian visualizations. Almost none of these radial variants have been evaluated with respect to their benefits over their original visualizations. In this work we compare a radial and a Cartesian variant of a visualization tool for sequences of transactions in information hierarchies. The Timeline Trees (TLT) approach uses a Cartesian coordinate system to represent both the hierarchy and the sequence of transactions whereas the TimeRadarTrees (TRT) technique is the radial counterpart which makes use of a radial tree, as well as circle slices and sectors to show the sequence of transactions. For the evaluation we use both quantitative as well as qualitative evaluation methods including eye tracking.
1
Introduction
Many radial visualizations can be produced by transforming a visualization from a Cartesian coordinate system into a radial coordinate system. For example, a rose diagram and a pie chart are radial variants of bar charts, and a star plot [1] is a radial variant of a parallel coordinates visualization [2], see Figure 1. Hierarchical data is also represented in many different ways, for example in a node-link and layered icicle approach. Not surprisingly, radial node-link visualizations have been developed [3] and the layered icicle technique has also been “radialized”, for example in Information Slices [4]. Furthermore, several recently developed visualization techniques combine radial visualizations, e.g. hierarchical edge bundles [5] combine radial icicles and radial trees, and Stargate [6] combines radial icicles and parallel coordinates. Looking at all these examples, the question arises what is the effect of the radial transformation on the useability. Radial visualizations are more difficult to implement and often look nicer than their Cartesian counterparts. But it remains an open question whether they better support users to comprehend data and extract knowledge. In this paper, we present two empirical studies comparing two visualization tools – a Cartesian one and its radial variant. The tools were developed for the visualization of sequences of transactions in information hierarchies. The rest of this paper is organized as follows. In Section 2 we discuss related work. Section 3 briefly introduces both visualization tools. Next, we present the G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 151–160, 2008. c Springer-Verlag Berlin Heidelberg 2008
152
M. Burch et al.
Fig. 1. Some radial visualizations and their Cartesian counterparts
design, results and limitations of our eye tracking study in Section 4. Finally, we draw some conclusions from our study in Section 5.
2
Related Work
The pros and cons of radial visualizations have mostly been discussed by their developers but rarely evaluated in a user study. An informal user study has been organized for the hierarchical edge bundling approach [5] for example. But the researchers did not evaluate if the radial layout would be better than a nonradial one. The developers of this system just found that the participants could gain insights in the adjacency relations in a hierarchical structure. They did not compare their different layouts against each other. Stasko et al. [7] compared the Sunburst technique to Treemaps [8] by evaluating the effectiveness and utility of both tools. They conducted two empirical studies for the two space-filling visualizations of the hierarchical data namely file and directory structures. The participants had to perform search tasks with both the rectangular Treemap method and the radial Sunburst technique. The authors found that the participants better understood the hierarchical structure with the radial tool. Four different tree representations are compared in [9]. The author examined in a user study that different tree layouts lead to the fact that users uncover different and sometimes complementary insights in the given data. He compared a treemap, a baloon layout [10], a hierarchical node-link, and – most important in the context of this paper – a radial tree layout. There has been a lot of work on visualizing information hierarchies using node-link diagrams [11], radial [12], or space-filling techniques like Treemaps [8], Information Slices [4], or Sunburst [13], but only few researchers have developed methods to visualize transactions between elements of a hierarchy [14,15,16,17]. The goal of this paper is to present the results of a comparative study of two tools – a Cartesian [18] and a radial one [19] – for visualizing sequences of transactions in information hierarchies.
Cartesian vs. Radial
3
153
Description of the Tools
Information hierarchies exist in many application domains, e.g. hierarchical organization of companies, or file/directory systems. In addition, there are relations between elements in these hierarchies. For example, employees are related if they communicate with each other, or files are related if they are changed simultaneously. Through these relations the participating elements together form a transaction. Often, we are not interested in a single transaction, but in a sequence of transactions that occur over time. In the following we explain the details of the two tools compared in the study: Timeline Trees (TLT) [18] which uses a Cartesian representation to visualize sequences of transactions in an information hierarchy, and TimeRadarTrees (TRT) [19] which is able to represent the same kind of data but in a radial style. Both tools can be separated into three views: The tree view as a traditional node-link diagram: The TLT approach places the information hierarchy on the left hand side of the whole view whereas the TRT visualization makes use of a radial tree that represents the leaves of the hierarchy on the circle circumference and the whole tree on top of the timeline view. Trees can be collapsed or expanded to an interactively selectable level in both tools. The timeline view with a space-filling approach: In TLT the sequence of transactions is visualized as sets of boxes, that are drawn from left to right in the diagram. We refer to this sequence of boxes as a timeline – in many applications time provides a natural order on the transactions. Each box represents one member element of a transaction and is positioned in the same column as the other members of this transaction and in the row of the according item. The TRT approach uses circle slices and circle sectors instead of rectangular boxes. The time axis starts in the circle center. The thumbnail view: In TLT thumbnails are displayed for every item or collapsed node at the right side of the tree diagram. They show the transactions from the perspective of the according node, such that only those transactions the node is member of are represented in the thumbnail using the selected color code, whereas the remaining transactions are only drawn as gray boxes. Thumbnails are a good tool for identifying correlations between nodes though they are very small. The TRT visualization uses radial thumbnails that are located outside the circle to avoid an overlap with the timeline view.
4
Eye Tracking Study
To attract participants to our study, we decided to use a data set related to soccer. The reason for choosing this kind of data is that soccer is well-known and easy to explain. Furthermore it is a real and an adequately representative data set. We found that a data set representing the number of ball contacts of players in a sequence of moves contains all the features that we need for the visualization
154
M. Burch et al.
Fig. 2. TLT representation of a soccer data set
tools. First of all, each soccer match is hierarchically organized in the following manner: One match consists of two teams which build the first level of the hierarchy. It is then further subdivided into team parts – the goalkeeper, the defense, the midfield and the offense. A move in the match is the set of players that have one or more ball contacts until the opposite team wins the ball possession. The number of ball contacts of each player in each move is recorded and is an indication for the weight of each player in that move. Two players of different teams can also be in the same move in a special kind of event when both players are ejected from the match simultaneously, for example, because of a red card or a substitution. In this way the whole match can be separated in a sequence of moves. A move corresponds to a transaction in the more general terminology of the visualization tools. We base our experiment on a real data set which was manually recorded from a soccer match between the national teams of Germany and the Netherlands in the World Cup Championships in 1990 played in Italy. It was the round of the last sixteen teams that Germany won 2 to 1. 4.1
The Population
The population that performed the evaluation consisted of 35 students (18 males, 17 females). The participants were split randomly into Group TLT (17 participants: 9 males, 8 females) and Group TRT (18 participants: 9 males, 9 females). All test persons participated voluntarily in the evaluation. Before the actual experiment the participants had to fill in a short questionnaire about their mathematical background, video gaming skills, and soccer interests. As we can see in
Cartesian vs. Radial
155
Fig. 3. TRT representation of a soccer data set
Table 1, both groups were relatively balanced with respect to their interest in soccer, but Group TRT had slightly better mathematical skills and Group TLT slightly more experience in video gaming. Table 1. Population TLT TRT Participants - total 17 18 - male 9 9 - female 8 9 Mathematical skills (1 very good, 6 very bad) - in school (∅) 2,76 2,47 - estimated current skills (∅) 3,35 3,11
4.2
TLT TRT Soccer interests - not at all 4 5 - some 8 10 - very interested 4 3 - plays soccer 2 2 3D-game playing (hours/week) 0,82 0,67
The Experiment
After finishing the initial questionnaire, the participants were asked to read a printed tutorial text about the visualization technique (either TRT or TLT). At the end of the tutorial text there were some initial questions for them to check whether they understood how to read the visualizations. The participants had 10 minutes for the tutorial. The actual experiment took 15 minutes and was performed with an eye tracking system (Tobii x50) that uses corneal reflection of infra-red light to locate the position and movement of the eye. The questions and visualizations were shown on a computer screen and two cameras mounted on the screen recorded
156
M. Burch et al.
Fig. 4. Correctness of answers for both groups
the eye movements at a frequency of 50 Hz, i.e. an image is taken every 20 ms. The visualizations were single screenshots of the TRT and the TLT tool showing between 13 to 157 transactions and between 8 to 22 leaf nodes. No interactive features were available. For the analysis of the recorded eye tracking data we used heatmap visualizations. To produce the heatmaps, points of fixations of several test persons have been combined. A fixation was registered by the system when a test person gazed at an area of 30 pixels radius at least for 100 ms. 4.3
Results
In this experiment the participants had to answer 18 questions. The last two were open questions, while the first 16 questions had clearly determined correct answers. These 16 questions and the overall results are shown in Figure 4. They can be grouped into three categories: warm-up questions, counting questions, and correlation questions.

Warm-up Questions. The warm-up questions were answered correctly by more than 90 percent, sometimes even 100 percent, of the participants.

Counting Questions. This type of question focuses on counting and summing items in different scenarios. As shown in Table 2, TLT outperformed TRT with respect to correctness of answers as well as with respect to response time for correct answers. Moreover, these two results are statistically significant¹. By examining the heatmaps we found that the participants did not use the thumbnails when answering these questions. This was expected because the main purpose of the thumbnails is the detection of relationships.

Correlation Questions. For correlation questions, which ask about relations between items, the participants could answer more questions correctly when using TRT, as shown in Table 2. Unfortunately, this result is not statistically significant.

¹ In the tables we have set the error probability of all statistically significant results (p < 0.05 with Bonferroni-Holm correction) in bold face.
Table 2. T-test analysis

Number of correct answers
                           Mean (TLT)  Mean (TRT)  T-Value  P-Value (2-sided)
- all 16 questions            11.83       11.06     -1.089       0.284
- counting questions           4.35        3.17     -3.436       0.002
- correlation questions        3.65        4.17      1.037       0.520

Response time for correct answers
                           Mean (TLT)  Mean (TRT)  T-Value  P-Value (2-sided)
- all 16 questions            16.69       21.55      3.060       0.004
- counting questions          17.40       23.07      2.511       0.017
- correlation questions       21.74       23.23      0.840       0.407
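To make the statistical procedure explicit: the comparison above is a standard two-sided two-sample t-test, with the Holm-Bonferroni procedure guarding against multiple comparisons. The following is a minimal Python sketch of this kind of analysis, not a reproduction of the study's computations; the per-participant score arrays are placeholders, since the raw study data are not published here.

```python
from scipy import stats

# Placeholder per-participant scores (numbers of correct answers);
# these are NOT the data of the study, only dummy values for illustration.
tlt_scores = [12, 11, 13, 10, 12, 11, 12, 13, 11, 12, 12, 11, 13, 12, 11, 12, 13]
trt_scores = [11, 10, 12, 11, 10, 11, 12, 10, 11, 12, 11, 10, 12, 11, 10, 11, 12, 11]

t_stat, p_value = stats.ttest_ind(tlt_scores, trt_scores)  # two-sided by default

def holm_bonferroni(p_values, alpha=0.05):
    """Return a list telling which hypotheses are rejected (Holm-Bonferroni)."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (len(p_values) - rank):
            rejected[i] = True
        else:
            break  # once one ordered test fails, all larger p-values fail too
    return rejected

# Applied to the three p-values of one table half, e.g. the correctness rows.
print(t_stat, p_value, holm_bonferroni([0.284, 0.002, 0.520]))
```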
Fig. 5. Heatmap for TRT (Correlation question)
Fig. 6. Heatmap for TLT (Correlation question)
After examining the heatmaps of the correlation questions we found that the participants using TRT looked at the thumbnails more intensively than those using TLT. For example, Figure 5 and Figure 6 show the heatmaps for “Which player played most often with Marco van Basten?”. In the TLT heatmap, one can easily see that there was almost no fixation on the thumbnails, whereas in the TRT heatmap there was a strong fixation at the thumbnail of Marco van Basten
and another one at the thumbnail of Ronald Koeman – the correct answer to this question. When looking at the heatmaps of those participants using TLT who answered correlation questions incorrectly, we often found that they did not make much use of the thumbnails.

Open Questions. All of the previous questions could be answered automatically with relatively simple database queries and no visualization at all. We think that the most important contribution of visualization tools is the exploration of large data sets, where we do not know in advance what to look for. For this, we also showed the participants the visualizations and asked the very general question "Can you detect any trends or anomalies?". In both groups a test person mentioned on average about 4.3 observations, but the observations varied between the two groups. For example, 14 participants using TRT found that two players (Rudi Völler and Frank Rijkaard) only took part in moves at the beginning of the visualized time period², but only 6 participants using TLT detected this anomaly. Looking at the heatmap shown in Figure 7, we realized that the participants using TLT did not inspect the periphery of the visualization, i.e., they did not fixate any of the four corners of the computer screen. Figure 7 shows that for TRT, due to its radial layout, this "blinders effect" did not occur.
Fig. 7. Heatmap for TRT and TLT (Open question)
4.4 Threats to Validity
There are various factors that limit the validity of the results of these kinds of studies. These include, for example, the choice of the data set, the choice of the questions, and the size of the data set for each question. Furthermore, while the eye tracker used is not very distracting (compared, for example, to a head-mounted one), it still requires the user to keep his or her head still.

² Both players had actually been ejected from the match.
Finally, TRT does not exploit one of the alleged advantages of radial displays, the possibility to put detailed information in the center and context information in the periphery. Thus, we could not evaluate this feature.
5 Conclusion
While the overall performance of the participants using TLT was better than the performance of those using TRT, the interpretation and thus effective use of the thumbnails worked better in TRT. One reason for this might be that it is easier to distinguish and remember locations in the radial layout. Radial visualizations are fashionable, and for some tasks they may even be superior to their Cartesian counterparts. In our empirical study, at least, the radial visualization could not keep up with the Cartesian one. Although TLT outperformed TRT overall, there is still some hope: the eye tracking experiment showed that the radial visualization did not lead to the "blinders effect", and that the radial thumbnails were more useful than the Cartesian ones. The study presented in this paper should only be considered a first step towards answering our initial question of whether radial visualizations better support users in comprehending data and extracting knowledge.
References
1. Chambers, J.M., Cleveland, W.S., Tukey, P.A.: Graphical methods for data analysis (The Wadsworth statistics/probability series). Duxbury Press (1983)
2. Inselberg, A., Dimsdale, B.: Parallel Coordinates: A Tool for Visualizing Multidimensional Geometry. In: Proc. IEEE Visualization 1990, San Francisco, October 23-25, pp. 361–378. IEEE Computer Society Press (1990)
3. Eades, P., Whitesides, S.: Drawing Graphs in Two Layers. Theoretical Computer Science 131(2), 361–374 (1994)
4. Andrews, K., Heidegger, H.: Information Slices: Visualising and Exploring Large Hierarchies using Cascading, Semi-Circular Discs (Late Breaking Hot Topic Paper). In: Proc. of the IEEE Symposium on Information Visualization (INFOVIS 1998), Research Triangle Park, NC, pp. 9–12 (1998)
5. Holten, D.: Hierarchical edge bundles: Visualization of adjacency relations in hierarchical data. IEEE Transactions on Visualization and Computer Graphics 12(5), 741–748 (2006)
6. Ogawa, M., Ma, K.L.: StarGate: A Unified, Interactive Visualization of Software Projects. In: Proc. of IEEE VGTC Pacific Visualization Symposium (PacificVis) 2008, Kyoto, Japan (2008)
7. Stasko, J., Catrambone, R., Guzdial, M., McDonald, K.: An Evaluation of Space-Filling Information Visualizations for Depicting Hierarchical Structures. International Journal of Human-Computer Studies 53(5), 663–694 (2000)
8. Johnson, B., Shneiderman, B.: Tree-maps: A space-filling approach to the visualization of hierarchical information structures. In: Proc. of the IEEE Visualization Conference, San Diego, CA, pp. 284–291 (1991)
9. Teoh, S.: A Study on Multiple Views for Tree Visualization. In: Proc. of SPIE-IS&T Electronic Imaging, Visualization and Data Analysis (VDA) 2007, vol. 6495, pp. B1–B12 (2007)
10. Teoh, S., Ma, K.L.: RINGS: A technique for visualizing large hierarchies. In: Proc. of the 10th International Symposium on Graph Drawing, pp. 268–275 (2002)
11. Reingold, E.M., Tilford, J.S.: Tidier drawings of trees. IEEE Transactions on Software Engineering 7, 223–228 (1981)
12. Yee, K.P., Fisher, D., Dhamija, R., Hearst, M.: Animated exploration of dynamic graphs with radial layout. In: Proc. of the IEEE Symposium on Information Visualization, San Diego, CA, USA (2001)
13. Stasko, J., Zhang, E.: Focus+Context Display and Navigation Techniques for Enhancing Radial, Space-Filling Hierarchy Visualizations. In: Proc. of the IEEE Symposium on Information Visualization (InfoVis 2000), Salt Lake City, UT, pp. 57–65. IEEE Computer Society Press, Los Alamitos (2000)
14. Neumann, P., Schlechtweg, S., Carpendale, S.: ArcTrees: Visualizing Relations in Hierarchical Data. In: Brodlie, K.W., Duke, D.J., Joy, K.I. (eds.) Data Visualization, Eurographics/IEEE VGTC Symposium on Visualization, Aire-la-Ville, Switzerland (2005)
15. Fekete, J.D., Wang, D., Dang, N., Aris, A., Plaisant, C.: Overlaying Graph Links on Treemaps. In: Poster Compendium of the IEEE Symposium on Information Visualization (INFOVIS 2003). IEEE, Los Alamitos (2003)
16. Zhao, S., McGuffin, M.J., Chignell, M.H.: Elastic Hierarchies: Combining Treemaps and Node-Link Diagrams. In: Proc. of the IEEE Symposium on Information Visualization (INFOVIS 2005), Minneapolis, MN, USA. IEEE, Los Alamitos (2005)
17. Burch, M., Diehl, S.: Trees in a Treemap. In: Proc. of the 13th Conference on Visualization and Data Analysis (VDA 2006), San Jose, California (2006)
18. Burch, M., Beck, F., Diehl, S.: Timeline Trees: Visualizing Sequences of Transactions in Information Hierarchies. In: Proc. of the 9th International Working Conference on Advanced Visual Interfaces (AVI 2008), Naples, Italy (2008)
19. Burch, M., Diehl, S.: TimeRadarTrees: Visualizing Dynamic Compound Digraphs. In: Proc. of the 10th Joint Eurographics/IEEE-VGTC Symposium on Visualization (EuroVis 2008), Eindhoven, The Netherlands (2008)
VoxelBars: An Informative Interface for Volume Visualization

Wai-Ho Mak, Ming-Yuen Chan, Yingcai Wu, Ka-Kei Chung, and Huamin Qu
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
{nullmak,pazuchan,wuyc,kkchung,huamin}@cse.ust.hk
Abstract. In this paper, we present VoxelBars as an informative interface for volume visualization. VoxelBars arrange voxels into a 2D space and visually encode multiple attributes of voxels into one display. VoxelBars allow users to easily find out clusters of interesting voxels, set opacities and colors of a specific group of voxels, and achieve various sophisticated visualization tasks at voxel level. We provide results on real volume data to demonstrate the advantages of the VoxelBars over scatterplots and traditional transfer function specification methods. Some novel visualization techniques including visibility-aware transfer function design and selective clipping based on VoxelBars are also introduced.
1 Introduction
Direct volume rendering is a useful technique for scientific visualization. To visualize a given volume dataset, users specify transfer functions which define the assignment of optical properties to voxels. Various structures within the volume can be revealed with suitable transfer functions. However, the specification of transfer functions is usually non-trivial and requires expertise. Traditionally, voxels are grouped together for transfer function specification in histograms or scatterplots according to at most two voxel attributes such as density and gradient magnitude. To achieve sophisticated visualization tasks, more attributes of voxels other than density and gradient magnitude should be considered. However, it is unclear how to properly represent these additional attributes in one display. It is also difficult to work with more than two attributes. To alleviate the difficulty in manipulating transfer functions, we propose a novel visual representation for voxels, namely VoxelBars, to assist the process. It is inspired by the pixel bar chart [1], which is a well-established approach in information visualization to encode multivariate data. As voxels have multiple attributes and the pixel bar chart is a widely used visual technique for multivariate data exploration in the information visualization field, it inspires us to extend the pixel bar chart representation to the VoxelBars representation and tailor it for volume visualization. Voxels can be packed into VoxelBars and their attributes can be encoded into various visual channels effectively. Users can easily rearrange, select, and edit a single voxel or a group of voxels and specify transfer functions for them with multiple attributes of voxels taken into account. Because
multiple attributes of voxels can be conveniently encoded into one display using VoxelBars, attributes such as visibility information and distance to the view plane can be added for volume data exploration. While voxels can be sorted and grouped according to different criteria, VoxelBars are comprehensible to users and make voxel selection and operations easy and efficient. In this paper, we investigate how to conduct complicated volume visualization tasks which must take multiple voxel attributes into consideration simultaneously. We propose VoxelBars, which can visually encode up to 5 data attributes into one informative display and allow users to easily specify transfer functions and edit volumes at the voxel level, given a comprehensive overview of the volume data. To address the special challenges and characteristics of volume visualization, we introduce a new construction algorithm and a set of interaction tools for VoxelBars. The visualization pipeline and encoding schemes for VoxelBars are also presented. We demonstrate the advantages of VoxelBars over scatterplots and traditional transfer function specification methods by experiments. Some novel visualization techniques such as selective clipping and visibility-aware transfer function design based on VoxelBars are also introduced.
2 Previous Work
Transfer Function Design. Searching for a proper transfer function is a challenging problem. He et al. [2] used a stochastic search technique to find good transfer functions. A semi-automatic transfer function generation method based on the histogram volume was proposed in [3]. Kniss et al. [4] developed several useful manipulation widgets for transfer function design. Design Galleries [5] is a useful tool with which users progressively refine the transfer function by choosing the best image in the gallery. Recent work includes a painting interface [6] and a semantic interface [7] for transfer function design. We refer readers to an extensive survey [8] for other work in this area.

Information Visualization for Volume Exploration. Recently, some information visualization methods have been employed in volume visualization and exploration. Wang and Shen [9] proposed the use of the LOD Map in the visual analysis of multiresolution images for volume visualization. Tory et al. [10] used parallel coordinates to help search for proper visualization parameters. Akiba and Ma [11] also utilized parallel coordinates for time-varying multivariate volume data. Star coordinates are also exploited in [12]. In this paper, we are inspired by the pixel bar chart proposed by Keim et al. [1]. It is useful for a wide range of applications [13,14]. We extend its basic idea for volume visualization.
3 VoxelBars
The VoxelBars representation extends the space-filling pixel bar chart [1] for volume visualization. The pixel bar chart, which is mainly derived from the traditional bar chart, combines and takes advantage of both the traditional bar chart and scatterplot (2D histogram) representations.
Fig. 1. (a) VoxelBars with an active editing region. The attributes of a voxel can be encoded by its x-/y- category, in-bar x-/y- position, and color. (b) A flow chart demonstrating the use of VoxelBars in the visualization process: choose an encoding scheme, generate VoxelBars, manage VoxelBars, select active voxels, operate on the active voxels, and visualize the results.
Its basic idea is to represent each data item by a pixel within a bar in which the data is categorized. Besides the category, various other attributes of the data items are also encoded by the position and color of the pixels. Thus, compared with traditional bar charts, it can further encode the details of each data item in addition to the number of items in each category. Fig. 1(a) shows the layout and visual encoding channels of VoxelBars; these channels can be used to encode any voxel attributes for visualization tasks. Like traditional bar charts, in which the data items are distributed to different bars according to their types, voxels are partitioned into different x-categories according to a certain attribute Dx. Voxels within an x-category can be further partitioned vertically into different y-categories according to some other attribute Dy. Each category is also called a bar, which represents a discrete class of voxels. Voxels are distributed into pixels according to two of their attributes, namely Ox and Oy, which are used for in-bar x-ordering and in-bar y-ordering. The pixel bar chart uses a 1-to-1 mapping between data items and pixels. However, a volume dataset usually contains many more voxels than the total number of pixels available on the screen. Therefore, in VoxelBars, each pixel in a bar represents a similar number of voxels instead. In addition, the pixel color can be used to encode one more attribute, namely the color attribute C. Different channels in VoxelBars serve different purposes. The partitioning attributes Dx and Dy encode some discrete categorization of voxels such as their density ranges. On the other hand, the in-bar ordering attributes Ox and Oy encode continuous attributes of voxels such as their gradient magnitude values. The color attribute C is used to show visual patterns for users to better understand the data and find out interesting clusters of voxels.

3.1 Construction Algorithm
To arrange voxels into a dense 2D space properly according to their x-/y- ordering attributes, while maintaining the constraint that each pixel has to represent a similar number of voxels, is a very difficult optimization problem. A heuristic algorithm for this problem has been proposed in the pixel bar chart paper [1]. However, since for a volume dataset there are an enormous number of voxels to be placed, the heuristic algorithm proposed in that paper becomes too computation- and memory-intensive in our situation. Therefore, based on that
existing approach, we propose another heuristic algorithm which is more efficient. Our proposed algorithm is as follows:

1. All the voxels are first categorized into different x-categories according to Dx. Each x-category has a fixed height h. We can calculate its width w by w = n/(h × p), where n is the number of voxels assigned to it and p is the number of voxels per pixel set by the users beforehand. If w is smaller than a user-defined threshold, the x-category is merged with a neighboring category until it is wide enough for clear display and easy selection. If the category is further divided vertically into smaller categories according to Dy, each y-category shares the same width as the parent x-category and has a height proportional to its number of voxels.
2. We now need to distribute the voxels into pixels in each bar (category). For a bar of size w × h, we try to evenly partition the voxels into h buckets (non-intersecting sets). First, we sort the voxels according to Oy and find all the h-quantiles q1, q2, ..., qh, where qk is the kth h-quantile. A voxel with an Oy attribute in the range [qk−1, qk) is assigned to the kth bucket Bk for k = 2, 3, ..., h − 1, while voxels with Oy attributes less than q1 are assigned to the first bucket B1 and voxels with Oy attributes greater than or equal to qh−1 are assigned to the last bucket Bh.
3. We fill the voxels into pixels row by row. In each bar, the kth row is associated with the kth bucket Bk, for k = 1, 2, ..., h. For the kth row, there are w pixels available for the |Bk| voxels from Bk. We fill from left to right, from the 1st pixel to the wth pixel. To fill the kth row, we first sort the voxels in Bk according to Ox. Then the voxels are filled evenly into the pixels in that order so that the 1st pixel contains the voxels with the smallest Ox values and the wth pixel contains the largest ones. After the filling, each pixel contains around p voxels. For the color of a pixel, the average value of the color attribute C of all the voxels it represents is calculated. A lookup table is used to find the color of that pixel according to this average value.
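To make the per-bar filling concrete, the following is a minimal Python/NumPy sketch of steps 2 and 3 for a single bar. The function name fill_bar, the array-based interface, and the use of quantiles via searchsorted are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fill_bar(ox, oy, color_attr, w, h):
    """Distribute the voxels of one bar into a w x h pixel grid.

    ox, oy, color_attr are 1D arrays (one entry per voxel) holding the in-bar
    ordering attributes Ox, Oy and the color attribute C.
    Returns an (h, w) array of per-pixel average color values.
    """
    # Step 2: split the voxels into h buckets using the h-quantiles of Oy.
    quantiles = np.quantile(oy, np.linspace(0.0, 1.0, h + 1))
    bucket_of = np.searchsorted(quantiles[1:-1], oy, side="right")  # values 0 .. h-1

    pixel_color = np.zeros((h, w))
    for k in range(h):
        members = np.where(bucket_of == k)[0]
        if members.size == 0:
            continue
        # Step 3: sort the bucket by Ox and spread it evenly over the w pixels.
        ordered = members[np.argsort(ox[members])]
        for col, chunk in enumerate(np.array_split(ordered, w)):
            if chunk.size:
                # Pixel color = average of the color attribute C of its voxels.
                pixel_color[k, col] = color_attr[chunk].mean()
    return pixel_color
```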
3.2 User Interactions
While VoxelBars can efficiently arrange voxels in a 2D space and visually encode multiple attributes simultaneously, a set of effective user interaction techniques specially designed for volume visualization and exploration has also been developed to help users perform operations on voxels. Besides typical interactions like zooming, panning, and filtering, our system provides the following advanced features:
– Selecting the active region: The users can select active voxels by simply brushing on VoxelBars for further manipulation. Because VoxelBars is essentially an image, any well-established selection technique for images can be supported. To help the users perform accurate voxel selection easily, we have developed extended brushing. In VoxelBars, voxels are not positioned in
the 2D space at the same absolute scale, so voxels with similar attributes may not fall into the same region. The extended brushing can extend the selection to nearby unselected voxels whose properties are similar to the user selection (a minimal sketch is given after this list).
– Manipulating voxels: After the voxels are selected, different optical properties, like the color and opacity, can be assigned to the selected voxels.
– Manipulating bars: Multiple bars can be merged into one for easier batch operations. A merged bar can also be split up back into multiple bars.
– Thumbnail images: For each bar, a direct volume rendered image for the subvolume represented by the bar can be displayed on demand.
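The following is a minimal sketch of how such extended brushing could work; the pixel-level attribute dictionary, the 4-neighbourhood, and the tolerance test are assumptions for illustration, not the authors' implementation.

```python
from collections import deque

def extended_brush(selected, pixel_attrs, tol):
    """Grow a set of selected VoxelBars pixels to neighbouring pixels whose
    attribute values are within `tol` of an already selected neighbour.

    selected    : set of (x, y) pixel coordinates brushed by the user.
    pixel_attrs : dict mapping (x, y) to a representative attribute value.
    """
    grown = set(selected)
    queue = deque(selected)
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (nx, ny) in pixel_attrs and (nx, ny) not in grown \
                    and abs(pixel_attrs[(nx, ny)] - pixel_attrs[(x, y)]) <= tol:
                grown.add((nx, ny))
                queue.append((nx, ny))
    return grown
```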
4 Visualization with VoxelBars
An overview of the visualization process with VoxelBars is shown in Fig. 1(b). While it is possible to specify the transfer function directly in the VoxelBars from scratch, it is more efficient for users to manually tune a simple 1D transfer function or use some automatic transfer function generation approach to start with. With a chosen encoding scheme, voxels are properly arranged into the VoxelBars accordingly. Users can perform bar management operations such as merging and splitting to organize the bars in the desired way. After that, interesting voxels can be selected with the technique introduced previously. Various operations such as editing and transfer function specification can then be performed on selected voxels. Finally, the resulting image will be generated. The procedure can be repeated to further refine the result.
5 Experiments and Applications
We designed three experiments to demonstrate the effectiveness and applications of VoxelBars. Our system is tested on a standard PC (Intel Core2Duo T2500 2.0 GHz, 1 GB RAM, ATI Radeon X1400). The sizes of the datasets used range from 128 × 128 × 231 to 256 × 256 × 161. It takes 3.5 s to 55.2 s to preprocess the attribute values and arrange the voxels in the VoxelBars for the first time. Our program is not yet highly optimized for speed, but the whole process can be significantly accelerated using GPUs, as the procedures involved are highly parallel. After the VoxelBars generation, our system allows real-time user interaction.

Comparison between VoxelBars and Scatterplots. A scatterplot is a chart which arranges voxels into a 2D space according to two attributes of the voxels. It can reveal the correlation between two variables and is also used as an interface for data exploration and transfer function specification. There are several common characteristics shared by both scatterplots and VoxelBars. Voxels with different attribute values are represented as pixels in both charts. However, the scatterplot display can be sparse, and an arbitrary number of voxels can be condensed into one single pixel if they have similar attributes. Therefore, each pixel in the plot may represent a very different number of voxels, and the actual number of voxels represented by a pixel is unclear to users.
Fig. 2. Comparison between (a) scatterplot and (b) VoxelBars. The regions enclosed by dashed lines in (a) and (b) correspond to the same group of voxels. After removing the blue cluster in that region, the inner bone structures become visible with the outer layer of the skin preserved as shown in (c).
Though one may use the pixel color to represent the number of voxels behind each pixel, that would waste one important visual channel which could otherwise be used for encoding another attribute. Condensing a large cluster of voxels into one single pixel also makes selection difficult and imprecise. That single pixel representing a large cluster may go unnoticed by the users even with the help of color encoding. VoxelBars, in contrast, ensure that each pixel in the chart represents a similar number of voxels. The number of voxels in a region is directly proportional to the area of the corresponding region in the chart. By using VoxelBars, different classes of voxels can be clearly shown, and this enables fine manipulation of the voxels with respect to their classes. In addition to better screen space utilization as a space-filling representation, VoxelBars also facilitate more precise voxel selection and help interesting pattern discovery. Furthermore, VoxelBars can display more attributes simultaneously than scatterplots. We use a CT foot dataset of size 152 × 261 × 220 in our experiment. The CT foot is first visualized using a manually-tuned 1D transfer function. A scatterplot (Fig. 2(a)) and the VoxelBars (Fig. 2(b)) are then generated. The density and gradient magnitude are encoded in the x- and y- axes in both charts, while the gradient magnitude is also encoded by pixel color. In the VoxelBars, some distinguishable clusters can be seen. It is observed from the pixel colors and bar widths that there is a density range containing a large cluster of low-gradient voxels in the middle of the chart (blue region), as enclosed by dashed lines in Fig. 2(b). Thumbnail images show that the cluster is the inner homogeneous region of the foot tissue. The corresponding region in the scatterplot is also enclosed by dashed lines. However, that region in the scatterplot is much smaller and less clear than that in the VoxelBars due to serious overlapping and under-utilization of the available space. We select that big blue cluster in the VoxelBars using extended brushing. The voxels selected are then removed, giving the effect in Fig. 2(c). The result shows that some meaningful clusters can be more easily identified in VoxelBars and that the sizes of clusters are faithfully reflected. VoxelBars better utilize the available space in the chart to represent those voxels. Also, with a much larger display area for the voxels, they allow easier and more accurate voxel selection.
Fig. 3. Experiment on the tooth dataset: (a) The initial direct volume rendered image (DVRI) and the corresponding VoxelBars in which the blue region corresponds to voxels with low visibility. (b) The DVRI after assigning low opacity to voxels selected (in whitish color) in a wide bar on the left. (c) The DVRI after making the selected voxels in the VoxelBars transparent.
Multi-dimensional Transfer Function Specification. VoxelBars can facilitate multi-dimensional transfer function design. As voxels are organized in the chart with respect to their attribute values, different classes of structures can be recognized as clusters of pixels in the chart. Different attributes of the voxels can be considered with reference to the employed encoding scheme. Users can specify different visual properties for individual voxels according to their attribute values. After generating VoxelBars using multiple voxel attributes, the users can select a region in VoxelBars and set opacities and colors for all the selected voxels. We use a tooth dataset (256 × 256 × 161) as an example. There are different layers present in a tooth. However, the interior structure (i.e., the pulp) may be occluded by the outer layer (i.e., the enamel). We try to construct a good 3D transfer function based on density, gradient magnitude, and visibility. Voxel visibility is useful information for transfer function design. It can be estimated by the voxel opacity contribution αcontrib = (1 − αaccum) × α, which can be readily obtained during the ray casting process. Voxels in occluded regions have low visibility. Making the voxels of high visibility more transparent reveals the occluded structures. Using VoxelBars, we can effectively reveal hidden structures with low visibility. We group the voxels into bars of different intensity ranges. In each bar, we further encode gradient magnitude and visibility into the in-bar x- and y-axes respectively (see Fig. 3(a)). Notice that two of the bars have some pixels with different colors at the top, meaning that there are clusters of voxels with relatively high visibility in these two bars. From the color we can see that the left one contains more visible voxels than the right one. The initial direct volume rendered image (Fig. 3(a)), which is rendered with a simple transfer function, shows that the most visible structure is the outermost enamel layer. Therefore, it is logically guessed that the colorful wide bar on the left corresponds to the outermost enamel layer. Checking with the corresponding thumbnail image confirms this. We select those voxels with high visibility and low gradient magnitude in that bar as shown in Fig. 3(b). The selected voxels in the bar are highlighted in a whitish color. They belong to the lower part of the outer layer as they have a relatively higher visibility and lower gradient magnitude. It is noted that the
(a)
(b)
(c)
Fig. 4. Experiment on the CT head dataset: (a) A direct volume rendered image (DVRI) and its corresponding VoxelBars in which the blue region corresponds to voxels with low visibility. (b) The DVRI after clipping the voxels selected (in whitish color) in a merged bar on the left. (c) The DVRI after further clipping the voxels selected (in whitish color) in three bars on the right.
upper part of the outer layer has a high gradient magnitude because the inner structure in the upper part has quite different density values from those of the outer layer. Therefore, the selected voxels of low gradient magnitudes belong to the lower part of the outer layer. We remove that part of the outer layer by assigning a low opacity to them. The result is shown in Fig. 3(b). As intended, part of the outer layer of high gradient magnitude is not removed at the crown of the tooth. To remove that as well, the selection is extended in the same chart to include the voxels with high gradient magnitude (see Fig. 3(c)). Most of the occluding layers are then removed in Fig. 3(c). These voxels can only be effectively removed by considering the visibility attribute. This example demonstrates how a 3D transfer function can be easily specified using the VoxelBars. In addition, we introduce the visibility as an useful attribute in transfer function design for effective visualization. The VoxelBars can also be utilized to track the visibility change of the whole volume or a certain region of interest by encoding the maximum visibility for each voxel during the visualization process. Users can therefore always be notified of the existence of any unrevealed structures. Selective Clipping. Selective clipping is another useful application of VoxelBars. Typical clipping methods define clipping regions using clipping planes. These regions are made invisible to reveal the occluded inner structures. However, some meaningful structures may also be unnecessarily removed by the planes. Different from typical clipping methods, our VoxelBars allow users to perform a selective clipping on the attribute domain and can preserve those meaningful structures in the process. To deliver a clipping effect, the distance to the view-plane is considered as an attribute in the VoxelBars. We use a CT head dataset of size 128 × 128 × 231 to demonstrate the idea. In the VoxelBars, we encode the density, distance to the view-plane, and visibility with the x -category, in-bar x - and y- axes respectively (Fig. 4(a)). In the chart, two distinct groups of bars with relatively high visibility can be easily recognized. Thumbnails are generated to identify the corresponding classes of structures (i.e., mainly skin and skull). Being an exterior structure, the skin is more visible, as shown in the chart. We perform a clipping on the bars (on the left) belonging to
The objective is to remove the front skin to reveal the inner structures. In the bar belonging to the skin, we select the voxels with high visibility and small distance to the view-plane (Fig. 4(b)). This captures the characteristic of the skin occluding the view. We can adjust the clipping distance by changing the selection with respect to the distance to the view-plane. By assigning zero opacity to those selected voxels, we deliver a clipping effect (Fig. 4(b)). Other structures (e.g., bones) are preserved in the clipping process as they belong to separate bars. Also, unseen tissues that have densities similar to the skin are preserved as they only have low visibility. To further remove the bones and reveal the soft tissue behind them, we carry out similar operations on the bars of the bones by considering the distance to the view-plane. The voxels close to the view-plane are selected. With the help of the extended brush, voxels in neighboring bars with similar properties are also selected (Fig. 4(c)). We make the skull semi-transparent by lowering the opacity of the voxels (Fig. 4(c)). Clipping effects can be selectively applied to different structures in the volume, and the clipping of the structures can be performed individually and incrementally.
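Both experiments rely on the per-voxel visibility attribute, estimated from the opacity contribution αcontrib = (1 − αaccum) × α during ray casting. The following is a minimal Python sketch of how such a visibility volume could be accumulated in a front-to-back ray caster with nearest-neighbour sampling; the function names, the sampling scheme, and the early-termination threshold are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def accumulate_visibility(volume, opacity_tf, rays, n_steps=256):
    """Estimate per-voxel visibility as the maximum opacity contribution
    (1 - alpha_accum) * alpha seen along any viewing ray.

    volume     : 3D array of densities.
    opacity_tf : function mapping a density sample to an opacity alpha in [0, 1].
    rays       : iterable of (origin, direction) pairs in voxel coordinates.
    """
    visibility = np.zeros(volume.shape, dtype=np.float32)
    for origin, direction in rays:
        alpha_accum = 0.0
        pos = np.asarray(origin, dtype=np.float32)
        step = np.asarray(direction, dtype=np.float32)
        for _ in range(n_steps):
            idx = tuple(np.round(pos).astype(int))
            if not all(0 <= i < s for i, s in zip(idx, volume.shape)):
                break                                   # ray left the volume
            alpha = opacity_tf(volume[idx])
            contrib = (1.0 - alpha_accum) * alpha       # opacity contribution of this voxel
            # Track the maximum contribution seen for this voxel over all rays.
            visibility[idx] = max(visibility[idx], contrib)
            alpha_accum += contrib                      # front-to-back compositing
            if alpha_accum > 0.99:                      # early ray termination
                break
            pos += step
    return visibility
```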
6 Discussion
Different visual channels in VoxelBars can encode up to 5 different attributes. It is, however, often useful to use more than one channel to encode the same attribute. Therefore, in the second and third experiments, we used both in-bar y-ordering and color to encode visibility. With this strategy, the voxels of similar visibility can be clustered together spatially. This allows easy pattern observation and voxel selection. For instance, in Fig. 3(a), a cluster of highly visible voxels can be immediately observed and easily selected for manipulation.
7 Conclusion
An effective volume visualization technique has been proposed to present the data in a comprehensible layout. Voxels in the volume are organized into VoxelBars, which make good use of the available space for presenting the voxels and effectively encode multiple attributes into various visual channels. Interesting patterns can be easily and accurately perceived by users, reducing their effort in exploring the data. This facilitates multivariate volume visualization, multi-dimensional transfer function design, and volume editing. Accurate and flexible transfer function specification can be performed at the voxel level. Different useful utilities are also proposed to assist the visualization process. Our system allows sophisticated exploration of the volume through a series of editing procedures with the interaction tools. The experiments show that our system can accomplish various sophisticated tasks that are difficult to achieve using traditional approaches.
Acknowledgments. This work was partially supported by Hong Kong RGC grant CERG 618705.
References
1. Keim, D.A., Hao, M.C., Dayal, U.: Hierarchical pixel bar charts. IEEE Trans. on Visualization and Computer Graphics 8, 255–269 (2002)
2. He, T., Hong, L., Kaufman, A., Pfister, H.: Generation of transfer functions with stochastic search techniques. In: IEEE Visualization, pp. 227–234 (1996)
3. Kindlmann, G., Durkin, J.W.: Semi-automatic generation of transfer functions for direct volume rendering. In: IEEE Symposium on Volume Visualization, pp. 79–86 (1998)
4. Kniss, J., Kindlmann, G.L., Hansen, C.D.: Interactive volume rendering using multi-dimensional transfer functions and direct manipulation widgets. In: IEEE Visualization (2001)
5. Marks, J., Andalman, B., Beardsley, P.A., Freeman, W., Gibson, S., Hodgins, J.K., Kang, T., Mirtich, B., Pfister, H., Ruml, W., Ryall, K., Seims, J., Shieber, S.M.: Design galleries: a general approach to setting parameters for computer graphics and animation. In: SIGGRAPH, pp. 389–400 (1997)
6. Tzeng, F.Y., Lum, E.B., Ma, K.L.: A novel interface for higher-dimensional classification of volume data. In: IEEE Visualization, pp. 505–512 (2003)
7. Rezk-Salama, C., Keller, M., Kohlmann, P.: High-level user interfaces for transfer function design with semantics. IEEE Trans. on Visualization and Computer Graphics 12, 1021–1028 (2006)
8. Kindlmann, G.: Transfer functions in direct volume rendering: Design, interface, interaction. In: Course notes of ACM SIGGRAPH (2002)
9. Wang, C., Shen, H.W.: LOD map – a visual interface for navigating multiresolution volume visualization. IEEE Trans. on Visualization and Computer Graphics 12, 1029–1036 (2006)
10. Tory, M., Potts, S., Moller, T.: A parallel coordinates style interface for exploratory volume visualization. IEEE Trans. on Visualization and Computer Graphics 11, 71–80 (2005)
11. Akiba, H., Ma, K.L.: A tri-space visualization interface for analyzing time-varying multivariate volume data. In: EuroVis, pp. 115–122 (2007)
12. Bordignon, A.L., Castro, R., Lopes, H., Lewiner, T., Tavares, G.: Exploratory visualization based on multidimensional transfer functions and star coordinates. In: SIBGRAPI, pp. 273–280 (2006)
13. Hao, M.C., Keim, D.A., Dayal, U., Schneidewind, J., Wright, P.: Geo pixel bar charts. In: IEEE Visualization, p. 89 (2003)
14. Lyons, M.: Value-cell bar charts for visualizing large transaction data sets. IEEE Trans. on Visualization and Computer Graphics 13, 822–833 (2007)
Wind Field Retrieval and Display for Doppler Radar Data

Shyh-Kuang Ueng and Yu-Chong Chiang
Department of Computer Science, National Taiwan Ocean University, Keelung City, Taiwan 202
[email protected],
[email protected]
Abstract. Doppler radars are useful facilities for weather data gathering. In this paper, we propose a visualization pipeline to extract and display horizontal wind fields for Doppler radar data. At first, the input radar data are filtered with adaptive filters to reduce noise and enhance features. Then the horizontal wind field is computed by using a hierarchical optical flow method. In the visualization stage, a multi-level streamline construction method is employed to generate evenly-spaced streamlines to reveal the wind field structures.
1 Introduction
Strong wind and torrential rain can cause severe property damage and casualties. Monitoring and predicting these meteorological phenomena is crucial to our safety. Doppler radars are powerful facilities for the measurement of weather data. A Doppler radar sends out microwave pulses toward the air. When the waves hit objects, small fractions of the waves are scattered back to the radar antenna. The reflectivities are transformed into scalar values to represent the density of precipitation in the air. However, a Doppler radar cannot estimate the wind field directly. It can only measure the speed of wind in the radial direction, based on the frequency shift of the scattered radar waves. In this article, a visualization pipeline is presented to retrieve and display the wind field hidden in Doppler radar data. The structure of the pipeline is illustrated in Fig. 1. At the first stage, the raw radar data are filtered and re-sampled. At the second stage, a hierarchical optical flow method is employed to compute the horizontal wind field. Then a multi-level streamline tracing algorithm is utilized to produce evenly-spaced streamlines. At the final stage, the wind field is illustrated by using streamline images. Our system possesses the following improvements over other radar data visualization systems: First, improved filters are employed for smoothing and re-sampling the data. Second, a hierarchical optical flow method is used to compute the horizontal wind fields such that more accurate results are produced. Third, the multi-resolution streamline construction method is easier to implement and possesses the capability of multi-resolution wind field visualization. The rest of this paper is organized as follows: In Section 2, related research on meteorological data processing and vector field visualization is
introduced. Our filtering and resampling methods are presented in Section 3. The hierarchical optical flow method is described in Section 4. The multi-resolution streamline tracing method is presented in Section 5. Test results and conclusions are given in Section 6.
Fig. 1. The four stages of the pipeline for wind field visualization: filtering and resampling, wind field computation, streamline tracing, and wind field visualization
2 Related Work
Many software systems have been designed for processing meteorological data. A system called D3D is proposed in [1]. It is developed based on its predecessor, Vis5D. New functionalities are added such that both 3D and 2D meteorological data can be processed. In [2], a 4D graphics system is presented for processing meteorological data. Weather data are illustrated by using 2D or 3D graphics techniques. Textures are added to enhance image quality. More graphics systems for processing weather data are reviewed in [3]. These systems are useful for weather data analysis. However, they are designed to cope with different types of weather data and do not focus on the processing of Doppler radar data. In [4], a multiresolution method is presented for visualizing Doppler radar reflectivities. In their method, radar data are resampled in a hierarchical structure of tetrahedral meshes. The data of coarse meshes are used for displaying small variations while fine-resolution data are rendered for exploring detailed features. Djurcilov and Pang develop a visualization technique for processing Doppler radar data. In their method, isosurfaces are created to show the distribution of precipitation in the air [5]. Jang et al. design an LoD visualization system for displaying Doppler radar data [6]. They use a hierarchical data structure to split radar data. Splatting is used to render radar reflectivities [6]. In [7], a numerical procedure is developed to interpolate and visualize Doppler radar reflectivities. The gradient and diffusion of reflectivity are utilized to enhance cloud structures. These visualization methods focus on displaying precipitation. Little or no effort is paid to the retrieval and visualization of wind fields for Doppler radar data. Since a wind field is a vector field, flow visualization methods can be used to explore wind fields. Streamline images are the most powerful media for displaying flow fields. In LIC methods, an enormous number of streamlines are traced and convolved with noise to depict fine flow structures [8]. Many algorithms have been proposed to create evenly-spaced streamlines for flow visualization [9,10,11]. In these methods, special structures like regular grids or triangle meshes are used to control streamline densities and seeding. Therefore the domain is uniformly covered with streamlines, and the flow patterns are revealed. These methods may suffer from slow convergence, high implementation complexity, or the lack of multi-resolution visualization capability.
Scientists have developed algorithms to calculate wind fields by using multiple radars [12,13]. These methods use the wind speeds measured by two or more Doppler radars and augment them with the mass conservation law to compute wind fields. For regions covered by only one radar, different approaches are employed for the computation. An approach called TREC is proposed in [14]. This method uses two consecutive radar data frames to estimate the wind field. At first, the two frames are divided into blocks. Then the correspondences between the blocks in the two frames are searched. By computing the moving distances of the blocks, the wind field is retrieved. This method is simple but error-prone. In [15], an improved method, called COTREC, is presented. This method uses the TREC method to generate an initial solution. Then the initial solution is modified to meet the mass conservation law and to minimize a variational functional. In [16], Tuttle and Gall use echo-tracking algorithms to estimate the wind fields of tropical cyclones. Their method also uses two consecutive radar data frames to calculate the wind field. Their method is only suited to tropical cyclones, since the patterns of tropical cyclones are clear and the maximum speeds can be estimated. In [17], Shapiro et al. assume that the wind field is a linear function of time. The equation of the radial component of wind is derived under the mass conservation law, and the wind field is then calculated by minimizing a cost function.
3 Filtering and Re-sampling for Radar Data
In gathering weather data, a Doppler radar rotates 360 degrees about the vertical axis. For each degree of angle, the radar sends out a ray to sense the precipitation in the air. The reflectivities are measured at equidistant points along each ray. After completing a circular scan, the radar raises its elevation angle by several degrees and performs the circular scan again [18]. Since the sample points are located on multiple layers of concentric conical surfaces, the radar data are naturally given in polar coordinates, as shown in part (a) of Fig. 2. This prohibits efficient calculation and display of wind fields. Thus, in our work, the radar data are resampled onto a regular grid before computing the wind field. Raw radar data may contain noise. For example, when a radar wave hits mountains, it produces significant reflectivities along the ray. Consequently, a stationary radial line appears in the output data. Noise deteriorates the accuracy of the wind field computation and should be reduced beforehand.

3.1 Filtering Procedures
To eliminate the signals echoed by high mountains, the height field of the terrain is taken as a reference. If the path of a radar ray is blocked by a mountain, the radar data gathered by this ray beyond the mountain are ignored. This process is called height field filtering. After the height field filtering, a median filter is used to eliminate extreme values caused by mechanical errors. In the median filter, the reflectivity of a sample point is computed by:

    ρ̃(r, θi, α) = Median( ρ(r, θi+p, α) ),   p = −k, ..., k,
Fig. 2. Preprocessing of Doppler radar data: (a) the scan of a Doppler radar and the raw data distribution, (b) the median filter, (c) the re-sampled data at different layers, (d) the Gaussian filter masks
where ρ̃ and ρ are the resampled and original reflectivities, and (r, θi, α) are the polar coordinates of the sample point, in which r is the distance from the radar to the point, θi is the azimuth angle, and α is the elevation angle. The new reflectivity of the sample point is set to the median of the reflectivities of the 2k + 1 nearest points on the same circle of the same conical surface, as shown in part (b) of Fig. 2. Some images are presented in Fig. 3 to demonstrate the effects of the filters. Part (a) of the figure contains the raw data. The radar data produced by the height field filter are displayed in part (b). Then the radar data are smoothed by using the median filter. The results are shown in part (c). It should be noted that a hole emerges in the center of the images. This hole is called the cone of silence. It is where the radar is located. Since the radar cannot raise its antenna high enough, no data are gathered within this cone.
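As a small illustration of the azimuthal median filter above, the following Python/NumPy sketch applies the 2k + 1 neighbourhood along the azimuth of one elevation sweep; the array layout and the wrap-around at 360° are assumptions for illustration, not the authors' code.

```python
import numpy as np

def azimuthal_median_filter(reflectivity, k=2):
    """Median-filter a single elevation sweep along the azimuth direction.

    reflectivity : 2D array indexed as [azimuth, range], with 360 azimuth rays.
    Each sample is replaced by the median of the 2k + 1 samples at the same
    range on the neighbouring rays (wrapping around at 360 degrees).
    """
    shifted = np.stack([np.roll(reflectivity, p, axis=0) for p in range(-k, k + 1)])
    return np.median(shifted, axis=0)
```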
3.2 Re-sampling
After the pre-filtering, the radar data set is re-sampled on a 3D regular grid. The 3D grid comprises 32 layers of horizontal grids. Each horizontal grid contains 512 × 512 sample points. The vertical gap between the layers of grids is 0.3 km, and the grid-line spacing in the horizontal grids is 0.5 km. The 3D grid covers an area of 256 km × 256 km horizontally, while its vertical extent is less than 10 km. Since the wind field higher than 10 km has little influence on the ground, the radar data higher than 10 km are ignored. The 3D grid is shown in part (c) of Fig. 2. The x and y axes span the horizontal space, and the z axis points vertically. To resample the radar reflectivity, the Cartesian coordinates of each grid point are transformed into the polar coordinate system. Reflectivities are computed by using tri-linear interpolation performed in the polar coordinate system. Once the re-sampling is completed, a Gaussian filter is utilized to smooth out noise and enhance cloud structures. A conventional Gaussian filter may blur the data and cannot achieve these goals simultaneously.
Fig. 3. Filtering of radar data. (a) the raw data, (b) results of the height field filter, (c) radar data after applying median filtering, (d) results produced by Gaussian filter.
It has been proved in [19] that scaling down the Gaussian filter mask in the direction of highest variation preserves edges and damps noise in the other directions. In the work of [7], diffusions are used to adjust the variance of the Gaussian filter to enhance cloud masses. Based on these results, we design an adaptive Gaussian filter for smoothing radar data. Let I denote the reflectivity. The horizontal gradient of I is computed by:

    ∇I = [∂I/∂x, ∂I/∂y]^T.
The gradient shows the direction and tendency of the variation of reflectivity. If the gradient magnitude at a sample point is larger than a predefined threshold, the sample point is at an edge, for example point P2 in part (d) of Fig. 2. In the filtering process, the filter mask is rotated such that its y axis is aligned with the gradient direction. Then a scaling factor Sf is computed by:

    Sf = max(1, ln(g/g0)),

where g and g0 are the gradient magnitude and the threshold. The filter mask is scaled down by a factor of Sf in the y direction.
In the regions with small gradients, reflectivity diffusions are utilized to adjust the Gaussian filter. The horizontal diffusion of I is defined by:

    ∇²I = ∂²I/∂x² + ∂²I/∂y².
The diffusion magnitude at a position x represents the concentration of reflectivity there [20]. If it is large, the reflectivity is denser at x than in the surrounding area, for example point P3 in part (d) of Fig. 2. Then the variance of the Gaussian filter is reduced to allow more high-frequency signals to pass the filter such that the cloud mass is preserved. Initially, the standard deviation σ0 of the Gaussian filter is set to 1, and a threshold of diffusion is selected by the users. Assume the threshold of diffusion is D0. If the diffusion magnitude at x is Df, the deviation σ is adjusted by:

    σ = σ0 / max(1, ln(Df/D0)).

Thus, larger diffusions result in smaller variances. The scalar 1 is used to ensure that the deviation is not larger than σ0. Part (d) of Fig. 3 shows the results produced by using the Gaussian filter.
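To make the adaptive filtering concrete, the sketch below computes the weight of one mask entry. It is a minimal Python/NumPy illustration; the per-sample interface, the thresholds g0 and D0, and the explicit rotation of the offset into the gradient frame are assumptions rather than the authors' implementation.

```python
import numpy as np

def adaptive_gaussian_weight(dx, dy, grad, lap, g0, d0, sigma0=1.0):
    """Weight of a neighbour at offset (dx, dy) from the centre sample.

    grad : local gradient vector (gx, gy); lap : local diffusion (Laplacian) magnitude.
    At edges (|grad| > g0) the mask is scaled down along the gradient direction;
    where the diffusion exceeds d0 the standard deviation is reduced instead.
    """
    gx, gy = grad
    g = np.hypot(gx, gy)
    sigma = sigma0
    if g > g0:                                   # edge: shrink mask along the gradient
        sf = max(1.0, np.log(g / g0))
        along = (dx * gx + dy * gy) / g          # offset component along the gradient
        across = (-dx * gy + dy * gx) / g        # offset component across the gradient
        dx, dy = across, along * sf              # scaling by sf compresses the mask
    elif lap > d0:                               # dense cloud mass: smaller variance
        sigma = sigma0 / max(1.0, np.log(lap / d0))
    return np.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
```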
4 Wind Field Computation
Because of the lack of key information, estimating the wind field for Doppler radar data is difficult. Fortunately, the vertical component of the wind field is relatively small. Computing the horizontal wind field is more practical and feasible for weather forecasting. In the traditional methods, the horizontal wind field is calculated by tracking the movement of cloud particles. These algorithms cannot calculate wind fields in regions containing no cloud. Secondly, these procedures estimate the mean wind velocity at the center of a rectangle by using statistical strategies. If the cloud moves farther than the width of the region between two frames, these algorithms will fail. Furthermore, cloud masses are not rigid; they deform as they travel with the wind, which makes cloud particles hard to track. Optical flow methods have been used in the computer vision community to estimate velocities of objects. The strategy of plain optical flow methods is similar to that of the traditional horizontal wind field computing methods, so, used directly, they are not well suited for estimating horizontal wind fields. In our work, a hierarchical optical flow method [21] is therefore employed for solving the horizontal wind field. The flow chart of the procedure is shown in part (a) of Fig. 4. At first, two consecutive radar data sets obtained at time points t0 and t1 are filtered and resampled to create two 3D volume data sets. Then users are asked to select a horizontal layer of data from each of these volume data sets. The two data frames are measured at the same elevation but at different time points. They serve as the cloud images for the optical flow computation. Then the two cloud images are down-sampled by using a box filter to create two pyramids of cloud images. The bottom layers of the pyramids contain the coarsest cloud images while the finest cloud images are at the top-most layers.
Fig. 4. (a) The hierarchical optical flow procedure: radar data are down-sampled to create two image pyramids; the wind field is computed, up-sampled, and refined bottom-up. (b) Flowchart of the optical flow method for wind field computing.
The down-sampling ratio is 2 : 1 in both the x and y directions. The down-sampling is terminated when the maximum wind displacement is less than a few pixels in the coarsest cloud images, so that a cloud particle does not move too far to be tracked. The down-sampling also spreads clouds from the regions with dense clouds to the empty regions in the lower-resolution cloud images such that the wind field in the empty regions can be computed. Then the wind field is solved by using the coarsest cloud images based on the optical flow method proposed in [22]. Let u and v be the wind field components. They are computed by solving the following equations [23]:

    (∂I/∂x) u + (∂I/∂y) v = −∂I/∂t,
    ∇²u = 0,   ∇²v = 0,

where t represents the time variable. The first equation is the optical flow constraint. It ensures that the cloud intensity is invariant between the two images. The second and third equations serve as smoothness constraints. After being computed, the coarsest wind field is up-sampled by one level to serve as the initial wind field for the next-level cloud images. The magnitude of the up-sampled wind field must be multiplied by 2, since the grid-line gap is shrunk by a factor of 2 in the finer images. The resulting wind field is used to warp the cloud image of time t0. Then the temporal derivatives of the cloud images are computed and a least-squares method is applied to modify the initial wind field. This image warping and wind field modification process is repeated until the wind field converges, as shown in part (b) of Fig. 4. Once the wind field of the finer cloud images is computed, the wind field is up-sampled again onto the grid of the next finer cloud images, and the computation is repeated until the wind field of the finest cloud images is computed. In Fig. 5, an example of the optical flow computation is shown. The first two images are the cloud images. The computed wind field is depicted in the third image.
Fig. 5. Wind field computed by using the hierarchical optical flow method: (a) cloud image at t0, (b) cloud image at t1, (c) the computed wind field
White arrows are used to display the directions and magnitudes of the wind field. These arrows are super-imposed on the first cloud image to show the motion of the clouds.
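As an illustration of the coarse-to-fine scheme of this section, the sketch below wraps a Horn–Schunck-style solver of the above equations in a 2 : 1 image pyramid. It is a simplified Python/NumPy version, not the authors' implementation: the smoothness weight, iteration counts, block-average down-sampling (which assumes image sizes divisible by a power of two), and the omission of the explicit image-warping step are all simplifications made for brevity.

```python
import numpy as np

def horn_schunck(I0, I1, u, v, alpha=10.0, n_iter=100):
    """Refine a wind field (u, v) so that Ix*u + Iy*v + It ~ 0 with smooth u, v."""
    Iy, Ix = np.gradient(I0)                 # spatial derivatives (rows = y, cols = x)
    It = I1 - I0                             # temporal derivative
    neighbour_avg = lambda f: (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                               np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 4.0
    for _ in range(n_iter):
        u_bar, v_bar = neighbour_avg(u), neighbour_avg(v)
        num = Ix * u_bar + Iy * v_bar + It
        den = alpha ** 2 + Ix ** 2 + Iy ** 2
        u = u_bar - Ix * num / den
        v = v_bar - Iy * num / den
    return u, v

def hierarchical_flow(I0, I1, n_levels=3):
    """Coarse-to-fine wind field estimation on a 2:1 cloud-image pyramid."""
    downsample = lambda f: f.reshape(f.shape[0] // 2, 2, f.shape[1] // 2, 2).mean(axis=(1, 3))
    pyr0, pyr1 = [I0], [I1]
    for _ in range(n_levels - 1):
        pyr0.append(downsample(pyr0[-1]))
        pyr1.append(downsample(pyr1[-1]))
    u = v = np.zeros_like(pyr0[-1])
    for a, b in zip(reversed(pyr0), reversed(pyr1)):   # coarsest level first
        if u.shape != a.shape:
            u = 2.0 * np.kron(u, np.ones((2, 2)))      # up-sample and rescale by 2
            v = 2.0 * np.kron(v, np.ones((2, 2)))
        u, v = horn_schunck(a, b, u, v)
    return u, v
```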
5 Evenly-Spaced Streamline Tracing
In [24], a multiresolution streamline generation method is presented. The authors use different separating distances to create streamlines at different levels. These streamlines are kept but not rendered until the visualization stage. In our streamline tracing method, we use regular grids of different resolutions to control the streamline density. At first, multiple levels of grids are super-imposed on the domain to split the domain into cells, as shown in Fig. 7. These grids are used for seeding and for controlling the streamline density. The streamline tracing starts at the coarsest grid. To trace a streamline, a seed is placed at the center of an unvisited cell. A cell is unvisited if it has not been penetrated by any streamline. Then, the streamline is integrated by using the RK2 method. The streamline is terminated if it advances into the vicinity of other streamlines, enters a critical point, or leaves the domain. Once the streamlines seeded in the coarsest grid are integrated, the streamline tracing is carried out in the next-level grid. The whole process repeats until the streamlines in the finest grid are computed. The pseudo-code of the multi-level streamline construction procedure is presented in Fig. 6. Our algorithm allows users to achieve multi-resolution flow visualization by rendering the streamlines traced at different levels. To control the separating distances between streamlines, each cell is associated with a flag which records the average position of the sample points inside the cell. When a new sample point is calculated, the cell containing the point is examined. If it has been visited by another streamline, this point is neglected and the streamline is ended. Otherwise, the distances between the sample point and the average streamline positions of the neighboring cells are calculated.
proc hierarchical_streamline() {
    create_multilevel_grids(numLevel, grids);
    for (each level) do {
        trace_one_level_streamlines(level);
        filter_short_streamlines(level);
        set_next_level_flag(level+1);
        extend_streamlines(level+1);
    }
}

proc trace_one_level_streamlines(level) {
    for (each cell) do {
        if (unmarked(cell) == True) {
            seed = create_seed(cell);
            if (is_valid(seed) == True)
                trace_streamline(seed);
        }
    }
}
Fig. 6. Procedure of the multi-level streamline construction
(a) seeds of 1st level grid; (b) seeds of 1st & 2nd level grids; (c) remaining seeds after tracing 1st level streamlines
Fig. 7. Seeding and space-control by using multi-level grids for streamline tracing. (a) Seeding by using the coarsest grid, (b) seeds of the 2nd level grid, and (c) remaining seeds and space after integrating the first level streamlines.
(a) level1, resolution=7x7
(b) level2, resolution=13x13
(c) level3, resolution=38x38
Fig. 8. Sample results of the multi-level streamline tracing. (a) Results by using the coarsest grid, (b) results by using the 1st and 2nd level grids, and (c) results by using 3 levels of grids.
is less than one half of the cell width, the streamline is terminated. Fig. 7 shows the seeding by using two levels of grids. When the streamline tracing in the coarsest grid is completed, the finer grid is utilized to place seeds and trace streamlines in the empty space left by the 1st level streamlines.
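To make the cell-based density control concrete, the following Python sketch (our own simplification, not the authors' implementation) integrates a single streamline with the RK2 scheme and applies the termination rules just described. The velocity sampler sample_velocity, the cell dictionaries and the streamline id sl_id are assumed bookkeeping, and the handling of multiple grid levels is omitted.

import numpy as np

def trace_streamline(seed, sl_id, sample_velocity, cell_size, occupied, cell_average,
                     step=0.5, max_steps=2000):
    # sample_velocity(p) -> 2D velocity at p, or None outside the domain
    # (assumed helper, e.g. bilinear interpolation of the wind field).
    # occupied maps a cell index to the id of the streamline that owns it;
    # cell_average maps a cell index to (sample count, mean sample position).
    points = [np.asarray(seed, dtype=float)]
    for _ in range(max_steps):
        p = points[-1]
        v1 = sample_velocity(p)
        if v1 is None or np.linalg.norm(v1) < 1e-9:     # left domain or critical point
            break
        mid = p + 0.5 * step * v1 / np.linalg.norm(v1)  # RK2 (midpoint) step
        v2 = sample_velocity(mid)
        if v2 is None or np.linalg.norm(v2) < 1e-9:
            break
        q = p + step * v2 / np.linalg.norm(v2)
        cell = tuple((q // cell_size).astype(int))
        if occupied.get(cell, sl_id) != sl_id:          # cell claimed by another line
            break
        # Terminate if q comes closer than half a cell width to the average
        # position stored in a neighboring cell owned by another streamline.
        neighbors = [(cell[0] + di, cell[1] + dj)
                     for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
        if any(occupied.get(nb) not in (None, sl_id) and
               np.linalg.norm(q - cell_average[nb][1]) < 0.5 * cell_size
               for nb in neighbors):
            break
        # Claim the cell and update its running average of sample positions.
        occupied[cell] = sl_id
        count, mean = cell_average.get(cell, (0, np.zeros(2)))
        cell_average[cell] = (count + 1, (count * mean + q) / (count + 1))
        points.append(q)
    return np.array(points)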
(a) streamline image
(b) streamline image
(c) LIC image
Fig. 9. Streamline images of a typhoon wind field. (a) at time = 0 minutes, (b) at time = 21 minutes, (c) LIC image of the wind field.
Long streamlines are preferred in flow visualization, since they reveal flow patterns better. In our work, short streamlines are deleted to preserve space for long streamlines in each iteration of streamline tracing. After purging short streamlines, the remaining streamlines are extended to occupy the space left by the short streamlines. Therefore, more long streamlines are produced. A set of images is shown in Fig. 8 to demonstrate the progression of the streamline tracing. In part (a), the streamlines traced in the coarsest grid are shown. Part (b) contains the streamlines integrated by using the 1st and 2nd level grids. The streamlines traced by using all grids are displayed in part (c). The grid resolutions are 7 × 7, 13 × 13, and 38 × 38 respectively.
6 Visualization Results and Conclusion
We test our system by using a radar data set of a typhoon passing through the southern tip of Taiwan in August 2003. The radar scans the sky at 9 different elevation angles. For each elevation angle, the radar antenna rotates 360 degrees and sends out a ray for each azimuth angle. Radar reflectivities are sampled at 923 equidistant points along each ray. In one scan, the radar generates 9 × 360 × 923 sample points. The radar data set is filtered and resampled. The horizontal wind field at the elevation of 2 km above sea level is computed. Streamline images are created to show the topology of the wind field. Wind magnitudes are encoded by colors. Two streamline images are presented in Fig. 9. In the first image, a saddle line is clearly displayed about the centerline of the island, which is a mountainous area. The wind speed is relatively low there. After 21 minutes, the wind field changes and the associated streamline image is shown in part (b). The image shows that a sub-low-pressure center is evolving near the south-east coastline, and the wind field pattern becomes more complicated. If the finest grid resolution in the streamline tracing is set to that of the streamline images, an LIC image is produced. One LIC image is shown in part (c) of Fig. 9. The streamlines are convolved with the speed of wind to illustrate
the wind speed in each region. Based on the results, our system is shown to be capable of extracting and visualizing wind fields from Doppler radar data. It produces useful wind field information for weather forecasting and related applications. It can be coupled with radar stations and commercial visualization systems to collect, display, and analyze meteorological data.
References 1. McCaslin, P.T., McDonald, P.A., Szoke, E.J.: 3D Visualization Development at NOAA Forecast Systems Laboratory. ACM Computer Graphics 34, 41–44 (2000) 2. Hibbard, W.L.: 4-D Display of Meteorological Data. In: Proceedings of the 1986 Workshop on Interactive 3D Graphics, pp. 23–36 (1986) 3. Hibbard, W.: Large Operational User of Visualization. ACM SIGGRAPH Computer Graphics 37, 5–9 (2003) 4. Gerstner, T., Meetschen, D., Crewell, S., Griebel, M., Simmer, C.: A Case Study on Multiresolution Visualization of Local Rainfall from Weather Radar Measurements. In: Proceedings of IEEE Visualization 2002, pp. 533–536 (2002) 5. Djurcilov, S., Pang, A.: Visualizing Gridded Datasets with Large Number of Missing Values (Case Study). In: Proceedings of IEEE Visualization 1999, pp. 405–408 (1999) 6. Jang, J., Ribarsky, W., Shaw, C., Faust, N.: View-Dependent Multiresolution Splatting of Non-Uniform Data. In: Proceedings of IEEE TVCG Symposium on Visualization 2002, pp. 125–132 (2002) 7. Ueng, S.K., Wang, S.C.: Interpolation and Visualization for Advected Scalar Fields. In: Proceedings of IEEE Visualization 2005, pp. 615–622 (2005) 8. Stalling, D., Hege, H.C.: Fast and Resolution Independent Line Integral Convolution. In: ACM SIGGRAPH 1995, pp. 249–258 (1995) 9. Jobard, B., Lefer, W.: Creating Evenly-Spaced Streamlines of Arbitrary Density. In: Eurographics Workshop on Visualization in Scientific Computing 1997, pp. 45–55 (1997) 10. Mebarki, A., Alliez, P., Devillers, O.: Farthest Point Seeding for Efficient Placement of Streamlines. In: IEEE Visualization 2005, pp. 479–486 (2005) 11. Turk, G., Bank, D.: Image-guided streamline placement. In: ACM SIGGRAPH 1996, pp. 453–460 (1996) 12. Chong, M., Georgis, J.F., Bousquet, O., Brodzik, S.R., Burghart, C., Cosma, S., Germann, U., Gouget, V., Houze, R.A., James, C.N., Prieur, S., Rotunno, R., Froux, F., Vivekanandan, Zeng, Z.X.: Real-Time Wind Synthesis from Doppler Radar Observations during the Mesoscale Alpine Programme. Bulletin of the American Meteorological Society 81, 2953–2962 (2000) 13. Friedrich, K., Hagen, M.: On the use of advanced Doppler radar techniques to determine horizontal wind fields for operational weather surveillance. Meteorological Applications 11, 155–171 (2004) 14. Rinehart, R.E.: A Pattern Recognition Technique for Use with Conventional Weather Radar to Determine Internal Storm Motions. Recent Progress in Radar Meteorology. Atmos. Technol., 119–134 (1981) 15. Li, L., Schmid, W., Joss, J.: Nowcasting of Motion and Growth of Precipitation with Radar over a Complex Orography. Journal of Applied Meteorology 34, 1286– 1300 (1995)
16. Tuttle, J., Gall, R.: A Single-Radar Technique for Estimating the Winds in Tropical Cyclones. Bulletin of the American Meteorological Society 80, 653–668 (1999) 17. Shapiro, A., Robinson, P., Wurman, J., Gao, J.: Single-Doppler Velocity Retrieval with Rapid-Scan Radar Data. Journal of Atmospheric and Oceanic Technology 20, 1578–1595 (2003) 18. Collier, C.G.: Application of Weather Radar Systems: A Guide to Uses of Radar Data. Halsted Press (1989) 19. Ueng, S.K., Cheng, H.P., Lu, R.Y.: An Adaptive Gauss Filtering Method. In: Proceedings of IEEE Pacific Visualization Symposium 2008, pp. 127–134 (2008) 20. Davis, H.F., Snider, A.D.: Introduction to Vector Analysis. Wm. C. Brown Publisher (1991) 21. Beauchemin, S.S., Barron, J.L.: The Computation of Optical Flow. ACM Computing Surveys 27, 433–467 (1995) 22. Lucas, B., Kanade, T.: An Iterative Image Registration Technique with an application to Stereo Vision. In: International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) 23. Bowler, N., Pierce, C.: Development of a Short QPF Algorithm Based upon Optic Flow Techniques. Technical report, Met Office, United Kingdom (2002) 24. Jobard, B., Lefer, W.: Multiresolution Flow Visualization. In: Proceedings of WSCG 2001 (2001)
Dual Marching Tetrahedra: Contouring in the Tetrahedronal Environment Gregory M. Nielson Arizona State University
Abstract. We discuss the dual marching tetrahedra (DMT) method. The DMT can be viewed as a generalization of the classical cuberille method of Chen et al. to a tetrahedronal. The cuberille method produces a rendering of quadrilaterals comprising a surface that separates voxels deemed to be contained in an object of interest from those voxels not in the object. A cuberille is a region of 3D space partitioned into cubes. A tetrahedronal is a region of 3D space decomposed into tetrahedra. The DMT method generalizes the cuberille method from cubes to tetrahedra and corrects a fundamental problem of the original cuberille method where separating surfaces are not necessarily manifolds. For binary segmented data, we propose a method for computing the location of vertices that is based upon the use of a minimal discrete norm curvature criterion. For applications where dependent function values are given at grid points, two alternative methods for computing vertex positions are discussed and compared. Examples are drawn from a variety of applications, including the Yes/No/Don't_Know data sets resulting from inconclusive segmentation processes and Well-Log data sets.
1 Introduction The original cuberille method [3] is primarily concerned with the rendering of a surface that separates the voxels that are part of an object from the voxels that are not part of the object. Due to the widespread need for such techniques, there continues to be a fair amount of published literature on this topic. See [2], [4], [6], [9], [13] and [19] for example. Here we discuss the tetrahedronal version of this method or, in other words, the dual marching tetrahedra (DMT) method. In addition, here we are not only interested in the rendering, but we are also interested in methods that will produce the geometry consisting of a polygon mesh representation of the separating surface. This geometry can not only be used for rendering, but it also allows for the efficient application of surface parameterizations, curvature texture applications and many other geometry processing tools. The next four paragraphs constitute an annotated outline of the present paper. 1. As motivation and background, we first discuss the polygon mesh surface which would result from the application of the cuberille method. Among other properties, it is noted that this surface is not always guaranteed to be a manifold. This may not be a problem when only a rendering of the surface is required, but geometric processing of the polygon mesh surfaces such as parameterization, volume inside/outside determination or curvature computations cannot be applied to nonmanifold surfaces.
2. Next, we discuss the dual marching tetrahedra method, DMT, which produces a polygon mesh surface with a vertex lying in each tetrahedron that has both marked and unmarked grid points. In Section 2.1, we describe a method for computing the actual vertex positions which is based upon discrete norm curvature. In Section 3, we show examples where the new tetrahedronal method is applied to the general class of segmented data sets called "Yes/No/Don't_Know". We also illustrate the application of the tetrahedronal method to the general class of data often called "Well-Log" data. 3. Using methods of decomposing cubes into tetrahedra without using additional grid points, we also show in Section 3 how to apply the DMT method to the more conventional rectilinear lattice data or cuberille data. 4. Next, we note that it is very simple to extend the application of the method from binary classified grid points to the case where there are dependent function values given at every grid point. The generalization to this type of data is accomplished by using thresholding to create a binary classification and so the basic method immediately applies. In addition to the minimum discrete norm curvature method for computing vertex positions, in Section 4 we describe two methods based upon dependent function values. We note the connection to the marching tetrahedra (MT) method through its dual graph and compare results.
2 Background and Motivation Based Upon the Cuberille Method As we previously mentioned, much of the motivation for the DMT method is based upon the original Cuberille Method which is described in [3]. This approach to rendering surface-bounded objects from computed tomography (CT) data is classical due to the simplicity of the method. A cuberille is a dissection of space into cubes called voxels. The total collection of all of these voxels is segmented into two groups, namely those voxels that belong to an object of interest and those voxels that do not belong to this object. A rendering of the surface bounding the object is accomplished by rendering all of the directed faces which separate the voxels of the object from those of the background. The process of determining these distinguished voxels that belong to the object is often referred to as segmentation. See [11], [13], [21] and [23]. While the objective of the original cuberille method was mainly the rendered image of the separating surface, here we are also interested in obtaining a representation of the separating surface as a polygon mesh surface. This geometry not only can be used for rendering of the object, but it can also facilitate many subsequent geometry processing operations involving the object of interest, such as parameterizing or computing the volume or the area of the separating surface. The collection of centroids of the voxels forms a three-dimensional rectilinear lattice we denote by $P_{ijk}$, $i = 1, \dots, N_x$, $j = 1, \dots, N_y$, $k = 1, \dots, N_z$. The cubes with corners from this lattice
are denoted by $C_{ijk}$. The centers of the cubes of the lattice points are the "corners" of the voxels. The separating surface $S$ is a polygon surface comprised of quadrilaterals with vertices $V_a$, $a = 1, \dots, N$, taken as the centroids of the cubes $C_{ijk}$ where at least one of the voxels intersecting $C_{ijk}$ is in the object and one is not in the object.
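The face-extraction step of the cuberille idea is straightforward to state in code. The following Python sketch is our own illustration (not the implementation of [3]): it walks a binary voxel array and emits one quadrilateral for every pair of face-adjacent voxels whose in/out labels differ, placing the quad halfway between the two voxel centers. Orienting each quad toward the outside voxel, as needed for shading, is omitted.

import numpy as np

def cuberille_quads(inside):
    # inside: 3D boolean array, True where the voxel belongs to the object.
    # Returns a list of quads, each given as four corner coordinates.
    # Corner offsets (relative to the first voxel's center) of the separating
    # face for each of the three axis directions.
    face_corners = {
        0: [(0.5, -0.5, -0.5), (0.5, 0.5, -0.5), (0.5, 0.5, 0.5), (0.5, -0.5, 0.5)],
        1: [(-0.5, 0.5, -0.5), (0.5, 0.5, -0.5), (0.5, 0.5, 0.5), (-0.5, 0.5, 0.5)],
        2: [(-0.5, -0.5, 0.5), (0.5, -0.5, 0.5), (0.5, 0.5, 0.5), (-0.5, 0.5, 0.5)],
    }
    offsets = [np.array(o) for o in ((1, 0, 0), (0, 1, 0), (0, 0, 1))]
    quads = []
    nx, ny, nz = inside.shape
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                for axis, off in enumerate(offsets):
                    ni, nj, nk = i + off[0], j + off[1], k + off[2]
                    if ni < nx and nj < ny and nk < nz and \
                       inside[i, j, k] != inside[ni, nj, nk]:
                        center = np.array([i, j, k], dtype=float)
                        quads.append([center + np.array(c)
                                      for c in face_corners[axis]])
    return quads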
The topology (edge connectivity) of the vertices is determined by the various configurations shown in Figure 1. At the onset, there are a total of 256 = 2^8 possible cases to consider, but with the use of rotations, the number of cases is reduced to 23 equivalence classes with representers of each class shown in Figure 1. See [17]. As can be easily seen from this table, nonmanifold surfaces are produced by configurations C3, C6, C7, C10, C12, C13, C15, C16 and C19 (where C0 is upper left and C22 is lower right). While these configurations are not overly abundant in typical applications, they do occur, on the average, approximately 0.07%, 0.08%, 0.015%, 0.025%, 0.05%, 0.00005%, 0.015%, 0.08%, 0.07% respectively.
Fig. 1. The figure on the left shows the connectivity of the quad patches of the cuberille method for the various configurations. Each configuration is a representer of an equivalence class determined by rigid rotations of the 256 possible cases. See [17] for a table indicating the rotation group element, the representer and the cases. An example of the cuberille method is shown in the top, right image where a tumor has been segmented from the brain. The data is the Harvard brain tumor data (see [11], [23]) available from the Surgical Planning Laboratory at Brigham and Women's Hospital, www.spl.harvard.edu, and consists of an array of size 124 × 256 × 256. The bottom image is a zoom-in of the top image so that the voxel features of the image are clearly visible.
2.1 The Dual Marching Tetrahedra Method
Let $P_i = (x_i, y_i, z_i)$, $i = 1, \dots, N$, denote the grid points which are segmented as marked or unmarked. We assume these points are not collectively coplanar. We assume that the grid points have been arranged into a collection of tetrahedra to form a tetrahedronal. A tetrahedronal consists of a list of 4-tuples which we denote by $I_t$. Each 4-tuple $ijkl \in I_t$ denotes a single tetrahedron with the four vertices $P_i, P_j, P_k, P_l$, which is denoted as $T_{ijkl}$. A valid tetrahedronal requires: i) No tetrahedron $T_{ijkl}$, $ijkl \in I_t$, is degenerate, i.e. the points $P_i, P_j, P_k, P_l$ are not coplanar, ii) The interiors of any two
tetrahedra do not intersect and iii) The boundary of two tetrahedra can intersect only at a common triangular face. See [16] for a survey of methods for computing optimal (Delaunay) tetrahedronals of the convex hull of scattered 3D point sets.
Fig. 2. The three active cases of the DMT method. From left to right: one, two and three points classified as being contained in the object of interest. The lower image illustrates the notation of the tetrahedronal method.
A tetrahedron is said to be active if among its 4 grid points there are both marked grid points and unmarked grid points. There are three distinct configurations for these active tetrahedra as shown in Figure 2. Similarly, a triangular face of an active tetrahedron is active provided it contains both marked and unmarked grid points. Interior to each active tetrahedron, $T_{n_i n_j n_k n_l}$, there is a vertex $V_{n_i n_j n_k n_l}$. For each interior active triangular face, there is an edge of $S$ joining the two vertices of the two tetrahedra sharing this triangular face. If $F_{n_i n_j n_k}$ denotes the interior active triangular face and $T_{n_i n_j n_k n_a}$ and $T_{n_i n_j n_k n_b}$ denote the two active tetrahedra sharing this face, then an edge joins $V_{n_i n_j n_k n_a}$ and $V_{n_i n_j n_k n_b}$. In addition to the surface nets approach [9], we propose a scheme for computing the positions of the vertices that is based upon minimizing discrete norm curvature estimates at each vertex, which we now describe. For the tetrahedron defined by the points $P_i$, $P_j$, $P_k$, $P_l$ we consider the tetrahedra lattice points
$$V_{a,b,c,d} = (aP_i + bP_j + cP_k + dP_l)/N,$$
where the integers $a, b, c, d$ satisfy the two conditions $0 < a, b, c, d < N$ and $a + b + c + d = N$. For each point $V_{a,b,c,d}$ in $T_{i,j,k,l}$ we use the discrete curvature methods of Dyn et al. [8] and Kim et al. [12] to compute an estimate of the norm curvature
$$k_1^2(S) + k_2^2(S) = 4[M(S)]^2 - 2K(S),$$
where $k_1(S)$ and $k_2(S)$ are the principal curvatures, $M(S)$ is the mean curvature and $K(S)$ is the Gaussian curvature. These estimates are based upon a triangulation of $V_{a,b,c,d}$ and the vertices of its 1-ring that maintains the edges of the separating surface $S$ and does not introduce any additional edges containing $V_{a,b,c,d}$. The estimates are computed as
$$K(S) = 3\Big(2\pi - \sum \alpha_i\Big)/A, \qquad M(S) = 0.75 \sum e_i \beta_i / A,$$
where $e_i$ is an edge joining $V_{a,b,c,d}$ and a vertex of its 1-ring, $\beta_i$ is its dihedral angle, $\alpha_i$ is a subtended angle and $A$ is the sum of the areas of all the adjacent triangles. We take as our first approximation $V_{i,j,k,l}^{(1)}$ the point $V_{a,b,c,d}$ associated with the smallest estimate of norm curvature. These values are computed for all tetrahedra containing vertices of the separating surface $S$. We do another pass over all of the tetrahedra containing vertices of $S$, leading to the approximations $V_{i,j,k,l}^{(2)}$. This is continued until the user-specified criterion for convergence is satisfied. In practice, usually 7 or 8 digits of accuracy are obtained in less than 6 iterations (a complete loop through all active tetrahedra). A resolution of $N = 5, \dots, 9$ is a typical choice.
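The curvature estimates above are simple to evaluate once the 1-ring of a candidate point is known. The Python sketch below is our own illustration under the stated formulas; the fan representation of the 1-ring and the convention used for the dihedral angle are our assumptions, not prescriptions from [8] or [12].

import numpy as np

def norm_curvature_estimate(v, ring):
    # v: candidate point position (3,); ring: ordered 1-ring neighbor positions
    # (n, 3) forming the closed triangle fan (v, ring[i], ring[i+1]).
    v = np.asarray(v, dtype=float)
    ring = np.asarray(ring, dtype=float)
    n = len(ring)
    area = 0.0
    angle_sum = 0.0            # sum of the angles alpha_i subtended at v
    normals = []
    for i in range(n):
        p, q = ring[i], ring[(i + 1) % n]
        e1, e2 = p - v, q - v
        cross = np.cross(e1, e2)
        area += 0.5 * np.linalg.norm(cross)
        cosang = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
        angle_sum += np.arccos(np.clip(cosang, -1.0, 1.0))
        normals.append(cross / np.linalg.norm(cross))
    # K(S) = 3(2*pi - sum alpha_i) / A  (angle-deficit estimate)
    K = 3.0 * (2.0 * np.pi - angle_sum) / area
    # M(S) = 0.75 * sum(|e_i| * beta_i) / A, with beta_i taken here as the angle
    # between the normals of the two fan triangles sharing edge (v, ring[i]).
    M = 0.0
    for i in range(n):
        beta = np.arccos(np.clip(np.dot(normals[i - 1], normals[i]), -1.0, 1.0))
        M += np.linalg.norm(ring[i] - v) * beta
    M = 0.75 * M / area
    # Discrete estimate of k1^2 + k2^2 = 4*M^2 - 2*K.
    return 4.0 * M * M - 2.0 * K

Evaluating this estimate at the lattice points $V_{a,b,c,d}$ of an active tetrahedron and keeping the minimizer gives the vertex-selection step described above.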
3 Application to the General Types of Data Sets: Yes/No/Don’t_Know, Well-Log and Cuberille In this section we describe three fairly widely observed types of data sets for which the DMT method applies. These are the so-called Yes/No/Don’t_Know, Well-Log and cuberille data sets. The first type is from the application of segmentation results applied to conventional rectilinear data and the second and third types result from some very common types of methods for taking samples. 3.1 Yes/No/Don’t_Know Data Sets
MRI and CAT scan data sets can be viewed as samples of a function defined over a rectilinear lattice. Often segmentation algorithms are invoked to determine which of the lattice points belong to a certain specified type and which do not. For example, we may wish to determine which lattice points of an MRI scan are brain matter and which are not. See [11] and [23] for more discussion on this. Quite often, segmentation algorithms require the user to specify the values for certain parameters. Even if domain experts are used to "tune" these algorithms, the results can be, in some applications, "inconclusive". This means that, for some lattice grid points, the algorithm reports that these lattice points definitely belong to the object, for others it reports that these lattice points definitely do not belong to the object. But for some points the algorithm cannot definitely report one way or the other, and so these lattice points could or could not be part of the object of interest. The technique we present for handling this type of data is illustrated in Figure 3. The left image illustrates with the black circles the lattice points known to be part of the object and the white circles indicate lattice points which are definitely not part of the object. The gray circles indicate the "Don't_Know" lattice points where the segmentation algorithm is inconclusive. The
gray points are simply removed and the convex hull of the remaining white and black lattice points is collectively triangulated or, in the 3D case, tetrahedralized into a tetrahedronal. For a survey on methods of tetrahedralizing general scattered points see [16]. Once we have a tetrahedronal with segmented grid points, the methods of Section 2 immediately apply.
Fig. 3. In the left image the white circles indicate lattice (grid) points that are definitely not contained in an object of interest, the black points are ones that definitely belong to the object and the grey points are undetermined. In the middle image, the grey points have been removed and the convex hull of the remaining white and black points is triangulated. In the right image a separating polygon is obtained by the 2-D version of the DMT method.
Fig. 4. The results of the DMT method applied to a Yes/No/Don't_Know data set resulting from a segmentation algorithm applied to rectilinear data obtained from the scan of a chimp tibia bone (only the top portion is used). The right image illustrates the application of the DMT to a Well-Log data set yielding a bounded region where copper exists beyond trace levels.
In Figure 4 we show the results of applying this method to segmented data representing a portion of a bone. Lattice points in the voxel grid have been classified as definitely bone, definitely not bone and inconclusive. The lattice points that could not conclusively be determined as bone or not are removed and the remaining 3D lattice points are used as the basis of a tetrahedronal. The tetrahedronal method is applied where the actual vertex positions are determined by the minimal discrete curvature method described in Section 2.
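As a concrete illustration of this pipeline, the short Python sketch below (our own, with scipy's Delaunay tetrahedralization standing in for the tetrahedronal construction methods surveyed in [16]) discards the inconclusive points, tetrahedralizes the remainder, and reports the active tetrahedra on which the DMT vertices would then be placed.

import numpy as np
from scipy.spatial import Delaunay

def active_tetrahedra(points, labels):
    # points: (n, 3) grid point coordinates; labels: 1 = definitely object,
    # 0 = definitely not object, -1 = "don't know" (inconclusive).
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    keep = labels != -1                      # discard the inconclusive points
    pts, marks = points[keep], labels[keep]
    tet = Delaunay(pts)                      # tetrahedronal of the convex hull
    simplices = tet.simplices                # (m, 4) vertex indices per tetrahedron
    # A tetrahedron is active if its 4 grid points carry both marks 0 and 1.
    marked = marks[simplices]
    active = (marked.min(axis=1) == 0) & (marked.max(axis=1) == 1)
    return pts, simplices[active]

Each returned tetrahedron receives one vertex of the separating surface, positioned here by the minimal discrete norm curvature criterion of Section 2.1.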
3.2 Well-Log Type Data
In the geophysical sciences, it is quite common to collect measurements at various depths (or heights) at several locations. The locations $(x_i, y_i)$ are often positions on the earth and the depth (or height) may possibly vary from location to location. Also the number of measurements per location may vary. This type of data can be represented as $(x_i, y_i, z_{ik}, M_{ik})$, $i = 1, \dots, N$; $k = 1, \dots, n_i$, where $n_i$ is the number of measurements taken at location $(x_i, y_i)$ and $M_{ik} = 0$ or $1$. The example in the right image of Figure 4 utilizes data courtesy of D. Kinsel (see [15]) and represents from 9 to 43 samples taken from 13 well locations. The samples are tested for copper concentrations above trace levels and marked 0 or 1 accordingly. The region where copper can be found is bounded by the isosurface.
3.3 Cuberille Data Sets
We illustrate the application of the tetrahedronal method to cuberille data with the Harvard brain data set mentioned above in the example of Figure 1. In Figure 5 we show the contour surface of the tumor which has been segmented from the brain data, using both the alternating 5-split and the CFK split of a cube into tetrahedra.
Fig. 5. Two different decompositions of a cube into tetrahedra and the resulting DMT surfaces
4 Extension to Function Valued Data Sets The previous sections have dealt with the case where the grid points $P_i$, $i = 1, \dots, N$, of the tetrahedronal grid have the binary property of being marked or unmarked. Many applications result in tetrahedronal grids where there is a dependent function value $F_i$ associated with each grid point. That is, the grid is viewed as a decomposition of the domain of a trivariate function and the dependent values $F_i$ are samples of an underlying trivariate function defined over the tetrahedronal grid domain volume. Given a threshold value $F$, we can classify the grid points $P_i$ as marked or unmarked based upon whether or not $F_i \ge F$, thus allowing the level sets of the
sampled function to be viewed by the tetrahedronal method. Here we have a richer context where the dependent sample values can be used to determine the location of the vertices of the approximating separating surface. A large variety of choices are possible. We describe two possibilities here. The first is based upon the idea that if there is a sign change on an edge, then the plane passing through the edge intersection points is also close to the sign changes in the interior of the tetrahedron. This plane will be determined by three or four points. The vertex point is selected as the point on this plane that intersects the line joining the centroid of a face and the opposing grid vertex, or the line joining the midpoints of two opposing edges. We call this method Intersection between Centroids (IbC); it is further illustrated in Figure 6.
Fig. 6. The computation of the vertex $V_{a,b,c,d}$ in Cases 1 and 2 of the Intersection between Centroids (IbC) method
The second method is based upon the idea that surface intersection points can be computed by linear interpolation on all edges where the function values at the end points encompass the threshold value. The vertex point, $V_{a,b,c,d}$, is taken to be the centroid of these points. We refer to this method as the Centroid of Intersections, CoI. Detailed computational formulas for both of these methods are given below, where we assume that the points have been labelled so that $F_a \le F_b \le F_c \le F_d$. The values $V_{ij}$ represent the intersection points obtained by linear interpolation along the edges $e_{ij}$.

Method: Intersection between Centroids, IbC
$$
V_{abcd} =
\begin{cases}
\dfrac{(F_b - F_a)V_{ab} + (F_c - F_a)V_{ac} + (F_d - F_a)V_{ad}}{F_b + F_c + F_d - 3F_a}, & F_a \le F < F_b < F_c < F_d,\\[2ex]
\dfrac{(F_d - F_a)V_{ad} + (F_d - F_b)V_{bd} + (F_c - F_a)V_{ac} + (F_c - F_b)V_{bc}}{2(F_c + F_d - F_a - F_b)}, & F_a \le F_b < F < F_c < F_d,\\[2ex]
\dfrac{(F_d - F_a)V_{ad} + (F_d - F_b)V_{bd} + (F_d - F_c)V_{cd}}{3F_d - F_a - F_b - F_c}, & F_a \le F_b \le F_c < F < F_d.
\end{cases}
\tag{1}
$$

Method: Centroid of Intersections, CoI

For $F_a \le F < F_b < F_c < F_d$:
$$
V_{abcd} = \frac{1}{3}\left(\frac{F_b - F}{F_b - F_a} + \frac{F_c - F}{F_c - F_a} + \frac{F_d - F}{F_d - F_a}\right) P_a
+ \frac{1}{3}\left(\frac{F - F_a}{F_b - F_a}\, P_b + \frac{F - F_a}{F_c - F_a}\, P_c + \frac{F - F_a}{F_d - F_a}\, P_d\right).
$$

For $F_a \le F_b < F < F_c < F_d$:
$$
V_{abcd} = \frac{1}{4}\left(\frac{F_c - F}{F_c - F_a} + \frac{F_d - F}{F_d - F_a}\right) P_a
+ \frac{1}{4}\left(\frac{F_d - F}{F_d - F_b} + \frac{F_c - F}{F_c - F_b}\right) P_b
+ \frac{1}{4}\left(\frac{F - F_b}{F_c - F_b} + \frac{F - F_a}{F_c - F_a}\right) P_c
+ \frac{1}{4}\left(\frac{F - F_a}{F_d - F_a} + \frac{F - F_b}{F_d - F_b}\right) P_d.
\tag{2}
$$

For $F_a \le F_b \le F_c < F < F_d$:
$$
V_{abcd} = \frac{1}{3}\left(\frac{F - F_a}{F_d - F_a} + \frac{F - F_b}{F_d - F_b} + \frac{F - F_c}{F_d - F_c}\right) P_d
+ \frac{1}{3}\left(\frac{F_d - F}{F_d - F_a}\, P_a + \frac{F_d - F}{F_d - F_b}\, P_b + \frac{F_d - F}{F_d - F_c}\, P_c\right).
$$
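The CoI rule in particular amounts to only a few lines of code. The Python sketch below (ours, not the author's implementation) loops over the six edges of a tetrahedron, linearly interpolates a crossing point wherever the threshold is straddled, and returns the centroid of those crossings; this reproduces formula (2) without the explicit case analysis.

import numpy as np
from itertools import combinations

def coi_vertex(P, F, threshold):
    # P: (4, 3) tetrahedron corner positions; F: (4,) function values there.
    # Returns the Centroid of Intersections (CoI) vertex, or None if the
    # tetrahedron is not active for this threshold.
    P = np.asarray(P, dtype=float)
    F = np.asarray(F, dtype=float)
    crossings = []
    for i, j in combinations(range(4), 2):
        lo, hi = sorted((F[i], F[j]))
        if lo < threshold <= hi:             # edge straddles the threshold
            t = (threshold - F[i]) / (F[j] - F[i])
            crossings.append((1.0 - t) * P[i] + t * P[j])
    if not crossings:
        return None
    return np.mean(crossings, axis=0)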
Typical results of the application of the tetrahedronal method with dependent functional values are shown in Figure 7. The left example is from a FEM (finite element method) combustion simulation. The tetrahedronal has 47,025 grid points and 215,040 tetrahedra. Here we have used the Centroid of Intersections, CoI, method of determining the isosurface vertices.
Fig. 7. On the left is shown the results of a Finite Element Method (FEM) combustion simulation. The isosurface is displayed along with the edges of the tetrahedronal. The CoI method of computing isosurface vertices is used. On the right, the Delta Wing data is displayed at isolevel 0.22977 which yields a surface with some interesting topology. The tetrahedronal has 211,680 grid points and 1,005,657 tetrahedra. The IbC method of computing vertex locations is used.
An example which utilizes the "Delta Wing at 40° Attack" data set available from NASA Ames is shown in the right image of Figure 7. This data in its original form consists of a curvilinear grid of size 56 × 54 × 70. Here, each of the hexahedral cells has been decomposed into 5 tetrahedra using the decomposition of Figure 5. In Figure 8, we show an example which uses the well-known "Blunt Fin" data set. Here, the IbC method of computing vertex positions is used. In addition, we include the isosurface produced by the Marching Tetrahedra (MT) method. See [1], [10], [14] and [15]. We use the variation that produces a quadrilateral (rather than two triangles) in the case where function values at two grid vertices are above the threshold and two are below. With this version, the surfaces produced by the MT method and the present tetrahedronal method are formal mathematical dual polygon mesh surfaces of each other. Each vertex of one uniquely associates with a face of the other. This is illustrated in the blow-up shown at the bottom right of Figure 8, where the (white) edges of the tetrahedronal method are displayed along with the (black) edges of the surface produced by the MT method.
Fig. 8. The grid of the “Blunt Fin” data set is shown in upper left. The upper right is the DMT using the IbC method of vertex selection and the lower left is the MT. The lower right illustrates the duality of the MT and the DMT.
5 Remarks 1. Since a tetrahedronal can be formed from an arbitrary scattered point cloud (by computing the Delaunay tetrahedralization of the convex hull, see [16]), the methods described here have very widespread application, including datasets with binary and/or function-dependent grid values. The method is very easily implemented, particularly in the case of function-dependent grid values where the formulas of the Intersection between Centroids (IbC) and the Centroid of Intersections (CoI) are available (Equations (1) and (2)). 2. While the IbC and CoI methods of computing vertex positions are both simple and effective (and these are the main reasons we chose to report only on these methods here), there are many possibilities we have yet to fully test. One particularly intriguing possibility which we hope to report upon in the near future is to combine the marching diamond method of [1] with our CoI method. 3. Even for rectilinear grid data applications, the implementation of the DMT method presented here is considerably easier and less tedious than the standard marching cubes (MC) (see [17]) or dual marching cubes (DMC) (see [18]) due to the differences in complexity of the algorithm. There are 256 cases leading to 23 distinct configurations for both the MC and the DMC while the DMT method has only 16 cases and 3 distinct configurations. 4. While the present method is designed for tetrahedronal application, it applies immediately to rectilinear data also by means of the decompositions of Figure 5. It has been noted in the past that when the marching tetrahedra (MT) method is applied to rectilinear data, empirical evidence suggests that typically the number of vertices
and/or triangles only increases by a factor of approximately 2.5. This suggests that the main deterrent to the general adoption of the MT for rectilinear data is not the increase in the complexity of the isosurface, but rather the overall poor triangle quality of the resulting mesh surfaces. The present method eliminates this deterrent as evidenced by the examples presented here. We plan to report on an analysis and empirical study of this issue in the near future as these results become available.
Acknowledgements We wish to acknowledge the support of the US Army Research Office under contract W911NF-05-1-0301 and the US National Science Foundation. We wish to thank Narayanan Chatapuram Krishnan, Vineeth Nallure Balasubramanian and Ryan Holmes for their help and contributions to this project.
References 1. Anderson, J.C., Bennett, J., Joy, K.: Marching Diamonds for Unstructured Meshes. In: Proceedings of IEEE Visualization 2005, pp. 423–429 (2005) 2. Bonneau, G.-P., Hahmann, S.: Polyhedredal Modeling. In: Proceedings of Visualization 2000, pp. 381–387 (2000) 3. Chen, L.-S., Herman, G.T., Reynolds, R.A., Udupa, J.K.: Surface shading in the cuberille environment. Computer Graphics and Applications 10, 33–43 (1985) 4. Chica, A., Williams, J., Andujar, C., Crosa-Brunet, P., Navazo, I., Rossignac, J., Vinacua, A.: Pressing: Smooth Isosurfaces with flats from binary grids. In: Computer Graphics Forum 2007, pp. 1–10 (2007) 5. Co, C.S., Joy, K.: Isosurface generation for large-scale scattered data visualization. In: Proceedings of VMV 2005, pp. 233–240 (2005) 6. Crosa-Brunet, P., Navazo, I.: Solid Representation and Operation using Extended Octrees. ACM Transactions on Graphics, 123–456 (1990) 7. Crosa-Brunet, P., Andujar, C., Chica, A., Navazo, I., Rossignac, J.: Optimal Isosurfaces. Computer Aided Design and Applications (2004) 8. Dyn, N., Hormann, K., Kim, S.-J., Levin, D.: Optimizing 3D triangulations using discrete curvature analysis. Mathematical Methods for Curves and Surfaces, 135–146 (2001) 9. Gibson, S.F.F.: Constrained elastic surface nets: Generating smooth surfaces from binary segmented data. In: Wells, W.M., Colchester, A.C.F., Delp, S.L. (eds.) MICCAI 1998. LNCS, vol. 1496, pp. 888–898. Springer, Heidelberg (1998) 10. Gregorski, B.F., Wiley, D.F., Childs, H., Hamann, B., Joy, K.: Adaptive contouring with quadratic tetrahedral. In: Scientific Visualization: The Visual Extraction of Knowledge from Data, pp. 3–15 (2006) 11. Kaus, M., Warfiedl, S.K., Nabavi, A., Black, M., Kikinis, R.: Segmentation of MRI of brain tumors. Radiology 218, 586–591 (2001) 12. Kim, S.-J., Kim, C.-H., Levin, D.: Surface simplification using a discrete curvature norm. Computers & Graphics 26, 657–663 (2002) 13. Lagopoulos, J.: Voxel-based morphometry made simple. Acta Neoropsychiatrica 9(3), 213–214 14. Linsen, L., Prautzsch, H.: Fan Clouds – An Alternative to Meshes. In: Asano, T., Klette, R., Ronse, C. (eds.) Geometry, Morphology, and Computational Imaging. LNCS, vol. 2616, pp. 451–471. Springer, Heidelberg (2003)
194
G.M. Nielson
15. Nielson, G.M.: Modeling and visualizing volumetric and surface-on-surface data. In: Focus on Scientific Visualization, pp. 219–274. Springer, Heidelberg (1992) 16. Nielson, G.M.: Tools for Triangulations and Tetrahedrizations and Constructing Functions Defined over Them. In: Scientific Visualization: Overviews, Methodologies Techniques, pp. 429–526. IEEE Computer Society Press, Los Alamitos (1997) 17. Nielson, G.M.: On Marching Cubes. Transactions on Visualization and Computer Graphics 9(3), 283–297 (2003) 18. Nielson, G.M.: Dual Marching Cubes. In: Proceedings of Visualization 2004, pp. 489–496. CS Press (2004) 19. Ning, P., Bloomenthal, P.: An evaluation of implicit surface tilers. IEEE Computer Graphics and Applications 13(6), 33–41 20. Sreevalsan-Nair, J., Linsen, L., Hamann, B.: Topologically accurate dual isosurfacing using ray intersection. Journal of Virtual Reality and Broadcasting 4(4), 1–12 (2007) 21. Udupa, J., LeBlanc, V., Schmidt, H., Imielinska, C., Saha, P., Grevera, G., Zhuge, Y., Molholt, P., Jin, Y., Currie, L.: A methodology for evaluating image segmentation algorithms. In: SPIE Conference on Medial Imaging, San Diego, pp. 266–277 (2002) 22. Vivodtzev, F., Bonneau, G.-P., Linsen, L., Hamann, B.: Hierarchical Isosurface Segmentation based on Discrete Curvature. In: Proceedings of ACM International Conference Proceeding Series (2003) 23. Warfield, S., Kaus, M., Jolesz, F., Kinkinis, R.: Adaptive template moderate spatially varying statistical classification. Med. Image Anal. 4, 43–55 (2000)
Vision-Based Localization for Mobile Robots Using a Set of Known Views Pablo Frank-Bolton, Alicia Montserrat Alvarado-González, Wendy Aguilar, and Yann Frauel Departamento de Ciencias de la Computación Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas Universidad Nacional Autónoma de México
[email protected]
Abstract. A robot localization scheme is presented in which a mobile robot finds its position within a known environment through image comparison. The images being compared are those taken by the robot throughout its reconnaissance trip and those stored in an image database that contains views taken from strategic positions within the environment, and that also contain position and orientation information. Image comparison is carried out using a scale-dependent keypoint-matching technique based on SIFT features, followed by a graph-based outlier elimination technique known as Graph Transformation Matching. Two techniques for position and orientation estimation are tested (epipolar geometry and clustering), followed by a probabilistic approach to position tracking (based on Monte Carlo localization).
1 Introduction Robot localization techniques consider a wide variety of perception models, which are the techniques they use to obtain information from their surroundings and interpret the data. Some proposals base the sensing of the environment on sonar or laser readings [1, 2]; others pinpoint a robot's location by triangulating the intensity of signals being broadcast for that purpose [3, 4]; there are even those that shun the use of a perception system and decide to keep track of the robot position through dead reckoning, which, of course, becomes terribly inaccurate over time. Robot vision is one of the richest – yet most complex – perception models. Several proposals of vision-based localization have been made using various approaches [5-13]. Among these, some techniques use a collection of training images to help the robot in finding its current position [8-13]. In this work, a vision-based localization system is described, in which a robot navigates through a known environment based on image comparison and position estimation. A robot is required to detect its position and then track its own movements within a defined space of which an image database was previously gathered. A similar idea is presented in [10] using omnidirectional images. To obtain these images, special omnidirectional cameras are necessary, which are specialty items that may be expensive or difficult to obtain. In this work however, no such special camera is required. The system is designed in such a way that even a web-cam may be employed.
This is also the case in [11], where the system is also monocular. However, in the latter, a different reference scheme is employed in which key points are acquired in conjunction with a laser-defined distance. In the work presented here, no such distance is necessary, so that the system needs no other sensor than a camera. Robot localization comes in two flavors: global localization (in which a robot's initial position is not known) and position tracking (in which the starting position is known, as well as the odometry of its movements). In this work, the two localization problems are solved as described in the next sections. The rest of the paper is organized as follows. Section 2 presents the outline of the localization process. Section 3 presents the setting scenario and the image database generation. Image comparison between the actual robot view and the images stored in the database is discussed in section 4. In section 5 two techniques for position and angle estimation are presented: i) epipolar geometry and ii) quality threshold clustering. This is followed, in section 6, by a description of a probabilistic approach to position tracking based on Monte Carlo Localization. Finally, results of the proposed approach are presented in section 7 and conclusions are outlined in section 8.
2 Process Outline The proposed vision-based localization approach of this paper relies on the identification of similar images between the actual view and the reference views, on pose estimation between similar views and on probabilistic position tracking. This approach is illustrated in Figure 1 for both the global localization and the position tracking cases. In the following sections, each element of this approach is described.
Fig. 1. Process outline (a) and Visual Localization detail (b)
3 Reference Image Acquisition The navigation scenario of the robot is represented by a set of manually selected views. These represent key positions, or nodes, separated by roughly 90 cm, from which eight different views were collected, spanning 360 degrees in 45 degree steps. Figure 2 represents an example of a test area, with 14 nodes and eight views for each one, giving a total of 112 images. Other research is also based on a reference information stage. In [11], reference images are obtained for the robot's path, and in [10], manually positioned reference images, as well as distance measurements, help create the feature map. Once this stage is complete, no other "learning" stage is required and localization may be carried out directly.
Fig. 2. Test environment. The test area is a lab section surrounded by project posters. In the sketch, circles with numbers represent the viewpoint nodes, totaling 14. Each square represents 30 cm.
4 Image Comparison The first step in determining the robot's position is gaining information on which nodes have similar images to that of the captured view (which is what the robot is "seeing" in the unknown actual position). This step indirectly points towards the robot's state, for image similarity also entails certain restrictions on the robot's probable position and orientation, also called pose. In other words, if the robot's actual view is very similar to some node's viewpoint, its pose is also related to the actual robot's pose. In the next section, this relation and the estimation of the actual pose is discussed. First, similarity between images must be defined and determined through image processing. What can be compared in the images? There are many alternatives to solve this question, which is a significant area of research in and of itself. For this work, David Lowe's SIFT (Scale-Invariant Feature Transform) [14] feature extractor was selected. It identifies points of interest, or keypoints, within an image depending on each pixel's intensity derivatives in the scale-space [15] of that particular image. A descriptor is
assigned to each point. Each descriptor contains a 128-element feature vector with information on the keypoint's region orientations. Each keypoint is defined by a pair of coordinates (position in the 2D image matrix), a scale (level at which the keypoint is found inside the scale-space decomposition), an orientation (obtained from the intensity gradient of that particular location), and the 128-element feature vector describing the keypoint's surroundings. In his work [14, 25], Lowe presents an approximate nearest-neighbor method for comparing two sets of keypoints. This method is called BBF (Best-Bin-First). One important drawback of this technique is that there is no spatial information taken into consideration (other than region similarity) for obtaining the nearest match. This means that if a keypoint's region descriptor (the 128-element feature vector) and its orientation and scale closely match another keypoint's data, these are a candidate match. Notice that this may happen for two completely unrelated points that share local similarity but that are really at very different 3D locations. These erroneous matches are called outliers. A technique called GTM (Graph Transformation Matching) [16] was implemented to reduce these types of outliers in the final matching. GTM has a simple but effective guiding principle: iteratively eliminate correspondences that disrupt the neighborhood relationships. In order to do so, it constructs a K-nearest-neighbor (K-NN) graph for each view image (based on the space coordinates of detected feature points), and during each iteration it a) removes the vertex (match) which most disrupts the similarity structure of both graphs, b) reconstructs the corresponding K-NN graphs and repeats this process until both graphs are identical, which means outliers have been removed.
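The GTM iteration can be sketched in a few lines. The Python code below is our own illustrative version of the idea (remove the match whose neighborhood structure differs most between the two K-NN graphs, rebuild, repeat); details of the published algorithm [16], such as the choice of K and its median-distance filtering, are omitted.

import numpy as np

def knn_adjacency(points, k):
    # 0/1 adjacency matrix of the K-nearest-neighbor graph
    # (directed: row i marks the k nearest neighbors of point i).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    adj = np.zeros_like(d, dtype=int)
    for i in range(len(points)):
        adj[i, np.argsort(d[i])[:k]] = 1
    return adj

def gtm_filter(pts1, pts2, k=4):
    # pts1[i] in image 1 is matched to pts2[i] in image 2 (BBF candidates).
    # Iteratively discard the match that most disrupts the neighborhood
    # structure until both K-NN graphs agree.
    pts1, pts2 = np.asarray(pts1, float), np.asarray(pts2, float)
    keep = np.arange(len(pts1))
    while len(keep) > k + 1:
        a1 = knn_adjacency(pts1[keep], k)
        a2 = knn_adjacency(pts2[keep], k)
        residual = np.abs(a1 - a2)
        if residual.sum() == 0:              # graphs identical: done
            break
        worst = np.argmax(residual.sum(axis=0) + residual.sum(axis=1))
        keep = np.delete(keep, worst)
    return keep                              # indices of the surviving matches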
Fig. 3. (a) Keypoint detection and matching with BBF. (b) Result of Graph Transformation Matching for images 1 and 2. (c) Matches after GTM.
The entire process is shown through images in Figure 3. In (a), keypoints of two images are detected and described using SIFT (shown as the line-end dots). These are then matched using the BBF algorithm and shown as connecting lines. In (b), the identical consensus graphs resulting from GTM are shown. Finally, in (c), the outlier-free matching set resulting from GTM is shown. This process may be implemented for the comparison of the actual robot view with the entire 112 images of the database. As mentioned before, every keypoint was
detected in relation to a particular scale. Low-value scales are related to high-frequency changes in the image. Therefore, it is at these scales that most small details or even noise will be detected. This scale range is also where most of the keypoints will be detected, so, to optimize the image comparison process, a dynamic-scale match range is defined. This means that a first matching approximation will take into consideration only medium- and high-scale keypoints (medium and low frequencies). If this first comparison renders enough matches to warrant further analysis, the matching will include the low-scale (high-frequency) information. After this step, a list of relevant nodes (those with more matches than a certain threshold value) will constitute the base information for the position estimation phase.
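A minimal version of this dynamic-scale matching is sketched below in Python. It is an illustration under our own assumptions (a KD-tree nearest-neighbor search with Lowe's ratio test standing in for BBF, and a hypothetical scale_cutoff parameter); it is not the authors' implementation.

import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc1, desc2, ratio=0.8):
    # Nearest-neighbor matching with Lowe's ratio test
    # (used here in place of the BBF search of [14, 25]).
    tree = cKDTree(desc2)
    dist, idx = tree.query(desc1, k=2)
    good = dist[:, 0] < ratio * dist[:, 1]
    return [(i, idx[i, 0]) for i in np.nonzero(good)[0]]

def dynamic_scale_match(kp_scales1, desc1, kp_scales2, desc2,
                        scale_cutoff=2.0, min_matches=20):
    desc1, desc2 = np.asarray(desc1, float), np.asarray(desc2, float)
    scales1, scales2 = np.asarray(kp_scales1), np.asarray(kp_scales2)
    # First pass: only medium- and high-scale keypoints (low/medium frequencies).
    coarse1 = np.nonzero(scales1 >= scale_cutoff)[0]
    coarse2 = np.nonzero(scales2 >= scale_cutoff)[0]
    matches = match_descriptors(desc1[coarse1], desc2[coarse2])
    if len(matches) < min_matches:
        return []                    # node not similar enough; skip the fine pass
    # Enough coarse matches: redo the matching with all scales included.
    return match_descriptors(desc1, desc2)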
5 Pose Estimation At this point, the only information gathered in relation to the actual robot pose is a list of node views that are similar to the current view. As mentioned before, these views have a pair of position coordinates (in relation to a predefined origin) and an orientation coordinate (spanning 360 degrees through 45 degree rotations). Together, these three coordinates form the State Coordinate of the view. Two different techniques were implemented (as alternatives) to obtain the state coordinate of the robot in relation to the candidate nodes. The first technique is based on motion estimation using epipolar geometry [17, 26, 27]. The second method is based on clustering and is called QTc (Quality Threshold clustering) [18]. 5.1 Epipolar Geometry This technique attempts to recover the spatial information through projective data. This means that it can recover the relation between two separate projections of the same (or very similar) view. In this case, the scene projections are the 2D images of the related views. The geometry forces a tight relation between points in one image plane and lines in the other. This relation is called the epipolar constraint, and states that corresponding points must be on conjugated epipolar lines [17]. The algebraic expression of this restriction is $l_r = E p_l$, where $E$ represents a transformation matrix from a point to a line. This transformation matrix is known as the Essential matrix, and contains all the information necessary to reconstruct the epipolar geometry of the scene, from which motion estimation may be carried out. It is, therefore, of vital importance to obtain this matrix to be able to estimate the relation of one perspective to another. There are several ways of doing this [19-21]. One such technique is called the normalized eight-point algorithm [22, 23], which uses a series of matches between projections to estimate $E$. It is called "normalized" because it normalizes pixel coordinates to reduce unwanted numerical instability. The next step is to extract rotation and translation information between the projections. This process is shown in [17]. Once obtained, the rotation and translation matrices $R$ and $T$, respectively, contain information of how one camera system may be moved to arrive at the other. If motion between viewpoints is known, and the state coordinate of one viewpoint is known (one of the candidates obtained through image
comparison), then the other viewpoint's state coordinate may be obtained (the actual robot pose). It is very important to notice that the resulting solution for T is up to a scale factor. Even after solving the sign ambiguity for T [17], it only indicates the direction of translation and not the actual distance. To obtain the translation, at least two suitable references (ones that do not produce instability [26]) must be used so as to determine the point where the two lines of translation overlap. The precision of the overlap area depends on the accuracy at which the 3D space is reconstructed by the epipolar geometry. With real data, this precision is very sensitive to noise, and cannot be correctly determined if the actual and reference images form degenerate configurations [26, 27]. This is the reason an alternative pose estimation technique – based on Quality Threshold clustering – was tested.

5.2 Quality Threshold Clustering

Clustering is a data classification and ordering technique with many applications in the areas of data mining, pattern recognition, image analysis and many more. The basic principle is to decide on a classification approach and then, depending on the nature of the data, generate groups or generalized representations. This may serve the purpose of grouping elements under defining characteristics, or finding boundaries between element clusters. This last interpretation may have, if thus treated, a spatial connotation that is relevant to this study. If boundaries are defined, then areas are also defined. Starting from the candidate nodes extracted in the comparison phase, a clustering algorithm may extract a definite area of concentration, both for position and for orientation, therefore generating a simple but accurate state estimation. There are many types of clustering methods. The nature of this work requires a clustering technique that dynamically selects the number of clusters and generates cluster centroids that represent them. One technique that accomplishes this is called QTc (Quality Threshold clustering) [18]. Originally used in genomics, QTc is a partitional clustering scheme that derives from the popular k-means algorithm. It trades processing speed for the freedom of not having to define the number of clusters a priori. QTc works as follows:

1. A maximum cluster radius is defined.
2. Construct a candidate cluster for each point. This is done by including the next closest neighbor until the maximum distance is exceeded.
3. The cluster with the most elements is maintained and represented by a cluster centroid (removing all cluster members).
4. Recurse with the remaining points until no cluster may be detected.
An example of this process is shown in Figure 4.
Fig. 4. QTc applied to 5 points. Circles represent maximum cluster radius.
In Figure 4, (a) shows the initial set of points. The numbers inside the nodes represent the node's weight (relevant in the definition of the centroid). Image (b) shows the maximum cluster (which is the one tied to the node in its center). Image (c) shows the location of the cluster centroid, which is the center of mass of the total node weights. This is obtained by the weighted average of the node positions within the maximum cluster. The values used to define the relevance of each node's position are, in this case, the number of matches registered between that particular view and the robot's actual view. Image (d) shows the final cluster centroids. Note that each final centroid is too far away from the others to unite in a single cluster. The example shown above represents the cluster estimation of position. Nevertheless, the same logic may be applied to the orientations. The final orientation estimate is the weighted average of the candidate orientations:
$$\hat{\Theta} = \sum_i W_i O_i \tag{1}$$
where $O_i$ represents the orientation of candidate $i$, and $W_i$ represents the weight of candidate $i$ (number of matches of candidate $i$ / total number of matches). The results of implementing the comparison and position estimation stages are discussed in the Results section.
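For illustration, here is a compact Python sketch of QTc with weighted centroids as used above; the input format (node positions with match-count weights), the radius-around-seed rule and the radius parameter are our assumptions, not code from [18].

import numpy as np

def qt_clustering(positions, weights, max_radius):
    # positions: (n, 2) candidate node coordinates; weights: (n,) match counts.
    # Returns the weighted cluster centroids in QTc fashion (largest cluster
    # first, then recurse on what is left).
    positions = np.asarray(positions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    remaining = list(range(len(positions)))
    centroids = []
    while remaining:
        best = []
        for seed in remaining:
            # Candidate cluster: every remaining point within the radius of seed
            # (a simplification of the grow-until-radius-exceeded rule).
            cluster = [j for j in remaining
                       if np.linalg.norm(positions[j] - positions[seed]) <= max_radius]
            if len(cluster) > len(best):
                best = cluster
        # Represent the winning cluster by the center of mass of its match counts.
        centroids.append(np.average(positions[best], axis=0, weights=weights[best]))
        remaining = [j for j in remaining if j not in best]
    return centroids

The estimated robot position is then taken as the single surviving centroid, or as the average of the centroids if more than one remains.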
6 Monte Carlo Localization Monte Carlo localization, or MCL [5, 10, 24], is a type of particle filter applied to robot localization. This technique employs a series of particles that represent the state space, which for this particular case is the area of possible robot positions. If global estimation is being attempted with no information on the robot's initial position, then the robot's position is represented by a uniform probability distribution over all possible particles. An importance function is used to assign weights to each particle, and as the process continues, only particles that have weights over a certain value are maintained. In the localization scenario, weights are assigned in relation to a perceived signal. In [1], the perceived signal is a sonar reading of the robot's environment. Weights are assigned to particles depending on the probability that such a reading might have been registered from each particular particle's position (node). For this work, the perceived signal is a captured image, and the assigned weight is the number of matches found between the actual view and each particle's viewpoints. This weight is normalized over the total number of matches so as to accurately represent the probability of being the "most similar" node. In traditional MCL, the number of particles is maintained throughout the operation. However, this is not the case in this study. When doing position tracking, only particles (nodes) close to estimated positions are incorporated. The sequence of steps needed to pinpoint the best particle is as follows:

1. Robot motion: the robot moves and generates a series of candidate nodes, depending on the last stage's estimated position, whether from a known initial position or the result of global localization.
2. Sensor reading: after a reading, probability weights are assigned to particles. Those with lower weights than a certain threshold are eliminated.
3. The sequence is repeated until only one particle is being propagated.
This process is used to narrow down the robot's possible position during position tracking. One important aspect of the visual correction is that it has a certain precision that may be of advantage only if the alternative (odometric tracking) has a lower precision due to increasing uncertainty. This means that visual correction is used only if the estimated odometry ambiguity exceeds the visual localization uncertainty. After the correction, the process starts again from odometry tracking.
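The weighting-and-pruning loop is summarized in the Python sketch below; it is our own schematic of the variant described in this section (match counts as importance weights, threshold pruning, candidate nodes regenerated around the surviving estimates), with helper names such as count_matches and nodes_near assumed rather than taken from the paper.

import numpy as np

def visual_mcl_step(particles, current_view, count_matches, weight_threshold):
    # particles: list of candidate nodes (each with stored reference views).
    # count_matches(view, node) -> number of SIFT/GTM matches between the
    # robot's current view and the node's reference views (assumed helper).
    weights = np.array([count_matches(current_view, p) for p in particles], float)
    if weights.sum() == 0:
        return particles, None                      # no visual evidence this step
    weights /= weights.sum()                        # normalize to probabilities
    keep = weights >= weight_threshold              # prune unlikely particles
    survivors = [p for p, k in zip(particles, keep) if k]
    best = particles[int(np.argmax(weights))]
    return survivors, best

def track(particles, views, count_matches, nodes_near, weight_threshold=0.05):
    # Repeat the motion / sensing cycle until a single particle remains.
    best = None
    for view in views:
        particles, best = visual_mcl_step(particles, view, count_matches,
                                          weight_threshold)
        if len(particles) <= 1:
            break
        # Robot motion: regenerate candidates around the surviving estimates
        # (nodes_near is an assumed helper returning nodes within a radius).
        particles = [n for p in particles for n in nodes_near(p)]
    return best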
7 Results Tests carried out in diverse settings proved that an indoor setting with a medium to high amount of detectable characteristics is a good benchmark test. This renders a more precise importance assignment (through keypoint matches) for reference nodes. In relation to the comparison stage, SIFT, BBF and GTM accomplish the task with minor precision errors. This is because certain outlier arrangements are undetectable by GTM and are therefore contributing factors in erroneous position estimation. In practice, due to the inability to eliminate all outliers, many more matches are needed to ensure a good estimation. For this reason, a "feature-rich environment" is advisable for augmenting estimation efficiency. In the series of experiments carried out in the above mentioned scenario, epipolar geometry proved too sensitive to noise (outliers or differences in the exact nature of the observed scene), and therefore not as precise as QTc. Orientation estimation accuracy was calculated via the percentage of orientation error. On average, in the case of epipolar geometry, the Θ error was about 3.5%. This may seem low, but it means that an orientation estimation might be off by ±13 degrees. The alternative state estimation method, QTc, turned out to render more accurate results for orientation coordinates as well as very good first order approximations for the position of the robot. The Θ error was about 2.5% on average (about ±9 degrees). The actual position of the robot was determined as the resulting centroid of nodes (if only one remained) or the average of the resulting centroids (if more than one centroid was brought forth by QTc). These "global localization" estimates were, on average, only 46 cm off (for this specific reference node configuration). The next step in state estimation is to move the robot and generate a set of probably-close nodes. This is done by moving the estimated position and adding, as node candidates, those nodes that are within a certain radius. This is done for all estimated positions (centroids) remaining from the last stage. The comparison-localization stages are repeated and a new state coordinate calculated. In practice, QTc improves position estimation only if a denser particle grid is incorporated. For time optimization, global localization may be carried out based on a medium density particle grid, followed by position tracking based on a local subset of a high density particle grid.
8 Conclusions We proposed a localization technique based on the comparison of a current view against a database of known images, all of which were obtained by a simple monocular system. The comparison uses the SIFT feature matches as a similarity metric. The
Vision-Based Localization for Mobile Robots
203
information from several views is integrated using a clustering method (which for this work proved far superior to the alternative of using epipolar geometry). Temporal progression is taken into account with a variant of Monte Carlo localization. The proposed technique proved effective in solving the two basic localization problems. If the space is represented by a grid of particles, the estimated position and orientation varies depending on the grid's density. Epipolar geometry based on the normalized eight point algorithm proved too sensitive for position estimation, but may still be used for fine orientation adjustments once a general state coordinate is calculated through alternative methods (clustering in this case). Another factor that will contribute to time optimization is the scale at which images are compared, which is a factor that is also environment-dependent. Acknowledgments. The authors acknowledge financial support from the PAPIIT program of Universidad Nacional Autónoma de México, under grant IN104408.
References 1. Burgard, W., Fox, D., Henning, D.: Fast Grid-Based Position Tracking for Mobile Robots. In: Brewka, G., Habel, C., Nebel, B. (eds.) KI 1997. LNCS, vol. 1303, pp. 289–300. Springer, Heidelberg (1997) 2. Zhao, F.-J., Guo, H.-J., Abe, K.: A mobile robot localization using ultrasonic sensors in indoor environment. In: Proceeding of the 6th IEEE International Workshop on Robot and Human Communication, pp. 52–57 (1997) 3. Detweiler, C., Leonard, J., Rus, D., Teller, S.: Passive Mobile Robot Localization within a Fixed Beacon Field. In: Proceedings of the International Workshop on the Algorithmic Foundations of Robotics, New York (2006) 4. Sewan, K., Younggie, K.: Robot localization using ultrasonic sensors. In: Proceedings of Intelligent Robots and Systems, pp. 3762–3766 (2004) 5. Elinas, P., Little, J.J.: σMCL: Monte-Carlo localization for mobile robots with stereo vision. In: Proceedings of Robotics: Science and Systems, Cambridge, MA, USA, pp. 373– 380 (2005) 6. Beardsley, P.A., Zisserman, A., Murray, D.W.: Navigation using affine structure from motion. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 801, pp. 85–96. Springer, Heidelberg (1994) 7. Robert, L., Zeller, C., Faugeras, O., Hebert, M.: Applications of non-metric vision to some visually guided robotics tasks. Technical Report 2584, INRIA. France (1995) 8. Wolf, J., Burgard, W., Burkhardt, H.: Robust vision-based localization for mobile robots using an image retrieval system based on invariant features. In: Proceedings of the IEEE International Conference on Robotics & Automation (2002) 9. Sim, R., Dudek, G.: Comparing image-based localization methods. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence, pp. 1560–1562 (2003) 10. Menegatti, E.P.E., Zoccarato, M., Ishiguro, H.: Image-based Monte-Carlo localization with omnidirectional images. Robotics and Autonomous Systems 48, 17–30 (2004) 11. Bennewitz, M., Stachniss, C., Burgard, W., Behnke, S.: Metric localization with scaleinvariant visual features using a single camera. In: Proceedings of European Robotics Symposium (EUROS 2006), pp. 143–157 (2007)
204
P. Frank-Bolton et al.
12. Royer, E., Lhuillier, M., Dhome, M., Lavest, J.-M.: Monocular Vision for Mobile Robot Localization and Autonomous Navigation. International Journal of Computer Vision 74, 237–260 (2007) 13. Ulrich, I., Nourbakhsh, I.: Appearance-based place recognition for topological localization. In: Proceedings of the IEEE International Conference on Robotics & Automation, pp. 1023–1029 (2002) 14. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision 60, 91–110 (2004) 15. Lindeberg, T.: Scale-space theory: A basic tool for analysing structures at different scales. Journal of Applied Statistics 21, 224–270 (1994) 16. Aguilar, W., Frauel, Y., Escolano, F., Martinez-Perez, M.E., Espinoza-Romero, A., Lozano, M.A.: A Robust Graph Transformation Matching for Non-rigid Registration. Image and Vision Computing (2008) 17. Trucco, E., Verri, A.: Introductory techniques for 3-D computer vision. Prentice-Hall, Englewood Cliffs (1998) 18. Heyer, L.J., Kruglyak, S., Yooseph, S.: Exploring expression data: identification and analysis of coexpressed genes. Genome Research 9, 1106–1115 (1999) 19. Zhang, Z.: Determining the Epipolar Geometry and its Uncertainty: A Review. International Journal of Computer Vision 27, 161–195 (1998) 20. Luong, Q.T., Deriche, R., Faugeras, O., Papadopoulo, T.: On determining the fundamental matrix: analysis of different methods and experimental results. Technical report RR-1894. Sophia-Antipolis, France, INRIA (1993) 21. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) 22. Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135 (1981) 23. Hartley, R.I.: In defence of the 8-point algorithm. In: Proceedings of the Fifth International Conference on Computer Vision, p. 1064. IEEE Computer Society, Los Alamitos (1995) 24. Fox, D., Burgard, W., Dellaert, F., Thrun, S.: Monte Carlo localization: Efficient position estimation for mobile robots. In: Proceedings of the National Conference on Artificial Intelligence, pp. 343–349 (1999) 25. Beis, J., Lowe, D.G.: Shape indexing using approximate nearest-neighbour search in highdimensional spaces. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, Puerto Rico, pp. 1000–1006 (1997) 26. Luong, Q.-T., Faugeras, O.D.: The Fundamental Matrix: Theory, Algorithms, and Stability Analysis. International Journal of Computer Vision 17, 43–75 (1996) 27. Haralick, et al.: Pose Estimation From Corresponding Point Data. IEEE Trans. On Systems, Man and Cybernetics 19(6), 1426–1446 (1989)
On the Advantages of Asynchronous Pixel Reading and Processing for High-Speed Motion Estimation Fernando Pardo, Jose A. Boluda, Francisco Vegara, and Pedro Zuccarello Departament d’Inform` atica. Universitat de Val`encia http://tapec.uv.es
Abstract. Biological visual systems are becoming an interesting source for the improvement of artificial visual systems. A biologically inspired read-out and pixel processing strategy is presented. This read-out mechanism is based on Selective pixel Change-Driven (SCD) processing. Pixels are individually processed and read-out instead of the classical approach where the read-out and processing is based on complete frames. Changing pixels are read-out and processed at short time intervals. The simulated experiments show that the response delay using this strategy is several orders of magnitude lower than current cameras while still keeping the same, or even tighter, bandwidth requirements.
1
Introduction
Most current image and video processing applications are based on processing an image stream coming from a visual system. Each image in this stream is a snapshot of the environment taken at regular intervals. Biological visual systems work in a different way: each sensor cell (image pixel) sends its illumination information asynchronously. Another biological feature is the space-variant cell distribution in the image plane, which reduces the amount of visual data to be transmitted and processed. The artificial strategy based on a synchronous flow of space-uniform images is a few decades old, while the biological visual system is the result of several million years of evolution. Current technology makes it difficult to manufacture a complex biological visual system, but some ideas taken from the biological system can be adapted, with some changes, to improve artificial visual systems. The space-variant nature of most biological eyes has been extensively exploited in artificial visual systems for roughly two decades. Nevertheless, little attention has been focused on exploiting the asynchronous nature of biological visual systems, probably because the advantages it affords are not worth stopping current work with synchronous static images. This paper shows that independent pixel acquisition and processing has advantages over full acquisition and frame processing; moreover, its implementation is feasible with current imaging technology, taking into account the limitations in bandwidth and processing power. G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 205–215, 2008. c Springer-Verlag Berlin Heidelberg 2008
206
1.1
F. Pardo et al.
Biologically Inspired Visual Sensors
The space-variant nature of visual acuity found in the human eye has been already studied in the past. Some artificial sensors have been designed to have a foveal log-polar pixel distribution [1]. Similar pixel distribution can be modeled on conventional cameras. This is not the case when trying to implement independent pixel sampling and processing since conventional cameras are based on constant sampling of space and time. Visual system studies of neural activity show that some cells respond to illumination transients in time and space (ON/OFF cells), while others have a sustained response dependent on illumination levels. This behavior has been emulated in many biomorphic sensors. The combined transient and sustained behavior has already been implemented [2], although this approach was frame based rather than asynchronous like the biological visual system. The cell activity (measured as spikes) dependency on light intensity has been emulated in some sensors which also worked asynchronously [3,4]; in these sensors each pixel works independently and sends spikes at a period which inversely depends on light intensity. The available output bandwidth is allocated according to pixel output demand, favoring the brighter pixels and not the most interesting ones that are usually those with a high spatial and/or temporal transient. There are several sensors where the spike interval is dependent on change intensity [5,6], allowing the selection of interesting areas (those with largest changes). The change event signaling depends on a Contrast Sensitivity Threshold (CST), which is also found in most biological vision systems. This CST has already been successfully employed to accelerate some movement analysis algorithms [7,8]. It is usual to consider in movement analysis that the most interesting pixel to process is the one with largest transient change. There are also some image sensor implementations that not only detect image changes, but also yield the pixel presenting a largest change thanks to a Winner Take All circuit (WTA) [9]. These sensors avoid the readout of the complete frame [10,11], allowing a speedup in the process of movement detection and analysis. 1.2
Change-Driven Image Processing
The fixed spatial and temporal sampling rate of standard cameras makes it difficult to exploit many of the advantages of biological vision systems. Nevertheless, there is a feature that can be exploited for artificial image processing: it is the change driven image processing. In biological systems, illumination variations drive movement detection; while in typical artificial systems each frame is completely processed at every time interval, even if no changes have taken place at all. Processing only those pixels that have changed, above a certain threshold, can decrease the total amount of data to be processed, while still maintaining accurate image processing. Some sensors have been designed to take advantage of data reduction (image compression) [12,13]. In this article we present Selective Change Driven (SCD) processing as a mechanism for reducing the amount of data to process. This technique is especially
On the Advantages of Asynchronous Pixel Reading
207
interesting for image capture at high rates (above the standard 25/30 fps) or custom high-rate asynchronous pixel-based sensors.
2
Selective Change-Driven Processing
The integration time (shutter) of current pixel technology can be as small as a few microseconds; consequently, a single pixel could deliver visual information at a rate of tenths of kHz, which is several orders of magnitude above the current average speed (25/30 fps). Most commercial cameras have shutter speeds ranging from 1/25 to 1/10,000. The problem of having a high shutter speed is the low Signal to Noise Ratio (SNR) and the poor image quality. Nowadays, there are some cameras that reach this speed, and the limitation is not the integration time (shutter) but the bandwidth: a 10 KHz VGA (640x480) grey-level camera would deliver around 3 Gbyte/s which is very difficult to transmit and almost impossible to process with an embedded system. In fact, these cameras have their own internal storage and are intended to record a few seconds to be processed afterwards. The SCD processing dramatically reduces the bandwidth of this kind of cameras while maintaining the advantages of such high-speed acquisition rates. A further step in data reduction, beyond change-driven image processing, is to process only those changes from frame to frame that can be considered interesting. Only one pixel could be processed per frame, in the most limited bandwidth case. The best pixel to process at each time would be that which presents the higher illumination change since it means two things: first, a large change in intensity means a fast movement around that pixel, and second, it also means that there is an object edge around that pixel. Movement and edges (high spatial/temporal changes) are usually the most interesting parts of a scene. If more bandwidth is available, more than one pixel could be taken from frame to frame; in this case, a list of “most wanted” pixels is prepared based on their illumination transient. SCD image processing reduces the amount of data to be processed by several orders of magnitude. The important question is to know whether this reduction can still offer advantages, not only by reducing bandwidth and processing requirements, but also for motion analysis. In other words, would this technique deliver more accurate motion estimation while still keeping the same pixel rate and processing power? This question is answered in the experimental section of this paper. 2.1
Theoretical SCD Camera Architecture
The proposed SCD image capture strategy works as follows: every pixel works independent from the others. Every pixel has an analog memory with the last read-out value. The absolute difference between the current and the stored value is compared among all pixels in the sensor; the pixel that differs most is selected and its illumination level and address are read out for processing. The read-out can take place at regular intervals or asynchronously, depending on the light capture cell or voting circuit. Experiments shown in next sections use a clocked
208
F. Pardo et al.
read-out strategy as it can be adapted to existing high-speed cameras without change in performance. Though there are already some sensors that behave similarly, there is a fundamental difference with large impact in image processing: every pixel has a memory of the last read-out value, while the existing sensors just signal the most interesting pixel but do not store any value. The advantage of storing the last read-out value is that the choice of the most interesting pixel is based on global change over time instead of just transient. For example, a sudden change in the image (say, switching a light on) would generate many events, but only a few pixels will be processed during the transient. This problem does not appear if the pixel selection is based on the change compared to last stored read-out value, since every pixel that has changed will be processed sooner or later even long after the transient has occurred.
3
Change-Driven Algorithm Implementation
An image processing algorithm can be programmed as the common control flow computing model, that is: an instruction sequence that is executed sequentially in a processor. On the other hand, in the data flow model, instructions are fired when the data needed for these instructions are available in an asynchronous way. The combination of the SCD camera and the data-flow computation model gives the key for the change-driven image processing. Within this approach, redundant computations are avoided since only data that change are processed. This way, the processing delay in the visual tracking loop is speeded-up. In the extreme case that just the pixel with the largest intensity variation is delivered (this could be so because, for example, of SCD bandwidth limitations), only the related computations would be fired. The SCD camera would give mostly the same information that a conventional camera, but delivering the pixels according to its intensity change. The pixel rate would be constrained by the allowed processing stage delay. An image processing algorithm must be partially rebuilt to process only the pixels that change from frame to frame. The data-flow algorithm version requires extra storage to keep track of intermediate results. 3.1
Tracking Algorithm
Change-Driven image processing strategy can be applied to many tracking algorithms, but this mechanism is particularly useful when real-time restrictions are present. One of the simplest object tracking algorithms has been implemented, since the aim of this work is to clearly show the advantages of the SCD strategy. This experiment has been designed for comparing both approaches: the classical with sequential full image processing, versus the change-driven approach. Let ΔT be the integration time of the camera and Δt the time to deliver a single pixel. Usually, ΔT >> Δt thus many pixels are processed during two consecutive frames. For the experiment that is shown in the experimental section ΔT is 500 μs and Δt is 520 ns, thus, around 1000 most significant pixels are processed between two frames.
On the Advantages of Asynchronous Pixel Reading
209
The algorithm implemented tracks a single object by simply calculating the center of mass (CM) of the object as follows. The SCD implementation follows: 1. Let I be the frame being processed and I˜ the stored frame. Both frames are analogically stored in the sensor, so, the following operation is performed inside the SCD sensor: ˜ . (x, y, i) = max(abs(I − I))
(1)
where the max function here returns three values: x and y are the coordinates of the pixel and i is its grey level. This operation calculates the most interesting pixel. This pixel is delivered from the SCD camera to the computing hardware (microcontroller, FPGA, computer, etc.) 2. Once the pixel reaches the computing hardware, there are three possibilities depending on the processing to be performed at this point. Let μ be the threshold to detect the object (the object is detected by grey level thresholding): (a) The first condition checks whether the pixel belonged to the object and now it does not belong to it. The condition and the actions to perform follows: ˜ y) ≥ μ) ∧ (I(x, y) < μ) then if (I(x, (2) sign = −1 process = yes . where sign is a variable for the equation of the new CM and process is the variable that will trigger the processing of the CM or not. (b) The second condition is the opposite; it checks whether the pixel did not belong to the object but now it does: ˜ y) < μ) ∧ (I(x, y) ≥ μ) if (I(x, sign = 1 process = yes .
(3)
(c) In the last possibility the pixel does not change and no processing must be performed: ˜ y) < μ) ∧ (I(x, y) < μ)) ∨ if ((I(x, ˜ y) ≥ μ) ∧ (I(x, y) ≥ μ)) then ((I(x, sign = 0 process = no .
(4)
3. If process = yes the pixel has changed its state and a new CM must be calculated. Let (xcm , ycm ) be the CM coordinates of the object and n the number of pixels belonging to the object. The new CM can be calculated as: n = n + sign xcm ycm
xcm − x = xcm + sign n ycm − y = ycm + sign . n
(5) (6) (7)
210
F. Pardo et al.
4. The last step updates intermediate variables and checks whether it is the end of the current frame integration time and a new one must be captured (these operations are performed inside the SCD sensor): t = t + Δt ˜ y) = i I(x, if (t = ΔT ) then I = capture() t=0.
(8)
The operations performed for each pixel are very simple and can be computed in a very short time. It means that a new CM is calculated right after a new pixel arrived so the object position is accurately calculated with the only delay of the camera shutter. The computation bottleneck of this algorithm is the calculation of the pixel that varied the most, but this computation is performed inside the SCD camera as a feature.
4
Experimental Results
Before designing an SCD camera or sensor some experiments have been carried out to measure motion estimation algorithm accuracy: a standard camera is compared to an SCD camera; the condition is that both have the same resolution and both deliver the same amount of visual information. Let us suppose a constant resolution of 320x240 pixels for both cameras. Let the standard camera image rate be 25 fps, thus the total pixel rate and processing requirement is 1.92 Mpixel/s. An SCD camera with this pixel rate would capture and deliver one pixel every 520 ns. Any integration cell would yield a very poor image with such a short shutter (integration time). First results will be shown supposing it is possible to implement an SCD camera able to capture illumination at 520 ns; with present technology this is not feasible (considering image quality); however, it is the first step to measuring the benefits of such an image capture strategy. Afterward, more realistic implementations will show that these benefits still hold. In our experiments we have assumed an SCD camera with pixel cells working synchronously. We have also assumed a Winner Take All circuit [14], which selects the row and column of the pixel with the largest change. The grey level of this pixel appears at the sensor output along with its row and column addresses. This operation is repeated at the fixed rate of 1.92 Mpixel/s (520 ns). At the same time, light intensity is integrated during the shutter period of 500 μs. To measure and compare two such cameras, a synthesized scene has been created virtually, since there is no way to emulate an SCD camera using standard cameras. The synthesized scene consists of a background in which a ball moves at very high speed (more than 1,500 pixel/s). The ball takes 160 ms to complete its trip. The scene and ball trajectory are shown in Fig. 1. A standard 48 dB SNR random noise has been injected in the image to make the simulation more realistic. Also, a test image sequence with only 24 dB SNR
On the Advantages of Asynchronous Pixel Reading
211
Fig. 1. Ball trajectory and five calculated positions using a standard camera (very low quality, 24 SNR random noise)
random noise has been generated and processed to see how a very noisy environment affects the SCD strategy. The image shown in Fig. 1 corresponds to the very low quality image sequence (24 SNR) exhibiting the last image and last position of the tracked ball along with its sinusoidal trajectory from left to right. 4.1
Same Pixel Rate and Ultra High-Speed Non-realistic Shutter
In this first experiment, the SCD camera and the standard camera are compared supposing the same pixel rate. An ultra high-speed shutter is used for the SCD camera. As we already mentioned, such a fast shutter (around 600 ns) is out of the reach of current technology when good image quality is of concern, but it is interesting as a first approach to estimating SCD maximum performance. The five “X” along the real trajectory shown in Fig. 1 (24 SNR)correspond to the tracked positions using a standard camera, while the dotted line shows the trajectory calculated with the SCD camera (each dot corresponds to the calculated position at that moment). It is clear that it is impossible to measure or even guess the original sinusoidal trajectory of the ball using the five points calculated using the standard camera. The mismatch between the calculated and real trajectories of the SCD algorithm are only due to the high noise of that experiment; the tracked position exactly matches the real sinusoidal trajectory
212
F. Pardo et al.
with a SNR of 48 dB (standard quality) and it is impossible to visually distinguish the calculated trajectory from the original one. The SCD camera perfectly reproduces the ball trajectory, while the standard camera does not offer enough points. However, the main advantage is that since a new object position is calculated for every new pixel, there is a short delay between the real and tracked ball positions using the SCD camera. In contrast, this delay is in the order of milliseconds in a standard camera, since all the pixels must be read-out and processed; it takes almost one frame to complete the position calculation using a standard camera and only one pixel, or very few, to compute the position using the SCD sensor. As can be observed in Fig. 1 (dotted line describing the SCD tracking), there is an error on the ball position estimation of less than one pixel for standard noisy images, nevertheless, this position is only known after some time of processing when all the pixels necessary for the correct point estimation have arrived. Thus, there is an error due to the delay between the computed and real position of the ball. This delay can be calculated dividing the difference between the tracked and the real ball positions by the ball speed. The average delay between the real and tracked trajectories obtained during this sequence is 190 μs. This short delay produces an error of roughly 1 pixel considering the real speed of the ball. The delay of a standard camera comes from the read-out time plus the processing time; a 20 ms delay (half frame) is an optimistic estimate for this time. An error of about 100 pixels is obtained for such a delay; this is one third of the image length and half of the trajectory shown in Fig. 1, thus a standard camera is useless for very high-speed object tracking. 4.2
Same Pixel Rate and High-Speed Realistic Shutter
In this case, the problem of having a shutter larger than the pixel read-out time is addressed. Current high-speed cameras can reach up to 10 Kfps or even more. The shutter or integration time of these cameras is in the order of tenths of microseconds. Let’s suppose a shutter time of 500 μs while keeping the pixel read-out at 520 ns to maintain the same pixel rate. As in the first experiment, the trajectory obtained with this SCD camera matches the real trajectory of the ball exactly. The only difference observed is the time delay between the calculated and real ball positions. This delay is about 650 μs and produces an error in the position estimation of roughly 2 or 3 pixels. 4.3
Less Pixel Rate and High-Speed Realistic Shutter
The presented experiments show that an SCD camera yield enough information to accurately track a fast moving object with the same bandwidth of a standard camera, while the standard camera is unable to track such trajectories and has a large delay. The SCD camera accuracy and delay are adequate for fast movement analysis. But, would it be possible to reduce the required pixel rate while still keeping the same accuracy? The last experiment (real shutter) has been repeated with a pixel read-out of 2.6 μs, which supposes one fifth of the pixel rate required by the standard camera and the preceding experiments.
On the Advantages of Asynchronous Pixel Reading
213
The trajectory calculation of this experiment is the same as the others, so there is no accuracy missing while reducing pixel rate up to one fifth (for lower pixel rates, an error in the trajectory estimation appears). The delay between real and calculated position is again the only parameter that differs. The delay in this case is around 1.3 ms and produces an error in the position estimation of roughly 5 pixels. 4.4
Discussion
A standard camera working at 25 fps proves useless for high-speed object tracking, as shown in Fig. 1; first, it is unable to estimate the object trajectory and, second, the time it takes for image capture and computation produces a high position error, since the object travels a long distance during that time. 12
10
Position error (pixels)
8
6
4
2
0
0
20
40
60
80 Time (ms)
100
120
140
160
Fig. 2. Position estimation error due to the computation delay for the three experiments
A camera working by Selective Change-Driven (SCD) strategy accurately calculates the object trajectory at any time, even using different shutter speeds and with equal or even lower pixel rates than a standard camera. The differences found, among different configurations, come from the delay between the real and calculated object positions, which produce an error of the object position estimation. Fig. 2 shows the position estimation error for the three experiments
214
F. Pardo et al.
throughout the experiment time-course. The bottom curve (around 1 pixel error) corresponds to the first experiment where an ideal unrealistic shutter has been used. The middle curve corresponds to a realistic implementation and is worse, though still useful, since the error position estimation is around 2-4 pixels. The top curve corresponds to the last experiment where the pixel rate has been dramatically reduced to one fifth, and the error still appears to be small for high-speed object tracking. The error variability over time is due to the nonconstant object speed (sinusoidal speed is faster around cero crossing and slower at valleys and mountains). Noise impact in performance has been also tested. The noise analysis is very important because this sensor must work with a high speed shutter, which is usually a significant source of noise. The experiments show that there is almost no impact in performance when standard noise is injected (48 dB). There is a significant change when processing a very noisy image sequence (24 dB), though results and performance are still good as shown in Fig. 1. Such a low quality image is not very common in most environments, even thinking of very short shutter times.
5
Conclusion
A new biologically inspired strategy for pixel read-out and processing has been presented. This strategy enables ultra-high speed movements to be analyzed while still keeping the same bandwidth requirements of conventional cameras. The simulated experiments have shown that it is possible to accurately measure movements even with a smaller pixel rate than other cameras. The hardware requirements for the implementation of an SCD camera are affordable with present CMOS technology. The low latency offered by the SCD acquisition strategy enable the implementation of many closed-loop control applications not addressed so far due to the large latency of current standard cameras. The small pixel rate and processing capabilities required also enable the use of this strategy in autonomous and embedded systems.
Acknowledgement This work has been supported by the project TEC2006-08130/MIC of the Spanish Ministerio de Educaci´ on y Ciencia and European Union FEDER.
References 1. Pardo, F., Dierickx, B., Scheffer, D.: Space-Variant Non-Orthogonal Structure CMOS Image Sensor Design. IEEE Journal of Solid State Circuits 33(6), 842–849 (1998) 2. Kameda, S., Yagi, T.: A silicon retina system that calculates direction of motion. In: Int. Symposium on Circuits and Systems, ISCAS 2003, Bangkok, Thailand, vol. 4, pp. 792–795 (2003)
On the Advantages of Asynchronous Pixel Reading
215
3. Culurciello, E., Etienne-Cummings, R.: A biomorphic digital image sensor. IEEE Journal of Solid State Circuits 38, 281–294 (2003) 4. Philipp, R.M., Etienne-Cummings, R.: Single-Chip Stereo Imager. Analog Integrated Circuits and Signal Processing 39, 237–250 (2004) 5. Higgins, C.M., Koch, C.: A Modular Multi-Chip Neuromorphic Architecture for Real-Time Visual Motion Processing. Analog Integrated Circuits and Signal Processing 24, 195–211 (2000) ¨ 6. Ozalevli, E., Higgins, C.M.: Reconfigurable Biologically Inspired Visual Motion Systems Using Modular Neuromorphic VLSI Chips. IEEE Transactions on Circuits and Systems 52, 79–92 (2005) 7. Pardo, F., Boluda, J.A., Benavent, X., Domingo, J., Sosa, J.C.: Circle detection and tracking speed-up based on change-driven image processing. In: International Conference on Graphics, Vision and Image Processing, ICGST 2005, Cairo, Egypt, pp. 131–136 (2005) 8. Boluda, J.A., Pardo, F.: Speeding-up differential motion detection algorithms using a change-driven data flow processing strategy. In: Kropatsch, W.G., Kampel, M., Hanbury, A. (eds.) CAIP 2007. LNCS, vol. 4673, pp. 77–84. Springer, Heidelberg (2007) 9. Oster, M., Liu, S.C.: A winner-take-all spiking network with spiking inputs. In: IEEE International Conference on Electronics, Circuits and Systems ICECS 2004, Tel Aviv, Israel, vol. 11, pp. 203–206 (2004) 10. Lichtsteiner, P., Delbruck, T., Kramer, J.: Improved on/off temporaly differentiating address-event imager. In: IEEE International Conference on Electronics, Circuits and Systems ICECS 2004, Tel Aviv, Israel, vol. 11, pp. 211–214 (2004) 11. Kramer, J.: An on/off transient imager with event-driven, asynchronous read-out. In: Int. Symposium on Circuits and Systems, ISCAS 2002, Phoenix, AZ, USA, pp. 165–168 (2002) 12. Aizawa, K., Egi, Y., Hamamoto, T., Hatori, M., Abe, M., Maruyama, H., Otake, H.: Computational image sensor for on sensor compression. IEEE Transactions on Electron Devices 44, 1724–1730 (1997) 13. Hamamoto, T., Ooi, R., Ohtsuka, Y., Aizawa, K.: Real-time image processing by using image compression sensor. In: International Conference on Image Processing, ICIP 1999, vol. 3, pp. 935–939 (1999) 14. Indiveri, G., Oswald, P., Kramer, J.: An adaptive visual tracking sensor with a hysteretic winner-take-all network. In: Int. Symposium on Circuits and Systems, ISCAS 2002, Phoenix, AZ, USA, pp. 324–327 (2002)
An Optimized Software-Based Implementation of a Census-Based Stereo Matching Algorithm Christian Zinner, Martin Humenberger, Kristian Ambrosch, and Wilfried Kubinger Austrian Research Centers GmbH – ARC Donau-City-Str. 1, 1220 Vienna, Austria {christian.zinner,martin.humenberger,kristian.ambrosch, wilfried.kubinger}@arcs.ac.at
Abstract. This paper presents S 3E, a software implementation of a high-quality dense stereo matching algorithm. The algorithm is based on a Census transform with a large mask size. The strength of the system lies in the flexibility in terms of image dimensions, disparity levels, and frame rates. The program runs on standard PC hardware utilizing various SSE instructions. We describe the performance optimization techniques that had a considerably high impact on the run-time performance. Compared to a generic version of the source code, a speedup factor of 112 could be achieved. On input images of 320×240 and a disparity range of 30, S 3E achieves 42fps on an Intel Core 2 Duo CPU running at 2GHz. Keywords: Stereo vision, Census-transform, real-time, performance optimization, SSE, OpenMP.
1
Introduction
Stereo vision is a well-known sensing technology that is already used in several applications. It becomes more and more interesting, e.g., in the area of domestic robotics, but also for industrial applications. Stereo matching means solving the correspondence problem between images from a pair of cameras mounted on a common baseline. It is possible to reconstruct world-coordinates for each pixel with a known correspondence by a triangulation process. In this paper we present the Smart Systems Stereo Engine (S 3E), a performance-optimized implementation of a stereo matching algorithm based on the Census transform. Stereo matching in general demands a high computational effort. Thus, practically all known software-based real-time stereo systems use matching approaches based on the sum of absolute differences (SAD) or the sum of squared differences (SSD) over a relatively small mask size of 3×3 up to 7×7 and/or a quite limited range of disparities of, e.g., 16. Census-based systems have shown significantly better matching results, especially when using larger mask
The research leading to these results has received funding from the European Community’s Sixth Framework Programme (FP6/2003-2006) under grant agreement # FP6-2006-IST-6-045350 (robots@home).
G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 216–227, 2008. c Springer-Verlag Berlin Heidelberg 2008
Optimized Software-Based Implementation of Census-Based Stereo Matching
217
sizes of up to 15×15 pixels. They are also more robust to real-world illumination conditions [1]. The drawback is that Census-based methods require operations that poorly match the typical instruction sets of general purpose CPUs. This is the reason why fast Census-based systems, such as [2], [3], [4] and [5], usually run on dedicated hardware such as FPGAs or ASICs. Therefore, we present a novel approach to realize such a system in form of a flexible PC-software module that is able to run on mobile PC platforms of frame rates significantly above 10 fps. We achieve this without using any graphic card hardware acceleration, but with intensive use of the Streaming SIMD Extensions (SSE) and the multi-core architectures of state-of-the-art PC CPUs. The remainder of this paper is outlined as follows. Section 2 gives an overview of related stereo vision systems with a focus on Census-based systems. Section 3 depicts the underlying algorithm and the properties of S 3E as a software module. The various ideas and methods applied for optimizing the run-time performance of the program are the main part of Sect. 4. In Sect. 5 the impact of the optimizations are summarized and some comments about the future direction of our work are added.
2
Related Work
The Census transform was introduced by Zabih and Woodfill [6] as a nonparametric local transform to be used as the basis for correlation. This method has shown to have various advantages when compared to other common correlation methods. Its implementation requires only simple logic elements which can be parallelized, and thus it is well suited for realization on FPGAs and dedicated ASICs. Therefore, most known implementations are of this kind. An early system from 1997 [2] is based on the PARTS engine which consists of 16 FPGAs. It processes QVGA stereo images with 24 disparities at 42fps. 10 years later, the authors of [4] presented a low-cost FPGA system with a 13×13 Census mask that is able to process 20 disparities on QVGA images at more than 150fps. They see their system as a kind of low-cost and high-performance descendant of the PARTS engine. The approach proposed in [3] consists of an ASIC that comprises both, a Census-based matching module and an SSD unit, and merges the matching results of both units. It processes 256×192 pixel images at more than 50fps with 25 disparities. The Tyzx DeepSea G2 Vision System [5] is a full-featured embedded stereo system that contains a dedicated correlation ASIC, an FPGA, a signal processor, and a general purpose CPU running Linux. It provides additional computing power for higher-level applications like tracking or navigation. Stereo matching performance reaches 200fps on 512×480 pixel images and 52 disparities. A recent Census-based real-time implementation in software is the MESVSII [7]. It runs on a DSP and focuses on miniaturization of the system. A remarkable frame rate of 20fps is achieved with 30 disparities, although at relatively small resolutions (160×120) and a small Census mask size (3×3).
218
3
C. Zinner et al.
Census-Based Stereo Matching
3.1
Algorithm Description
The workflow of the algorithm to be optimized is shown in Fig. 1.
Stereo Camera
L R
Rectification, Undistortion
Lrect Rrect
Census Trans. 16x16
Lcensus Rcensus
DSI Calculation (Hamming Dist.)
DSIRL DSILR
Camera Calibration Z-Image
Costs Aggregation
DSILR, Agg
dmthresh 3D Reconstruction
Confidence Thresholding
dmsub
Left/Right Consistency
dmsub, l dmsub, r
DSIRL, Agg
WTA + Subpixel Refinement
3D Point Cloud Disparity Map Confidence Map
Confidence Calculation
Fig. 1. Workflow of the implemented algorithm. L, R stand for left and right image and dm for disparity map.
In a first step two digital cameras capture a synchronized image pair. To correct the lens distortion the images are undistorted, and to fulfill the constraint of horizontal epipolar lines, rectified afterwards. The according stereo camera calibration is performed offline. Due to the rectification, the pixel rows of the input images are aligned in parallel to the baseline which is essential for efficient matching (registration). The first step of the registration is the Census transform of the left and right images. Each pixel is converted into a bit string (e.g., 11010110. . . ) representing the intensity relations between the center pixel and its neighborhood. Equation 1 and 2 give a formal description of the transform as introduced by [6] for a mask size of 16 × 16. 0 p 1 ≤ p2 ξ(p1 , p2 ) = (1) 1 p1 > p2 I(x, y) =
i=7,j=7
(ξ(I(x, y), I(x + i, y + j)))
(2)
i=−8,j=−8
The operator denotes a catenation. The Census transform converts the input stereo pair into images with 16×16 = 256 bits per pixel1 . After that, the 1
Using an asymmetric 16×16 mask and including the comparison result from the center pixel itself into the bit string gives advantages for an efficient implementation.
Optimized Software-Based Implementation of Census-Based Stereo Matching
219
main part of the matching, the DSI calculation, follows. DSI stands for disparity space image and is a three dimensional data structure which holds the costs of each pixel for each possible match. The DSI is computed from right to left and vice versa. The costs of a match between two Census transformed pixels is their hamming distance. It is assumed that neighboring pixels, except at disparity discontinuities, have a similar disparity level so a cost aggregation makes the matches more unique. The aggregation mask size is variable and can be chosen according to the application. The best matches are selected by winner takes all (WTA) and refined with subpixel accuracy. To filter occluded and bad matches, a left/right consistency check is performed. Additionally, a confidence map is calculated out of the WTA and left/right check results. The disparity values are thresholded according to the confidence map with an arbitrarily chosen threshold value, which results in the final disparity map. The last step is a 3D reconstruction, which delivers a Z image and a 3D point in the world coordinate system for each pixel with a valid disparity value. 3.2
Multi-purpose Software Module
The stereo matching algorithm is encapsulated in form of a C/C++ library. The module is intended to be a subsystem of machine vision systems and thus it will be used by developers or system integrators. In its current stage, S 3E is able to deliver following data as its output. – – – – –
A disparity map (sub-pixel accurate, 16 bit values, fixed point format) A Z-image (16 bits per pixel, fixed point format) A confidence map (a measure for the probability of a correct match; 8 bits) The input images after rectification (left + right) An image that contains the re-projected 3D-points (point cloud)
To enable the adaption of S 3E to application driven demands, we focused not only on the performance optimizations, but also on the flexibility of the system. Thus, the following parameters of the system can be adjusted. – – – – –
Input image size and format (8 bit, 16 bit, grayscale, RGB) Dimensions for resizing the input images in the preprocessing phase Aggregation mask size Disparity range (minimum disparity and number of disparities) Scaling exponent for the fixed point representation of computed disparity values with subpixel refinement – Confidence threshold value to adjust the filtering of unlikely matches – Further parameters to control the behavior of the Census transform and preand postprocessing filtering operations One of the advantages of a pure software implementation is the ability to provide such a level of flexibility, which makes it possible to run the S 3E with a large variety of sensor heads. It is our intention to use this software in several in-house projects and applications under a variety of different requirements.
220
4
C. Zinner et al.
Performance Optimization
Effective run-time performance optimization is a key requirement for a software based vision system to be competitive in terms of a cost-performance ratio. The following sections will, after some general explanations, highlight several measures that had the most impact on getting a high-performing stereo matching implementation. 4.1
Target Platform
The main target platform during the performance optimization was an Intel Mobile Core 2 Duo processor (model T7200) running at 2 GHz [8]. This CPU model is commonly used in notebook PCs, but also in industrial and so-called “embedded” PCs. We intentionally avoided to optimize our software only for high-end PCs with respect to a broader field of application in the industrial / mobile domain. The platform chosen supports a lot of Streaming SIMD Extensions (SSE, SSE2, SSE3) which were vital for achieving high performance. One important goal was to get an implementation of a stereo matching algorithm that provides both, capability of running on multiple platforms and delivering the best performance possible on the respective platform. We achieved this by extensively using special multi-platform high-performance imaging libraries. Currently, S 3E can be compiled for MS Windows and for Linux on a PC platform with special hardware-specific performance optimizations. We expect that an already scheduled migration of the software to the C6000 series of high-end Digital Signal Processors (DSPs) from Texas Instruments [9] will be possible with limited effort. S 3E and the libraries used can also be built in a generic mode on any ANSI-C compliant platform, but then with far slower performance. 4.2
Development Environment
The S 3E software has been implemented in C with the requirement to keep it portable among MS Windows, Linux and some DSP platforms. We used MSVC 8 on the Windows side and under Linux the GCC 4.3 compiler came to service. Both compilers can deal with basic OpenMP [10] directives which proved to be sufficient for our requirements. A common guideline of our software design process is to consequently encapsulate all computation intensive image processing functions into dedicated performance primitives libraries. The remaining part of the program is usually not performance critical and thus can be coded in a portable manner. We use the PfeLib [11] as the back-end for the performance critical functions, which in turn uses parts of the Intel Performance Primitives for Image Processing (ippIP) [12] whenever possible. Much of the performance optimizations described in the following sections were actually done in functions of the PfeLib. The library has an open architecture for adding new functions, migrating them to other platforms, and optimizing them for each platform. It also provides components for verification and high resolution performance timing on various platforms.
Optimized Software-Based Implementation of Census-Based Stereo Matching
4.3
221
The Starting Point
Before optimizing, we define a reference implementation as the starting point for the upcoming work. At this stage the program provides all features of the final version, except that most of the image processing is performed by “generic” code that lacks any platform-specific optimizations. Table 1 shows the timing of the non-optimized program when processing one stereo frame. The profiling data is structured into several program stages. For every stage it provides information about the size of the images processed, how many CPU cycles per pixel (cpp) this process required, how often this image processing function came to service and how many milliseconds the whole task took. It is clear that a processing time of ∼8.5 seconds is inapplicable for realtime stereo vision, but a priority list for an optimization strategy can be directly deduced from this table. Table 1. Performance and complexity figures of an initial non-optimized implementation with a 450×375 input image pair and a search range of 50 disparities
Function
Image dimensions
Hamming distance Census transform Cost aggregation WTA minimum search Lens undist. & rectification Aggregation: DSI lines shifting Disparity to 3D point cloud calc. Disparity to Z-image calc. Initialization stage LR/RL consistency check Confidence thresholding Complete program run
386 × 1 1136.3 435 × 360 2138.0 19996 × 1 20.1 382 × 50 11.5 450 × 375 204.5 386 × 50 0.7 333 × 356 55.9 450 × 375 27.8 450 × 375 17.4 382 × 1 10.5 450 × 375 7.9 450 × 375 100913.6
4.4
Cycles Function per pixel calls 36000 2 712 712 2 2880 1 1 1 356 1 -
Time (ms) 7894.94 334.81 143.28 78.5 34.51 19.75 3.32 2.34 1.47 0.71 0.66 8514.59
Fast Hamming Distances
Matching costs are obtained by calculating the hamming distance between two Census-transformed pixels. Simply spoken, the hamming distance can be gathered by “XORing” the two binary strings, and afterwards counting the number of set bits in the result. This task is the most expensive one of the whole stereo algorithm, because using a 16×16 Census mask results in quite long 256 bit strings. The required number of hamming distance calculations is nham ≈ 2 W HD ,
(3)
where W and H denote the width and height of the input image, and D is the number of disparities. The additional factor of two results from the need of doing an LR/RL consistency check. Another reason is, that general purpose CPUs
222
C. Zinner et al.
hardly provide dedicated instructions for bit-counting. This is also the case for SSE up to version 3. The instruction POPCNT for counting the set-bits in a 128 bit register will appear with SSE4, but it is not yet available on our target hardware. Starting from a quite cumbersome version that processed bit by bit in a loop—it took over 1100 CPU clock cycles for a bit string of 256 bits (Table 1)—some experiments using 64k lookup-tables reduced the run time to 126 cycles. The final method of choice was inspired from the BitMagic library [13]. Table 2 exemplifies the procedure on 8-bit words. It is easy to extend the method for processing 128 bit wide registers with SSE instructions. Finally, we achieved a speed of 64 cycles for calculating a 256-bit hamming distance on a single CPU core. Table 2. Hamming distance calculation scheme Pseudocode c=xor(a,b) d=and(c, 010101012 ) e=and(shr(c,1),010101012 ) f=add(d,e) g=and(f,001100112 ) h=and(shr(f,2),001100112 ) i=add(g,h) j=and(i,000011112 ) k=and(shr(i,4),000011112 ) l=add(j,k)
4.5
Data bits c7 c6 c5 c4 c3 c2 c1 c0 0 c6 0 c4 0 c2 0 c0 0 c7 0 c5 0 c3 0 c1 c6 + c7 c4 + c5 c2 + c3 c0 + c1 0 0 c4 + c5 0 0 c0 + c1 0 0 c6 + c7 0 0 c2 + c3 c4 + c5 + c6 + c7 c0 + c1 + c2 + c3 0 0 0 0 c0 + c1 + c2 + c3 0 0 0 0 c4 + c5 + c6 + c7 c0 + c1 + c2 + c3 + c4 + c5 + c6 + c7
Comment get differing bits 1-bit comb mask shift and mask add bits 2-bit comb mask shift and mask add 4-bit comb mask shift and mask hamming weight
Fast Census Transform
According to Equ. 2, a Census transform with a large mask requires many comparison operations between pixel values. SSE provides the mm cmplt epi8 intrinsic that compares 16 pairs of signed 8-bit values at once. As pixel values are unsigned, this instruction cannot be used directly. Since negative numbers are represented in the two’s complement system, it is a remedy to add a constant value of 128 to every 8-bit operand before the comparison. By this we deliberately produce arithmetic overflows, but the result of the signed comparison is the same as doing an unsigned comparison on the original values. Using the unsigned comparison operator ξ as defined in Equ. 1, and introducing ψ as a comparison operator for a signed data type, we can write this equivalence as ξ(p1 , p2 ) ≡ ψ(p1 ⊕ 128, p2 ⊕ 128) ,
(4)
where px are 8-bit words and ⊕ is an 8-bit addition with overflow. The described 16×16 Census transform produces strings 256 bits long for every pixel. Especially the computation costs for the hamming distances are, despite intensive SSE optimizations, still quite large. We recently discovered a way to cope with large mask sizes, which we call a “sparse Census transform”,
Optimized Software-Based Implementation of Census-Based Stereo Matching
223
where only every second pixel of every second row of the 16×16 mask is evaluated. This means that the bit string becomes significantly shorter, namely only 64 bits. Empiric evaluations revealed that the quality loss of the stereo matching is very small, so it was clear to us that using it is an excellent tradeoff between quality and speed. A more detailed analysis and quantification of the quality effects is ongoing work within our group. The initial implementation of the 16×16 Census took more than 2100 cpp, while the optimized SSE version runs at up to 135 cpp on a single CPU core. The sparse Census transform brought a further reduction down to 75 cpp. The hamming distance calculation also profits significantly from the reduction of the bit string length, it roughly takes only a quarter of the time compared to the value stated in Sect. 4.4. 4.6
Fast Aggregation
Aggregation operates on 16-bit disparity space images (DSI) where each pixel value represents the matching costs for a given input pixel at a certain disparity level. The aggregated cost value for a certain disparity of a pixel is the sum of cost values over a rectangular neighborhood of size {m, n} around the position {x, y} n
Cagg (x, y) =
m
2 2
Cx+i,y+j .
(5)
m −n 2 − 2
Due to the line-by-line processing of the input images, the program computes the DSI layer-by-layer. We call such a slice through the DSI a Disparity Space Layer (DSL). A DSL comprises the cost values for all disparities of an image line, thus it is a 2D-data structure, which in turn can be treated with image processing functions. It is necessary to keep the last n DSLs in memory, cf. Fig. 2(a). The problem is that the y-neighborhood of cost values is actually spread among different images, which makes it hard to use common optimized filter functions. We faced this by storing the last n DSLs into one single image frame buffer of n-fold height as shown in Fig. 2(b). Now it would be fine to have cost values with equal disparities, but from adjacent y-coordinates, on top of each other. This can be easily achieved by tweaking the image parameters describing width, height, and the number of bytes per line. We finally get a result as shown in Fig. 2(c), which is an image with only n pixels in height, but with D times the width. We notably achieved this without any movement of pixel data in memory. Now, aggregation means not more than applying a linear filter with all filter mask coefficients set to one and the common divisor also set to one. The filter kernel is indicated in its start position in Fig. 2(c). The result of the filtering can be transformed into an “ordinary” DSL in the reverse manner. The implementation of Table 1 used the general linear filter function of the ippIP and a relatively fast value of 20.1 cycles per pixel was achieved for a 5×5 aggregation mask from the beginning on. As a replacement we implemented a dedicated sum filter function with a fixed mask size with extensive use of SSE intrinsics. We achieved a single core peak performance of 4.2 cpp for a 5×5 filter.
224
C. Zinner et al. dmax
d dmax dmin
y
dmax
y-2 y-1 y
dmin dmax
x
(a) Conventional method: storing separate DSLs for each line
dmin
y-2
y
d=dmin
d=dmin+1
Ɣ
Ɣ
Ɣ
d=dmax
y-1 y
x x
(b) Storing the last n DSLs into a common frame buffer
x=0
x
x
x=W
(c) The same frame buffer with tweaked image description parameters. Common filtering functions are now applicable.
Fig. 2. Memory layout for efficient cost aggregation (example with a 3×3 mask)
4.7
Combined DSL for Left/Right Consistency Check
A left/right consistency check procedure usually implies that the DSL for a certain line of an image pair is calculated separately by horizontally shifting the right image line and matching it against the fixed left image line (RL), and also doing this the opposite way (LR). The resulting DSLs are plotted in Fig. 3(a) and Fig. 3(b). To get the disparities with the smallest costs dopt (x), the minimum values of each column have to be searched (winner takes all strategy, WTA). If the minima for a certain pixel are located at the same disparities in the LR-DSL as well as in the RL-DSL, the consistency check passed and the likelihood for having a correct match is high. It is obvious that many pairs of pixels are actually matched twice against each other during this procedure, so we analyzed if an implementation could be improved. The lower part of Fig. 3 points out our approach. When the RL-DSL is skewed as shown in Fig. 3(c), it can be overlayed to the LR-DSL without loss of information. The resulting data structure is shown in Fig. 3(d), which takes almost only half of the computations and memory, compared to the two separate DSLs. The only difference is that for searching the best LR-matches the WTA must operate in a diagonal direction rather than vertically. This technique allows us to reduce the memory usage and calculation effort for the hamming distances as well as for the cost aggregation by almost one half, so the factor of two in Equ. 3 has actually disappeared. 4.8
4.8 Using Multiple Cores with OpenMP
Due to the line-by-line processing of major parts of the stereo matching algorithm, it is quite easy to partition the workload among several threads running on different CPU cores. We parallelized S³E for a dual-core CPU, which could be accomplished with little effort and reduced the calculation time from 143.8 ms to 75.9 ms. This corresponds to a speedup factor of almost 1.9.
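A minimal sketch of this OpenMP partitioning is shown below, assuming a hypothetical per-line processing routine; the real pipeline obviously has more structure (shared DSL buffers, border handling) than this.

```cpp
#include <omp.h>
#include <vector>

// Hypothetical per-line stage: census transform, Hamming costs, aggregation
// and WTA for image line y. The body is a placeholder in this sketch.
static void processLine(int y, std::vector<float>& disparities) {
    (void)y; (void)disparities;   // real work omitted
}

void matchAllLines(int height, std::vector<float>& disparities) {
    // Lines are largely independent, so the loop is split statically
    // between the available cores (two on the Core 2 Duo used here).
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < height; ++y)
        processLine(y, disparities);
}
```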
[Figure: (a) separately calculated RL-DSL; (b) separately calculated LR-DSL; (c) skewed RL-DSL; (d) union of the LR-DSL and the skewed RL-DSL.]
Fig. 3. Evolving a combined DSL. The example illustrates the matching of a single 10 pixel wide image line for dmin = 0 and dmax = 2. The entries u/v stand for the matching costs of the u-th pixel from the left line against the v-th pixel from the right line.
5 Summary and Future Work
The performance figures of the final implementation are presented in Table 3. We were able to speed up the system from 8.5 s to 75.9 ms per frame for input image dimensions of 450×375 and 50 disparities on a Core 2 Duo at 2GHz. This results in a frame rate of 13 fps.
Table 3. Performance of the final, optimized implementation on a Core 2 Duo CPU at 2GHz. Image dimensions are 450×375 and the disparity search range is 50. Optimization speedup factors are derived from a comparison against Table 1.
Function | Image dimensions | Cycles per pixel | Function calls | Time (ms) | Speedup factor
Hamming distance | 435×50 | 7.2 | 364 | 28.53 | 276.7
WTA minimum search | 431×50 | 2.9 | 712 | 22.07 | 3.6
Cost aggregation | 22396×1 | 2.7 | 356 | 10.82 | 13.2
Census transform | 435×360 | 38.2 | 2 | 5.98 | 56.0
Lens undistort. + rectification | 450×375 | 24.8 | 2 | 4.18 | 8.3
Disparity to Z-image calc. | 450×375 | 17.2 | 1 | 1.45 | 1.6
Disparity to 3D point cloud calc. | 333×356 | 22.6 | 1 | 1.34 | 2.5
LR/RL consistency check | 431×1 | 9.3 | 356 | 0.72 | 1.0
Confidence thresholding | 450×375 | 6.0 | 1 | 0.51 | 1.3
Initialization stage | 450×375 | 2.3 | 1 | 0.2 | 7.4
Complete program run | 450×375 | 899.9 | - | 75.93 | 112.1
The run-time per frame of S³E depends on the input image dimensions and on the size of the disparity search range.
[Figure: frame rate (fps) of S³E over input image dimensions from 240×180 to 800×600 for disparity ranges d = 15, 30, 50, 80, and 120.]
Fig. 4. Frame rates of S³E achieved on a 2GHz Core 2 Duo CPU at various image dimensions and disparity ranges
If the same scene is to be sensed with the stereo vision system at a higher input resolution, it is also necessary to raise the number of disparities in order to keep the depth range of perception constant. In this case we can express the behavior of the run-time per frame t_f with respect to the input resolution r as t_f ∈ O(r^3). The results of test runs over a variety of image dimensions and disparity ranges are depicted in Fig. 4. For example, on QVGA input images (320×240) with a disparity range of 30, S³E achieves 42 fps on an Intel Core 2 Duo at 2GHz. The current implementation still leaves some potential for further optimization. For instance, the WTA minimum search, which is now the second most costly function in Table 3, is not heavily optimized yet. Another option is making the Hamming distance calculations faster on CPUs that provide the POPCNT instruction. Using the Intel C/C++ compilers, which are available for Windows and Linux, will probably yield a further speedup. Slight modifications will be necessary to let the program use more than two cores efficiently. A major topic will be the planned migration of the software to an embedded DSP platform. We expect this to be possible with little additional work because of the extensive use of PfeLib, which is multi-platform capable and already contains many optimized functions for C64x DSPs [14].
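As an illustration of the POPCNT idea, the Hamming distance between two census descriptors can be written as below. The 32-bit descriptor width and the use of the GCC/Clang __builtin_popcount intrinsic are our assumptions for the sketch, not details of the actual implementation; with POPCNT-capable CPUs and the appropriate compiler flags, the intrinsic maps to a single instruction.

```cpp
#include <cstdint>

// Hamming distance between two census-transformed pixels (bit strings).
// The XOR marks all differing bits; the population count counts them.
static inline unsigned hammingDistance(uint32_t censusL, uint32_t censusR) {
    return static_cast<unsigned>(__builtin_popcount(censusL ^ censusR));
}
```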
References
1. Cyganek, B.: Comparison of Nonparametric Transformations and Bit Vector Matching for Stereo Correlation. In: Klette, R., Žunić, J. (eds.) IWCIA 2004. LNCS, vol. 3322, pp. 534–547. Springer, Heidelberg (2004)
2. Woodfill, J.I., Von Herzen, B.: Real-time stereo vision on the PARTS reconfigurable computer. In: Proceedings of the 5th IEEE Symposium on FPGAs for Custom Computing Machines (1997)
3. Kuhn, M., Moser, S., Isler, O., Gurkaynak, F.K., Burg, A., Felber, N., Kaeslin, H., Fichtner, W.: Efficient ASIC Implementation of a Real-Time Depth Mapping Stereo Vision System. In: Proceedings of the 46th IEEE International Midwest Symposium on Circuits and Systems (2004)
4. Murphy, C., Lindquist, D., Rynning, A.M., Cecil, T., Leavitt, S., Chang, M.L.: Low-Cost Stereo Vision on an FPGA. In: Proceedings of the 15th IEEE Symposium on FPGAs for Custom Computing Machines (2007)
5. Woodfill, J.I., Gordon, G., Jurasek, D., Brown, T., Buck, R.: The Tyzx DeepSea G2 Vision System, A Taskable, Embedded Stereo Camera. In: Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition - Workshops (2006)
6. Zabih, R., Woodfill, J.I.: Non-parametric Local Transforms for Computing Visual Correspondence. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 801, pp. 151–158. Springer, Heidelberg (1994)
7. Khaleghi, B., Ahuja, S., Wu, Q.M.J.: An Improved Real-Time Miniaturized Embedded Stereo Vision System (MESVS-II). In: Proceedings of the 2008 Conference on Computer Vision and Pattern Recognition - Workshops (2008)
8. Intel Corporation: Intel Core2 Duo Processors and Intel Core2 Extreme Processors for Platforms Based on Mobile Intel 965 Express Chipset Family, Document Number: 316745-005 (January 2008)
9. Texas Instruments: TMS320C6414T, TMS320C6415T, TMS320C6416T Fixed-Point Digital Signal Processors. Lit. Number: SPRS226K, http://www.ti.com
10. OpenMP Architecture Review Board: OpenMP Application Program Interface (May 2008), http://openmp.org
11. Zinner, C., Kubinger, W., Isaacs, R.: PfeLib: A Performance Primitives Library for Embedded Vision. EURASIP J. on Embed. Syst. 2007(1), 14 pages (2007)
12. Intel Corporation: Intel Integrated Performance Primitives for Intel Architecture. Document Number: A70805-021US (2007)
13. Kuznetsov, A.: BitMagic Library: Document about SSE2 Optimization (July 2008), http://bmagic.sourceforge.net/bmsse2opt.html
14. Zinner, C., Kubinger, W.: ROS-DMA: A DMA Double Buffering Method for Embedded Image Processing with Resource Optimized Slicing. In: RTAS 2006: Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 361–372 (2006)
Mutual Information Based Semi-Global Stereo Matching on the GPU
Ines Ernst¹ and Heiko Hirschmüller²
¹ Department of Optical Information Systems, Institute of Robotics and Mechatronics, German Aerospace Center (DLR), Berlin, Germany, [email protected]
² Department of Robotic Systems, Institute of Robotics and Mechatronics, German Aerospace Center (DLR), Oberpfaffenhofen, Germany, [email protected]
Abstract. Real-time stereo matching is necessary for many practical applications, including robotics. There are already many real-time stereo systems, but they typically use local approaches that cause object boundaries to be blurred and small objects to be removed. We have selected the Semi-Global Matching (SGM) method for implementation on graphics hardware, because it can compete with the currently best global stereo methods. At the same time, it is much more efficient than most other methods that produce a similar quality. In contrast to previous work, we have fully implemented SGM including matching with mutual information, which is partly responsible for the high quality of disparity images. Our implementation reaches 4.2 fps on a GeForce 8800 ULTRA with images of 640 × 480 pixel size and 128 pixel disparity range and 13 fps on images of 320 × 240 pixel size and 64 pixel disparity range.
1 Introduction
Fast stereo matching is necessary for many practical real-time applications, including robotics. Often, it is not necessary to perform stereo matching at the video frame-rate. Instead, processing several frames a second in a VGA-like resolution can be sufficient. Commercial real-time stereo systems are either based on special hardware that delivers disparity images at frame-rate [1,2] or are available as a pure software development kit [3]. These solutions are all based on local approaches that perform correlation of rectangular windows, followed by a winner-takes-all disparity selection. Correlation methods are fast, but they are known to blur object boundaries and remove small structures [4]. More elaborate methods require higher computational resources, which are available on graphics cards. Modern graphics processing units (GPUs) are usable as high-speed coprocessors for general purpose computational tasks. For several years already, high-end graphics processors have been supporting high performance applications through dedicated programmable vertex and fragment processors [5]. However, programs on these GPUs were limited to the capabilities of the specialized hardware.
With only a few exceptions [6], the existing graphics APIs required the transformation of computationally intensive core algorithms into rendering tasks. In 2007, NVIDIA introduced a new generation of GPUs with the G8 series [5]. The Compute Unified Device Architecture (CUDA) [7] combines a new hardware concept (built around just one type of programmable processor) with a new and more flexible programming model. CUDA provides a C-like abstraction layer for implementing general purpose applications on GPUs without substantial knowledge of the underlying hardware or graphics concepts. CUDA cards overcome limitations of earlier hardware such as unsupported data scattering on the fragment shader level, memory read- or write-only properties, and missing integer arithmetic. The CUDA concept is continued with the introduction of the G9, G200 and TESLA series with ever growing numbers of unified processors and amounts of on-board memory (developments of other GPU vendors are not reviewed within this article). Nevertheless, all new GPUs continue to support the widely used graphics APIs OpenGL and DirectX for their original task, i.e. very fast graphics rendering.
2 Previous Work
There are real-time GPU implementations of local stereo methods that try to increase accuracy by different aggregation methods [8]. Others join multiple resolutions [9] or windows [10]. Plane sweep stereo methods appear particularly well suited for GPU implementations and reach high frame rates [11,12,13]. None of the stereo methods above are included in the Middlebury online evaluation [14]. In fact, almost all methods of this comparison are global stereo approaches that perform pixelwise matching, controlled by an energy function that connects all pixels of the image with each other. Global methods are more accurate than local methods [15], but their internal complexity is typically much higher than the complexity of local methods. Therefore, their run-time is often several orders of magnitude higher than that of local methods, which makes them unsuitable for real-time applications. One exception is dynamic programming solutions, which are global in one dimension. Gong and Yang [16] reached real-time performance with a two-pass dynamic programming method implemented on the GPU. The average error (i.e. the average of the errors at non-occluded, all and discontinuity pixel areas over four datasets) is 10.7% according to the Middlebury evaluation. Another exception is the Semi-Global Matching (SGM) method [17], which combines several one-dimensional optimizations from all directions. Its complexity is O(width × height × disparity range), like local methods, which results in efficient computations. On the other hand, its accuracy is comparable to that of global methods. The Middlebury evaluation rates the original method with 7.5%, while the consistent variant for structured environments has an average error of just 5.8%. This is not far from the average error of 4.2% of the currently best method in this evaluation. When looking at the run-times, even the CPU implementation of SGM is several orders of magnitude faster than
most other methods of the evaluation. Furthermore, the algorithm has a regular structure and the basic operations are very simple, which allows an efficient GPU implementation. Rosenberg et al. [18] implemented the core part of SGM on a NVIDIA 7900 GTX using Cg. Their implementation includes matching with absolute differences, left/right consistency checking and hole filling. The implementation reached 8 fps on images of 320 × 240 pixel size with 64 pixel disparity range. Gibson and Marques [19] used CUDA on a NVIDIA Quadro FX5600 (according to personal communication with the author). In their approach, matching is implemented with the sampling-insensitive absolute difference [20] and the smoothness penalty is adapted to the intensity gradient, but no left/right consistency checking is done. The implementation reaches 5.9 fps on images of size 450 × 375 pixels with 64 pixel disparity range. It has been found that matching with mutual information performs better than absolute differences, even on images with no apparent radiometric differences [21]. Furthermore, the hierarchical computation of mutual information (HMI) gives the same result as the iterative computation [17]. Therefore, in contrast to previous work, we implemented the full SGM method on the GPU, including hierarchically calculated mutual information, an intensity gradient sensitive smoothness penalty, left/right consistency checking and sub-pixel interpolation. The SGM algorithm is reviewed in Section 3. Its implementation on the GPU is explained in Section 4. The quality and speed of the implementation are evaluated and compared to previous implementations in Section 5. Section 6 concludes the paper.
3 The Semi-Global Matching Algorithm
We assume a rectified, binocular stereo pair as input and refer to the 8 bit intensity values of the left and right image as I_L and I_R. The following sections describe the individual steps of the method from an algorithmic point of view, which is visualized in Figure 1. The interested reader is referred to the original publications [17] for the derivation and justification of these steps.
3.1 Pixelwise Matching Costs Using Mutual Information
The cost for matching two pixels is derived from mutual information (MI). It is computed from the joint entropy of correspondences of both images (H_{I_L,I_R}) and the entropies of the left (H_{I_L}) and right image (H_{I_R}) as
$MI_{I_L,I_R} = H_{I_L} + H_{I_R} - H_{I_L,I_R}.$   (1)
The joint entropy, which is defined over $P \log(P)$, is computed by Taylor expansion, according to Kim et al. [22], as a sum of data terms h_{I_L,I_R} over all pixels p and their correspondences q:
$H_{I_L,I_R} = \sum_{\mathbf{p}} h_{I_L,I_R}(I_{L\mathbf{p}}, I_{R\mathbf{q}}).$   (2)
Fig. 1. Flowchart of the Semi-Global Matching method
The data term is computed from the probability distribution P and convolution with a Gaussian kernel g(i,k) for Parzen estimation by
$h_{I_L,I_R}(i,k) = -\frac{1}{n} \log\left(P_{I_L,I_R}(i,k) \otimes g(i,k)\right) \otimes g(i,k).$   (3)
The probability distribution is calculated from the histogram of corresponding intensities of both images. This requires an initial guess of correspondences, i.e. an initial disparity image D_init. Section 3.5 explains where this initial disparity image comes from. Thus, D_init is used for collecting corresponding intensities from I_L and I_R, ignoring occlusions. Dividing all histogram entries by the number n of correspondences results in the joint probability distribution P_{I_L,I_R}, which is a table of 256×256 entries for 8 bit images. Thereafter, equation (3) is used for computing the table h_{I_L,I_R} of data elements. The probability distributions P_{I_L} and P_{I_R} can be computed from P_{I_L,I_R} by simply summing over all lines or columns of P_{I_L,I_R}. The data terms h_{I_L} and h_{I_R} are computed similarly to (3), except that they are one-dimensional arrays instead of a two-dimensional table. Finally, all data terms are summed according to (1) to obtain the table of matching costs
$mi_{I_L,I_R}(i,k) = h_{I_L}(i) + h_{I_R}(k) - h_{I_L,I_R}(i,k).$   (4)
It can be seen that summing over this table with all pixel correspondences results in the mutual information, which is to be minimized. For SGM, the table mi is used for computing the cost for matching pixel p with the corresponding pixel at disparity d:
$C(\mathbf{p}, d) = mi_{I_L,I_R}(I_L(\mathbf{p}), I_R(\mathbf{p} - [d, 0]^T)).$   (5)
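The construction of the MI matching-cost table from an initial disparity image can be sketched on the CPU as follows. This is a simplified reference that omits the Gaussian (Parzen) smoothing of (3) and clamps zero-probability entries, and all names are ours rather than the authors' GPU code.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

using Table = std::vector<double>;                    // 256 x 256, row-major

Table miTable(const std::vector<uint8_t>& IL,
              const std::vector<uint8_t>& IR,
              const std::vector<int>& Dinit,          // disparity per pixel, <0 = invalid
              int width, int height)
{
    Table P(256 * 256, 0.0);
    double n = 0.0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            int d = Dinit[y * width + x];
            if (d < 0 || x - d < 0) continue;         // skip invalid / out-of-range
            int i = IL[y * width + x];
            int k = IR[y * width + (x - d)];
            P[i * 256 + k] += 1.0;                    // joint histogram
            n += 1.0;
        }
    for (double& v : P) v /= n;                       // joint probability P_{IL,IR}

    std::array<double, 256> PL{}, PR{};               // marginals by row/column sums
    for (int i = 0; i < 256; ++i)
        for (int k = 0; k < 256; ++k) {
            PL[i] += P[i * 256 + k];
            PR[k] += P[i * 256 + k];
        }

    auto h = [&](double p) {                          // data term -1/n * log(p), eq. (3)
        return p > 0.0 ? -std::log(p) / n : 0.0;      // without Parzen smoothing
    };

    Table mi(256 * 256, 0.0);                         // eq. (4): hL + hR - hLR
    for (int i = 0; i < 256; ++i)
        for (int k = 0; k < 256; ++k)
            mi[i * 256 + k] = h(PL[i]) + h(PR[k]) - h(P[i * 256 + k]);
    return mi;                                        // C(p,d) = mi[IL(p)][IR(p - d)], eq. (5)
}
```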
3.2 Aggregation of Pixelwise Matching Costs
The SGM method approximates the minimization of the energy
$E(D) = \sum_{\mathbf{p}} \Big( C(\mathbf{p}, D_{\mathbf{p}}) + \sum_{\mathbf{q} \in N_{\mathbf{p}}} P_1 T[|D_{\mathbf{p}} - D_{\mathbf{q}}| = 1] + \sum_{\mathbf{q} \in N_{\mathbf{p}}} P_2 T[|D_{\mathbf{p}} - D_{\mathbf{q}}| > 1] \Big).$   (6)
The first term sums the pixelwise matching costs for all pixels. The second term penalizes small discontinuities with a penalty P_1, while the third term penalizes all larger discontinuities with a penalty P_2. The approximation is implemented by summing pathwise costs according to (6) into a cost volume. This process can be seen as aggregation. The pixelwise matching costs are aggregated into a cost volume S(p, d) by going in 8 directions r through all pixels of the image I_L. The directions are defined as [1, 0]^T, [1, 1]^T, [0, 1]^T, [-1, 1]^T and so on. The first pixels of each path (i.e. the pixels at the image border) are defined by the pixelwise matching cost as L_r(p, d) = C(p, d). The costs at all further pixels in the path direction r are computed according to (6) by
$L_r(\mathbf{p}, d) = C(\mathbf{p}, d) + \min\big(L_r(\mathbf{p}-\mathbf{r}, d),\ L_r(\mathbf{p}-\mathbf{r}, d-1) + P_1,\ L_r(\mathbf{p}-\mathbf{r}, d+1) + P_1,\ \min_i L_r(\mathbf{p}-\mathbf{r}, i) + P_2\big) - \min_i L_r(\mathbf{p}-\mathbf{r}, i).$   (7)
The value of P_1 is a constant, while P_2 is adapted to the intensity gradient along the path by P_2 = P_2' / |I_bp - I_b(p-r)|, with P_2' as a constant. The values of L_r(p, d) are added to S(p, d) for all disparities d at each pixel p. Additionally, they are kept as the previous values for the next step along the path in direction r.
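A CPU sketch of the recursion (7) along one path may help to make this concrete; the buffer layout, the precomputed per-step P2 values and the names are illustrative assumptions, not the authors' shader code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One aggregation path along a single direction r, following (7).
// C holds the pixelwise costs of the path pixels (numPix x D, row-major),
// S is the accumulator the path costs are added to, P2 is the already
// gradient-adapted penalty for each step along the path.
void aggregatePath(const std::vector<uint16_t>& C,
                   std::vector<uint32_t>& S,
                   const std::vector<uint16_t>& P2,
                   uint16_t P1, int numPix, int D)
{
    std::vector<uint32_t> Lprev(C.begin(), C.begin() + D);   // first pixel: L_r = C
    for (int d = 0; d < D; ++d) S[d] += Lprev[d];

    std::vector<uint32_t> Lcur(D);
    for (int p = 1; p < numPix; ++p) {
        uint32_t minPrev = *std::min_element(Lprev.begin(), Lprev.end());
        for (int d = 0; d < D; ++d) {
            uint32_t best = Lprev[d];
            if (d > 0)     best = std::min(best, Lprev[d - 1] + P1);
            if (d < D - 1) best = std::min(best, Lprev[d + 1] + P1);
            best = std::min(best, minPrev + P2[p]);
            // subtracting minPrev keeps L_r bounded, as in (7)
            Lcur[d] = C[p * D + d] + best - minPrev;
            S[p * D + d] += Lcur[d];
        }
        Lprev.swap(Lcur);
    }
}
```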
3.3 Disparity Selection
The disparity at each pixel is selected as the index of the minimum cost
$D_L(\mathbf{p}) = \mathrm{argmin}_d\, S(\mathbf{p}, d).$   (8)
Sub-pixel estimation is implemented by fitting a parabola through the neighboring costs,
$D_{L\mathbf{p}}^{sub} = D_{L\mathbf{p}} + \frac{S(\mathbf{p}, D_{L\mathbf{p}}-1) - S(\mathbf{p}, D_{L\mathbf{p}}+1)}{2S(\mathbf{p}, D_{L\mathbf{p}}-1) - 4S(\mathbf{p}, D_{L\mathbf{p}}) + 2S(\mathbf{p}, D_{L\mathbf{p}}+1)}.$   (9)
This parabolic fitting is used as an approximation in the absence of a theoretically derived sub-pixel interpolation function for a complex matching cost like MI. However, it has been found that this choice delivers good results.
3.4 Post Filtering and Consistency Checking
The disparities that correspond to the right image are derived from the same cost volume S by a diagonal search for the minimum, i.e.
$D_R(\mathbf{p}) = \mathrm{argmin}_d\, S(\mathbf{p} + [d, 0]^T, d).$   (10)
Both disparity images are filtered with a 3×3 median for removing outliers. The result is used for a left/right consistency check: if |D_L(p) - D_R(q)| > 1 with q = p - [d, 0]^T, then the value at D_L(p) is set to invalid.
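The disparity selection of (8), the sub-pixel refinement of (9) and the consistency check can be sketched as follows for one image row on the CPU; the cost layout and names are assumptions for illustration only.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

// Aggregated costs of one image row: W pixels, D disparities, row-major per pixel.
struct Costs {
    int W, D;
    std::vector<uint32_t> S;                          // W * D entries
    uint32_t at(int x, int d) const { return S[x * D + d]; }
};

// WTA of (8) plus parabolic sub-pixel refinement of (9).
float selectWithSubpixel(const Costs& c, int x) {
    int d = 0;
    for (int i = 1; i < c.D; ++i)
        if (c.at(x, i) < c.at(x, d)) d = i;
    if (d == 0 || d == c.D - 1) return static_cast<float>(d);
    float sm = static_cast<float>(c.at(x, d - 1));
    float s0 = static_cast<float>(c.at(x, d));
    float sp = static_cast<float>(c.at(x, d + 1));
    float denom = 2.0f * (sm - 2.0f * s0 + sp);
    return denom != 0.0f ? d + (sm - sp) / denom : static_cast<float>(d);
}

// Right-image disparity via the diagonal search of (10), then the check
// |D_L(p) - D_R(q)| <= 1 with q = p - [d, 0]^T.
bool passesLRCheck(const Costs& c, int x, int dL) {
    int xr = x - dL;                                  // matched right pixel
    if (xr < 0) return false;
    int dR = 0;
    uint32_t best = c.at(xr, 0);
    for (int d = 1; d < c.D && xr + d < c.W; ++d)
        if (c.at(xr + d, d) < best) { best = c.at(xr + d, d); dR = d; }
    return std::abs(dL - dR) <= 1;
}
```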
3.5 Hierarchical Computation of MI
As discussed earlier, the computation of the matching cost table (4) requires an initial disparity image D_init. It has been found that MI can be computed hierarchically, by starting with images that are downscaled by a factor f that is a power of two (e.g. f = 16), which results in I_L^f and I_R^f. The initial disparity is set to random values. The random sampling within the disparity range is sufficient for computing an initial matching table that is used for matching I_L^f and I_R^f, which results in the disparity image D_L^f. The disparity image is upscaled by simple interpolation to D_L^{f-1} and used as the initial disparity image for matching I_L^{f-1} and I_R^{f-1}. The process is repeated until f = 1. The hierarchical computation reduces the run-time of an otherwise iterative computation of MI that would require several runs at full resolution. It is important that the hierarchical computation is only used for refining the initial disparity for the MI computation and not for reducing the disparity search range, as this could easily lead to losing small objects. It has been found that the matching quality of the hierarchical computation of MI equals that of the iterative computation [17].
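The control flow of the hierarchical computation can be sketched as follows; all helper functions are placeholders (prototypes only), and the exact handling of disparity scaling between levels is our assumption rather than a detail given in the paper.

```cpp
#include <vector>

struct Image   { int w = 0, h = 0; std::vector<unsigned char> px; };
struct DispMap { int w = 0, h = 0; std::vector<float> d; };

// Placeholders for the steps described above, not functions of the system.
Image   downscale(const Image& img, int factor);
DispMap upscaleByTwo(const DispMap& d);      // doubles resolution and disparity values
DispMap randomDisparity(int w, int h, int dispRange);
std::vector<double> buildMiTable(const Image& L, const Image& R, const DispMap& init);
DispMap sgmMatch(const Image& L, const Image& R,
                 const std::vector<double>& miTable, int dispRange);

DispMap hierarchicalSGM(const Image& L, const Image& R, int dispRange, int levels)
{
    // Start at the coarsest level with a random initial disparity image.
    int f = 1 << (levels - 1);                       // e.g. 16 for 5 levels
    Image Lc = downscale(L, f);
    DispMap D = randomDisparity(Lc.w, Lc.h, dispRange / f > 0 ? dispRange / f : 1);

    for (int level = levels - 1; level >= 0; --level) {
        f = 1 << level;
        Image Lf = downscale(L, f);
        Image Rf = downscale(R, f);
        // Only the MI initialisation is refined; the full (scaled) disparity
        // range is searched at every level.
        std::vector<double> mi = buildMiTable(Lf, Rf, D);
        int range = dispRange / f > 0 ? dispRange / f : 1;
        D = sgmMatch(Lf, Rf, mi, range);
        if (level > 0) D = upscaleByTwo(D);          // initial guess for next level
    }
    return D;                                        // full-resolution result
}
```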
4 GPU Implementation
We started our work on the G7 GPU architecture, for which the OpenGL/Cg [23,24] programming technique is available. This decision was supported by the fact that important parts of the algorithm map very well to graphics concepts and the OpenGL drivers offer highly optimized parallelization and scheduling mechanisms for this class of computations. Also, using OpenGL/Cg rather than CUDA allows using older-generation GPUs (e.g. G7). Our current implementation of SGM with HMI is based on this proven technique, but it is also intended as a reference for a future migration to CUDA, which will overcome the OpenGL/Cg limitations described in Section 1. Our implementation is based on three render buffers of a frame buffer object. These RGBA buffers with a 16-bit float data type (i.e. 1 sign bit, 5 exponent bits, and 10 mantissa bits) are used in a ping-pong technique for keeping the data, while the arithmetic operations are done in 32-bit float precision. Almost all work is carried out through the execution of OpenGL rendering commands and several specialized fragment shader programs. For the MI matching table calculation, a vertex shader program is used in addition.
4.1 GPU Implementation of SGM
The priority objective for designing the memory buffer partitioning is to provide the input data in a form that minimizes the number of memory accesses in all computations. Generally it is worth keeping in mind that GPUs are principally designed and optimized for fast 3D rendering of textured objects. Calculations should be done on four values synchronously with respect to the super-scalar
architecture of the G7 (and earlier) GPUs and the associated data structures in OpenGL/Cg. Initially, the left and right original images I_L and I_R are considered as textures and loaded with all necessary levels of detail into one color channel of a render buffer. A second channel holds 180 degree rotated versions of the images. According to Section 3.2, a cost volume S is calculated. S is mapped to a render buffer as a sequence of width/2 rectangles, each of dimension disparityrange × height/2. Every color channel of S contains cost values belonging to a quarter of the original image I_L. S stays resident in the render buffer until all path costs for all image pixels have been calculated. The path costs for one pixel in one path direction depend only on the path costs of the predecessor pixel in this direction, not on the costs of the neighboring pixels perpendicular to the path. Therefore all path costs, e.g. in the horizontal direction, for an entire image column can be calculated in parallel. The calculation of path costs for vertical or diagonal directions r is done analogously, respecting the corresponding predecessor dependencies. The computation of four path directions in the four color channels in parallel produces all L_r values by rendering width rectangles of size disparityrange × height and height rectangles of size disparityrange × width. Experiments showed that the most efficient way to get the values of the cost volume C for the L_r calculation is to calculate them on the fly rather than storing a separate three-dimensional array C. The values of the adaptive penalty P_2 can either be determined during the calculation of L_r or be pre-calculated and stored in a render buffer. The latter option is slightly faster in practice. All L_r values depend not only on their predecessor values, according to (7), but also on the minima of the path costs over all disparities for the previous pixel in the current path. Prior to the calculation for the next pixel row and column, some of the L_r channels are shifted in order to optimize the texture fetch in the next stage. For finding the minima of the L_r values over all disparities, a composite procedure of some comparisons in the fragment shader and application of the OpenGL blending equation GL_MIN turned out to be the fastest. When the path costs for one column and one row are available, they are added to the cost volume S. While the L_r values for one image column can be added by rendering only one rectangle, the values for an image row have to be assigned to S in lines. Care must be taken w.r.t. the range of the L_r values because they are accumulated into S through the GPU blending unit with the blending equation GL_FUNC_ADD, and the sum is always clamped to the range [0, 1]. The next step is the disparity selection as described in Section 3.3. The SGM algorithm requires not only computing the minimum values of the accumulated path costs in S but also determining the indices of these minimal path costs per pixel, which represent the disparity values. Therefore it is not possible to use the GPU blending unit with the blending equation GL_MIN. We implemented a reduce method, which finds both the values and the positions of the minima for all rows of the rectangles representing S by rendering only a
few polygons. In the first reduce step, explicit indices are generated and stored together with the values. After the last reduce step, the remaining columns of the rectangles of S contain the minimal path costs and the corresponding raw disparity columns for the quarters of I_L. The sub-pixel interpolation can be integrated into the first reduce step with marginal computational effort. The cost volume S has to be kept for the diagonal search described in Section 3.4 for deriving a right disparity image. While all subsequent reduce steps are equivalent to the corresponding ones of the left disparity image, the first reduce pass uses a special address function in the fragment shader. This function handles the mapping of the search diagonals to the memory structure of S as described above. In order to work around unexpected results of floating point modulo operations [25] we use a hand-written modulo function. When both disparity images have been derived, they are filtered by a 3 × 3 median and the thresholding as described in Section 3.3 is applied for obtaining the final valid disparity image.
4.2 Implementation of the MI Cost Table on the GPU
In order to avoid unnecessary data transfer to and from the CPU, the MI matching table mi derived in Section 3.1 is calculated on the GPU, too. Here the main challenge is the calculation of the joint probability distribution P_{I_L,I_R}. This task requires free data scattering and is not possible on the fragment shader level under OpenGL/Cg. However, if all necessary features of OpenGL 2.1 are supported by the GPU, a vertex shader program is able to calculate P_{I_L,I_R}. (On the G7 architecture, vertex textures of our render buffer format are unfortunately not supported. As a workaround, the right image can be remapped according to the disparity image on the GPU and, after uploading this, the distribution table is computed on the CPU with a slight performance drop.) For larger input images, the limited accuracy of the 16-bit float memory buffers available on the GPU requires data partitioning. Thus, the distribution calculation needs to be done for smaller image tiles with a subsequent accumulation phase analogous to [26]. When the joint distribution is available, the single probability distributions are calculated by one-dimensional reduce operations. The unscaled matching table is calculated by applying 5×5 Gaussian filters, logarithm functions and summing in the way described in Section 3.1. Finally, scale factors for the 256×256 array are determined by a two-dimensional reduce operation. They are utilized for adapting the penalties P_1 and P_2 and for setting up the final matching table mi.
4.3 GPU Implementation of Hierarchical Computation
The computation of disparity images starts on a coarse level of detail with a random initial matching table. The SGM algorithm is executed and, as described in Section 3.5, the disparity image that is found in this step is upscaled and a new matching table for the next level of detail is calculated. This iteration terminates when a final disparity image for the original image resolution has been reached. With a G7 GPU, all computationally demanding calculations are
done on the GPU, except for the determination of the joint distribution. If GPU and driver support all required OpenGL features (G8 and higher, Section 4.2) all computationally demanding calculation steps from the original stereo image pair to the resulting disparity image are done on the GPU, without hampering the computation by costly data transfers between CPU and GPU.
5 Results
We have tested the implementation on a NVIDIA GeForce 7800 GTX with 256 MB as well as a GeForce 8800 ULTRA with 768 MB on-board memory. Disparity images that were computed on the graphics card are shown in Figure 2.
Fig. 2. Results of computing SGM on the GPU with mutual information and 5 hierarchical levels
The interesting aspect of the GPU implementation is the run-time on different image sizes and disparity ranges. Figure 3 shows the results, which include transferring the stereo image pair onto the graphics card and the disparity image back to main memory. In theory, the run-time of the method should scale linearly with the number of pixels in an image as well as with the disparity range. However, we have found that the run-time on small images (i.e. 320 × 240) increases only slightly when doubling the disparity range, while the increase comes closer to the expectation when using large images. This is probably because the massive parallelization of the graphics card cannot be properly exploited for small images. The situation is similar for the image sizes: the run-time appears to scale linearly with the image width and not with the number of pixels of the input images. In contrast, the run-time scales worse than expected with the number of hierarchical levels. The run-time increase for matching the images at five hierarchical levels is expected to be 14% compared to only matching at full resolution [17].
[Figure: run-time in ms on (a) the GeForce 7800 GTX and (b) the GeForce 8800 ULTRA for image size × disparity range combinations from 320×240×32 up to 512×512×200, each with 5 hierarchical levels and with matching at full resolution only.]
Fig. 3. Run-times of the GPU implementation in different configurations, including the time for transferring the stereo image pair onto the graphics card and the disparity image back to main memory. Not all combinations are possible on the GeForce 7800, due to its smaller memory.
In our measurements (Fig. 3) it even doubles, on lower image sizes. This is probably due to the constant overhead of MI computation. Fortunately, the relative overhead of MI computation is reduced on larger images. The original CPU implementation on an Opteron with 2.2 GHz required 1.8 s on images of size 450 × 375 pixels with 64 pixel disparity range [17]. In contrast, our implementation requires just 114 ms at the same image resolution and disparity range. This is 15 times faster than the CPU implementation. In comparison, Rosenberg et al. [18] implemented SGM in Cg on a NVIDIA GeForce 7900 GTX. They implemented the absolute difference as matching cost instead of Mutual Information, which is faster, but offers a lower quality. Like us, they computed 8 paths for aggregation (but probably without adapting P2 ) and also computed the right image for consistency checking by searching diagonally through the aggregated costs S. In contrast to us, they also did hole filling, but no sub-pixel interpolation. They reached 8 fps using an image size of 320 × 240 pixel and 64 pixel disparity range. Our full implementation on the probably comparable NVIDIA GeForce 7800 GTX reaches 4.7 fps with the same resolution and disparity range, but includes computing 5 hierarchical levels with mutual information in contrast to Rosenberg et al. With just one hierarchical level, our implementation reaches 8.1 fps. Thus, our implementation has the same efficiency, but allows to compute the full method at the cost of a higher run-time. Gibson and Marques [19] implemented SGM in CUDA on a NVIDIA Quadro FX5600. They used the more sophisticated sampling insensitive absolute difference [20] instead of Mutual Information. Like us, they implemented 8 paths with adaptive P2 . However, they did not implement consistency checking or sub-pixel interpolation. They reached 5.9 fps using an image size of 450 × 375 pixel with 64 pixel disparity range. Our full implementation on the probably comparable NVIDIA GeForce 8800 ULTRA reaches 8.8 fps with the same resolution and disparity range, but includes consistency checking and computing 5 hierarchical
levels with mutual information in contrast to Gibson and Marques. With just one hierarchical level, but still with the consistency checking, our implementation reaches 16.1 fps. Thus, our Cg implementation appears much more efficient.
6 Conclusion
We have shown that it is possible to implement the full SGM algorithm including pixelwise matching with mutual information on the GPU. Our implementation reaches 4.2 fps on a GeForce 8800 ULTRA with images of 640 × 480 pixel size and 128 pixel disparity range and 13 fps on images of 320 × 240 pixel size and 64 pixel disparity range. This is already enough for many real-time applications. According to reported run-times, our implementation has a comparable efficiency to another Cg implementation of SGM and appeared much more efficient than a CUDA implementation. However, since CUDA offers more flexibility and higher abstraction from the graphics hardware, we are going to implement SGM in CUDA and hope to reach at least the same performance as in our Cg implementation.
References 1. Videre Design: Stereo on chip (2008), http://www.videredesign.com/vision/stoc.htm 2. Tyzx: Deep sea g2 vision system (2008), http://www.tyzx.com/products/DeepSeaG2.html 3. Point Grey: Triclops SDK (2008), http://www.ptgrey.com/products/triclopsSDK/index.asp 4. Hirschm¨ uller, H., Innocent, P.R., Garibaldi, J.M.: Real-time correlation-based stereo vision with reduced border errors. International Journal of Computer Vision 47, 229–246 (2002) 5. Owens, J.: GPU architecture overview. In: International Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2007 courses, San Diego, CA, USA. ACM, New York (2007) 6. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: Stream computing on graphics hardware. In: SIGGRAPH Conference (2004) 7. NVIDIA: CUDA compute unified device architecture, prog. guide, version 1.1 (2007) 8. Wang, L., Gong, M., Gong, M., Yang, R.: How far can we go with local optimization in real-time stereo matching. In: Third International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT) (2006) 9. Yang, R., Pollefeys, M.: Real-time stereo on commodity graphics hardware. In: IEEE Conference for Computer Vision and Pattern Recognition (2003) 10. Woetzel, J., Koch, R.: Real-time multi-stereo depth estimation on GPU with approximative discontinuity handling. In: March (ed.) 1st European Conference on Visual Media Production, London, UK (2004) 11. Gallup, D., Frahm, J.M., Mordohai, P., Yang, Q., Pollefeys, M.: Real-time planesweeping stereo with multiple sweeping directions. In: IEEE Computer Vision and Pattern Recognition, Minneapolis, MN, USA (2007)
12. Cornelis, N., Van Gool, L.: Real-time connectivity constrained depth map computation using programmable graphics hardware. In: IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, vol. 1, pp. 1099–1104 (2005) 13. Yang, R., Welch, G., Bishop, G.: Real-time consensus-based scene reconstruction using commodity graphics hardware. In: Pacific Graphics 2002, Beijing, China (2002) 14. Scharstein, D., Szeliski, R.: Middlebury stereo website (2008), http://www.middlebury.edu/stereo 15. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47, 7–42 (2002) 16. Gong, M., Yang, Y.H.: Near real-time reliable stereo matching using programmable graphics hardware. In: IEEE Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, vol. 1, pp. 924–931 (2005) 17. Hirschm¨ uller, H.: Stereo processing by semi-global matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 328–341 (2008) 18. Rosenberg, I.D., Davidson, P.L., Muller, C.M.R., Han, J.Y.: Real-time stereo vision using semi-global matching on programmable graphics hardware. In: International Conference on Computer Graphics and Interactive Techniques - SIGGRAPH (2006) 19. Gibson, J., Marques, O.: Stereo depth with a unified architecture GPU. In: IEEE Conference on Computer Vision and Pattern Recognition (2008) 20. Birchfield, S., Tomasi, C.: A pixel dissimilarity measure that is insensitive to image sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 401–406 (1998) 21. Hirschm¨ uller, H., Scharstein, D.: Evaluation of cost functions for stereo matching. In: IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, USA (2007) 22. Kim, J., Kolmogorov, V., Zabih, R.: Visual correspondence using energy minimization and mutual information. In: International Conference on Computer Vision (2003) 23. OpenGL: Home page (2008), http://www.opengl.org/ 24. NVIDIA: Cg Toolkit, User’s manual, Release 1.4, A Developer’s Guide to Programmable Graphics (2005) 25. Dencker, K.: Cloth Modelling on the GPU. PhD thesis, Department of Computer and Information Science (2006) 26. Scheuermann, T., Hensley, J.: Efficient histogram generation using scattering on GPUs. In: Proceedings of the 2007 symposium on Interactive 3D graphics and games, Seattle, Washington, USA, pp. 33–37. ACM, New York (2007)
Accurate Optical Flow Sensor for Obstacle Avoidance
Zhaoyi Wei, Dah-Jye Lee, Brent E. Nelson, and Kirt D. Lillywhite
Department of Electrical and Computer Engineering, Brigham Young University, Provo, UT, USA
Abstract. In this paper, an accurate optical flow sensor based on our previous design is proposed. Improvements are made to make the optical flow sensor more suitable for obstacle avoidance tasks on a standalone FPGA platform. Firstly, because optical flow algorithms are sensitive to noise, more smoothing units are added to the hardware pipeline to suppress the noise in the real video source. These units are hardware efficient to accommodate limited hardware resources. Secondly, a cost function is used to evaluate the estimated motion vector at each pixel for higher level analysis. Experimental results show that the proposed design can substantially improve the optical flow sensor performance for obstacle avoidance applications.
1 Introduction
Optical flow algorithms are widely used for motion estimation. A compact embedded real-time optical flow sensor will find applications in many computer vision tasks such as small unmanned autonomous vehicle navigation, vehicle collision detection, 3D reconstruction, and many other applications requiring standalone real-time motion estimation. Due to the high computational cost, traditional general purpose processors typically cannot meet the size, power consumption, and real-time computation requirements. In recent years, customized hardware logic has been designed to calculate optical flow in a pipelined hardware structure in order to achieve real-time processing [1]-[8]. Most of these are FPGA-based solutions. FPGA-based solutions can achieve much higher processing speed than software-based solutions, but are usually not as accurate. Using the Yosemite sequence as an example, a good software-based algorithm can achieve an angular error around 1° [9]. In comparison, the angular errors of FPGA-based designs are typically more than 7° [4, 7, 8]. There are two main reasons for this performance difference. First, most state-of-the-art optical flow algorithms are software-oriented and not a good match for implementation using a hardware pipeline structure. Second, optical flow algorithms require strong smoothing to suppress noise in order to extract accurate motion information. The limited hardware resources available on most FPGA platforms do not allow as much smoothing as in software. An optical flow sensor using a new hardware structure and a ridge regression algorithm was designed to achieve a tradeoff between accuracy and efficiency [8]. This sensor is used for obstacle avoidance on a small unmanned ground vehicle (UGV). In practice, the on-board camera generates noisy video compared to the image sequences
used in simulations. Because optical flow algorithms are sensitive to image noise, the performance would be suboptimal if the images were processed directly without any preprocessing. To address this, a smoothing unit is designed to filter the noise in the raw image frames. It can be implemented efficiently and added to the hardware pipeline. To further smooth the motion field as post-processing, a smoothing unit is also designed to effectively smooth the calculated motion field horizontally based on the noise level. Besides the motion field, the confidence measure of each estimated motion vector is also needed in practice to evaluate the quality of the motion vector. This is important because we want to discriminate pixels with unreliable motion estimation from those with truly inconsistent motion resulting from suspicious obstacle regions. A confidence measure which can be implemented efficiently in hardware is proposed to fulfill this goal. This paper is organized as follows. In Section 2, the algorithm formulation is introduced. The hardware structure is discussed in Section 3. Experimental results on synthesized and real videos are presented in Section 4. Conclusions and future work are discussed in Section 5.
2 Algorithm Formulation
The algorithm implemented for this work is similar to the work in [8]. It is presented here briefly for clarity. More detailed derivations of the algorithm can be found in [8]. The brightness constancy assumption can be written as g(x+Δx, y+Δy, t+Δt) = g(x, y, t), where x, y are the spatial components and t is the temporal component. An equation relating the gradient components g_x, g_y, and g_t and the velocity components v_x and v_y is derived [10]:
$g_x v_x + g_y v_y + g_t + \varepsilon = 0 \;\Rightarrow\; -g_t = g_x v_x + g_y v_y + \varepsilon$   (1)
where v_x = Δx/Δt, v_y = Δy/Δt, and ε is the error representing the higher order terms and noise. Each pixel in the image has one set of observations {g_ti, g_xi, g_yi}. In a small local neighborhood of n pixels, it can be assumed that they all have the same velocity v_x and v_y. The n sets of observations for these n pixels can then be expressed in matrix form as
$-\mathbf{g}_t = \mathbf{g}_x v_x + \mathbf{g}_y v_y + \boldsymbol{\varepsilon}$   (2)
where $\mathbf{g}_t = (g_{t1}, g_{t2}, \ldots, g_{tn})^T$, $\mathbf{g}_x = (g_{x1}, g_{x2}, \ldots, g_{xn})^T$, $\mathbf{g}_y = (g_{y1}, g_{y2}, \ldots, g_{yn})^T$, and $\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$. It is assumed that $E(\varepsilon_i) = 0$ with variance $\sigma^2$, i.e. $\varepsilon_i \sim (0, \sigma^2)$. Denoting $Y_{n\times 1} = -\mathbf{g}_t$, $X_{n\times 2} = (\mathbf{g}_x, \mathbf{g}_y)$, and $\theta_{2\times 1} = (v_x, v_y)^T$, the equation relating the observations {g_ti, g_xi, g_yi} and the parameter θ can be written in matrix form as
$Y = X\theta + \varepsilon$   (3)
A traditional least squares solution of θ in (3) is
$\hat{\theta}_{LS} = (X^T X)^{-1} X^T Y$   (4)
$E(\hat{\theta}_{LS}) = \theta$ and its covariance matrix is $\mathrm{Cov}(\hat{\theta}_{LS}) = \sigma^2 (X^T X)^{-1}$. If g_x and g_y exhibit near linear dependency, i.e. one vector is nearly a scaled version of the other, small noise in the observations will cause a large relative change in the inverse $(X^T X)^{-1}$ and produce very large and inaccurate motion vectors. For hardware-based algorithms, because of resource limitations, the neighborhood size n is usually smaller than in software-based algorithms, which increases the possibility of a nearly collinear $X^T X$ matrix. Abnormal motion vectors have a negative impact on neighboring motion vectors in the subsequent smoothing process. Simply restricting the magnitude of a motion vector is not an optimal solution. In this paper, the ridge estimator [11]-[14], as formulated in (5), is used to address this:
$\hat{\theta}_{RE} = (X^T X + k I_p)^{-1} X^T Y$   (5)
In (5), $I_p$ is an identity matrix of the same size as $X^T X$, where p equals 2 in this case, and k is a weighting scalar for $I_p$. For the selection of k, the HKB estimator [11] is calculated as
$k = \frac{p\hat{\sigma}^2}{\hat{\theta}_N^T \hat{\theta}_N}$   (6)
Here, $\hat{\theta}_N$ is the estimate directly above the current pixel; it is preset to $(1,1)^T$ on the first row. When $\hat{\theta}_N$ is zero, k equals zero. The error variance is estimated as
$\hat{\sigma}^2 = \frac{(Y - X\hat{\theta}_N)^T (Y - X\hat{\theta}_N)}{n - p}$   (7)
After obtaining k, the optical flow is estimated using (8). In the real implementation, an n×n weighting matrix W is used to assign weights to each set of observations based on their distance to the central pixel. Equations (5) and (7) are rewritten as
$\hat{\theta}_{RE} = (X^T W X + k I_p)^{-1} X^T W Y$   (8)
$\hat{\sigma}^2 = \frac{(Y - X\hat{\theta}_N)^T W (Y - X\hat{\theta}_N)}{n - p}$   (9)
To suppress noise, the raw images are spatially smoothed before the gradient components are calculated. Gradient components gx, gy, and gt are spatio-temporally smoothed respectively before they are used in (2). Motion vectors are also spatially smoothed to obtain a smoother motion field. To evaluate the quality of a motion vector, a cost function [15] can also be calculated as
$c_t(T) = \frac{m - t^T Q^{-1} t}{\mathrm{trace}(T)}, \qquad T = \begin{pmatrix} Q & t \\ t^T & m \end{pmatrix} = \begin{pmatrix} X^T W X + k I_p & X^T W Y \\ (X^T W Y)^T & Y^T W Y + k \end{pmatrix} = \begin{pmatrix} t_1 + k & t_4 & t_5 \\ t_4 & t_2 + k & t_6 \\ t_5 & t_6 & t_3 + k \end{pmatrix}$   (10)
Q is a 2×2 symmetric matrix and its inverse can be computed easily. c_t is the cost function, which indicates the variation along the motion direction in the spatio-temporal volume. Therefore, the smaller c_t is, the more reliable the motion estimation is, and vice versa. This algorithm is not iterative and can be pipelined. It only needs simple arithmetic operations, which makes it ideal for hardware implementation.
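For illustration, the per-pixel ridge estimate (8) and the cost function (10) reduce to a small closed-form computation once the weighted sums t1..t6 of (10) have been accumulated over the neighborhood. The packing of these sums and the names below are our own floating-point sketch, not the fixed-point hardware design.

```cpp
// t1 = sum w*gx*gx, t2 = sum w*gy*gy, t3 = sum w*gt*gt,
// t4 = sum w*gx*gy, t5 = -sum w*gx*gt, t6 = -sum w*gy*gt,
// i.e. the entries of X^T W X, X^T W Y and Y^T W Y; k is the ridge scalar of (6).
struct FlowResult { float vx, vy, cost; };

FlowResult ridgeFlow(float t1, float t2, float t3,
                     float t4, float t5, float t6, float k)
{
    // Q = X^T W X + k I (2x2), t = X^T W Y, m = Y^T W Y + k
    float q11 = t1 + k, q22 = t2 + k, q12 = t4;
    float m = t3 + k;

    // theta = Q^{-1} t, eq. (8), via the closed-form 2x2 inverse.
    // A guard against a near-zero determinant is omitted in this sketch.
    float det = q11 * q22 - q12 * q12;
    float vx = ( q22 * t5 - q12 * t6) / det;
    float vy = (-q12 * t5 + q11 * t6) / det;

    // c_t = (m - t^T Q^{-1} t) / trace(T), eq. (10)
    float tQt   = t5 * vx + t6 * vy;
    float trace = q11 + q22 + m;
    float cost  = (m - tQt) / trace;
    return {vx, vy, cost};
}
```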
3 Sensor System Design
The proposed compact optical flow sensor is an embedded vision system including video capture, processing, transfer and other functionalities. Various modules are connected to the buses as shown in Fig. 1. There are two types of buses in the system: the PLB (Processor Local Bus) and the OPB (On-chip Peripheral Bus). The PLB bus connects high speed modules such as the DER module (DERivatives calculation), the OFC module (Optical Flow Calculation), SDRAM (Synchronous Dynamic Random Access Memory), and the camera and USB modules. Lower speed modules such as the UART (Universal Asynchronous Receiver/Transmitter), the interrupt controller, and GPIO (General Purpose Input/Output) are connected to the OPB bus. The PLB and OPB buses are interconnected through a bridge.
[Figure: block diagram of the sensor system: the DER and OFC modules share SRAM through a multi-port SRAM arbiter; SDRAM, camera and USB modules sit on the PLB bus; UART, interrupt controller and GPIO sit on the OPB bus, connected via a PLB-to-OPB bridge; a host PC is attached through the USB and camera serial interfaces.]
Fig. 1. System Diagram
Data flow in the system is directed by software running on built-in PowerPC processors. The DER module shown in Fig. 2 is triggered when a new image frame is captured by the camera and has been stored in SDRAM. After processing, results from the DER module are stored in SRAM and SDRAM. The DER and OFC modules share the high-speed SRAM through a multi-port SRAM arbiter. The OFC module then processes the intermediate results and stores the resulting motion vectors in SDRAM. The intermediate or final results can be transferred to the host PC through the USB interface. A graphical user interface has been developed to observe and store the video and display status variables transferred from the sensor. After the raw images are read into the module, they are spatially smoothed first. The smoothing mask size is 3×3 and its parameters are set based on simulation
results. The smoothed images are stored in SRAM and used to calculate the derivative frames gx, gy, gt. To calculate gt, frame(t), frame(t-1), frame(t-3), frame(t-4) are used. frame(t-2) is used to calculate gx and gy. The results are stored both in SRAM and SDRAM.
[Figure: DER module pipeline: reading logic, smoothing, gx/gy calculation from frame(t-2), and gt calculation from frames t, t-1, t-3 and t-4, with intermediate data stored in SRAM and SDRAM.]
Fig. 2. DER module diagram
Fig. 3. OFC module diagram
The OFC module diagram is shown in Fig. 3. Derivative frames read into the hardware pipeline are temporally smoothed and then spatially smoothed. These smoothed derivative frames are combined to build the ridge regression model components. The components are spatially smoothed and then fed into the scalar estimator of (6) and (9) to calculate the scalar k. The optical flow vector is then smoothed with a 2D spatial filter followed by a 1D horizontal filter. In the OFC module, besides temporal smoothing, there are four spatial smoothing units: derivative frame 2D smoothing, regression model component 2D smoothing, optical flow 2D smoothing, and optical flow horizontal smoothing. The settings of the first three smoothing units are the same as in [8]. The optical flow horizontal smoothing is added to better filter out the noise in the motion field; it can be configured as either 7×1, 9×1, or 11×1 depending on the noise level, as sketched below. A horizontal smoothing unit is used in order to save hardware resources while still achieving a good smoothing effect. The weighting parameters of the horizontal smoothing unit are all one. The cost function is calculated at the same time as the optical flow.
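A behavioural C++ reference of such a horizontal smoothing unit might look as follows; the border handling and the normalization are our choices, since the paper only states that all weights are one, and the real unit is a hardware shift-register/adder chain rather than software.

```cpp
#include <vector>

// 1-D box filter over one motion-component row with all weights equal to one
// and a configurable width of 7, 9 or 11 taps.
void smoothRowHorizontal(const std::vector<float>& in,
                         std::vector<float>& out, int taps)
{
    int r = taps / 2;
    int W = static_cast<int>(in.size());
    out.assign(W, 0.0f);
    for (int x = 0; x < W; ++x) {
        float sum = 0.0f;
        int count = 0;
        for (int dx = -r; dx <= r; ++dx) {
            int xi = x + dx;
            if (xi < 0 || xi >= W) continue;      // shrink the window at borders
            sum += in[xi];
            ++count;
        }
        out[x] = sum / count;                     // normalization choice is ours
    }
}
```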
4 Experimental Results
Most of the algorithm modules have been implemented in hardware as described in [8]. The rest of the design was simulated in MATLAB. For debugging and evaluation purposes, bit-level accurate MATLAB simulation code was written to match the actual hardware circuits. In this paper, our focus is on evaluating the effectiveness of the new smoothing units and the cost function calculation module. Fig. 4 shows the simulation result for the Yosemite sequence. With the help of the new smoothing modules, a smooth and accurate motion field can be calculated. There are some noisy vectors in the lower left region because of the absence of texture. Even the sky region, which violates the brightness constancy assumption, shows a reasonable motion estimation result. Fig. 5 shows the result for the flower garden sequence. There are some noisy motion vectors close to the boundary of the trunk due to occlusion. Fig. 6 shows the result for the SRI tree sequence. The motion field is smooth except for some pixels located at motion boundaries. Fig. 7 shows the image of the cost function for the Yosemite sequence. The intensity of each pixel is inverted with respect to the cost value to make it more visible. Pixels with lower cost, which have more reliable motion estimates, appear brighter in the cost function image, and vice versa. It can be observed from Fig. 4 and Fig. 7 that the lower left region has higher cost function values because this region lacks texture. Consequently, motion vectors in this region are noisier than in other parts of the image. To resolve this issue, a larger smoothing window would be needed to include more pixels in the motion estimation process; however, this would use more hardware resources. There is always a trade-off between accuracy and efficiency in a hardware implementation. Table 1 shows the simulated accuracies on the Yosemite sequence under different settings. The accuracies are measured as angular errors. A1 is the accuracy of the proposed design (using raw image smoothing and velocity horizontal smoothing). A2 is the accuracy using raw image smoothing without velocity horizontal smoothing. A3 is the accuracy using velocity horizontal smoothing without raw image smoothing. A4 is the accuracy of our previous design in [8]. The two new smoothing units improve the motion field accuracy substantially at the cost of a small increase in hardware resource utilization.
Fig. 4. Yosemite sequence result
Fig. 5. Flower garden sequence
Fig. 6. SRI tree sequence
Fig. 7. Cost image for Yosemite sequence
Table 1. Accuracy comparison (angular error)
A1 | A2 | A3 | A4
4.8° | 5.5° | 5.9° | 6.8°
5 Conclusions and Future Work
An optical flow sensor was proposed in our previous work [8]. In order to make it more resistant to the noise in images captured from the on-board camera, additional smoothing units are introduced. Simulation results show that the improvements allow the sensor to generate a more accurate and smoother motion field. To accommodate the limited hardware resources, these smoothing units must be implemented efficiently in hardware. To apply the motion field to obstacle avoidance tasks, we need to evaluate the reliability of each motion vector so that pixels with unreliable motion estimation can be discriminated from those with truly inconsistent motion (suspicious obstacle regions). To fulfill this task, a cost function is calculated at each pixel.
The system is targeted at obstacle avoidance tasks for small unmanned ground vehicles. Future work will include developing an obstacle avoidance algorithm that uses the optical flow, and a control strategy based on the obstacle detection result.
References 1. Correia, M., Campilho, A.: Real-time Implementation of An Optical Flow Algorithm. In: Proc. ICPR, vol. 4, pp. 247–250 (2002) 2. Zuloaga, A., Martín, J.L., Ezquerra, J.: Hardware Architecture for Optical Flow Estimation in Real Time. In: Proc. ICIP, vol. 3, pp. 972–976 (1998) 3. Martín, J.L., Zuloaga, A., Cuadrado, C., Lázaro, J., Bidarte, U.: Hardware Implementation of Optical Flow Constraint Equation Using FPGAs. Computer Vision and Image Understanding 98, 462–490 (2005) 4. Díaz, J., Ros, E., Pelayo, F., Ortigosa, E.M., Mota, S.: FPGA-based Real-time OpticalFlow System. IEEE Trans. Circuits and Systems for Video Technology 16, 274–279 (2006) 5. Arribas, P.C., Maciá, F.M.H.: FPGA Implementation of Camus Correlation Optical Flow Algorithm for Real Time Images. In: 14th Int. Conf. Vision Interface, pp. 32–38 (2001) 6. Niitsuma, H., Maruyama, T.: High speed computation of the optical flow. In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 287–295. Springer, Heidelberg (2005) 7. Wei, Z.Y., Lee, D.J., Nelson, B.E., Archibald, J., Edwards, B.B.: FPGA-based Embedded Motion Estimation Sensor. International Journal of Reconfigurable Computing (2008) 8. Wei, Z.Y., Lee, D.J., Nelson, B.E., Archibald, J.K.: Accurate Optical Flow Sensor for Obstacle Avoidance. In: Proc. ICPR (accepted, 2008) 9. Farnebäck, G.: Very High Accuracy Velocity Estimation Using Orientation Tensors, Parametric Motion, and Simultaneous Segmentation of the Motion Field. In: Proc. ICCV, vol. 1, pp. 77–80 (2001) 10. Horn, B., Schunck, B.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981) 11. Groß, J.: Linear Regression. Lecture Notes in Statistics, vol. 175. Springer, Berlin (2003) 12. Hoerl, A., Kennard, R.: Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 12, 55–67 (1970) 13. Comaniciu, D.: Nonparametric Information Fusion for Motion Estimation. In: Proc. CVPR, vol. 1, pp. 59–66 (2003) 14. Hoerl, A., Kennard, R., Baldwin, K.: Ridge Regression: Some Simulations. Communications in Statistics, Theory and Methods 4, 105–123 (1975) 15. Wei, Z.Y., Lee, D.J., Nelson, B.E.: A Hardware-friendly Adaptive Tensor Based Optical Flow Algorithm. In: 3rd International Symposium on Visual Computing, pp. 43–51 (2007)
A Novel 2D Marker Design and Application for Object Tracking and Event Detection
Xu Liu¹, David Doermann¹, Huiping Li¹, K.C. Lee², Hasan Ozdemir², and Lipin Liu²
¹ Applied Media Analysis Inc., College Park, MD, [email protected]
² Panasonic R&D Company of America, Princeton, NJ, [email protected]
Abstract. In this paper we present a novel application which uses 2D barcodes for object tracking and event detection. We analyze the relationship between the spatial efficiency of a marker and its robustness against defocusing. Based on our analysis we design a spatially efficient and robust 2D barcode, M-Code (MarkerCode), which can be attached to the surface of static or moving objects. Compared with traditional object detection and tracking methods, M-Code not only identifies and tracks an object but also reflects its position and the orientation of the surface to which it is attached. We implemented an efficient algorithm that tracks multiple M-Codes in real time in the presence of rotation and perspective distortion. In experiments we compare the spatial efficiency of M-Code with existing 2D barcodes, and quantitatively measure its robustness, including its scaling capability and tolerance to perspective distortion. As an application we use the system to detect door movements and track multiple moving objects in real scenes.
1 Introduction
In this paper we present a novel barcode-based object detection and tracking method. The method detects and tracks 2D barcodes attached to the surface of an object through real-time decoding. 2D barcodes have been widely used in context-aware and mobile computing environments. Traditional 2D barcodes (Fig. 1a) carry up to a few hundred bytes of information. Some research [1,2] suggests that 2D barcodes, which often store an index to online content, can be used to identify the position and orientation of the capturing device to facilitate mobile interaction (Fig. 1b). Inspired by these novel applications and the use of visual markers [3] in video surveillance, we found that 2D barcodes can be used to identify the location of the object to which they are attached (Fig. 1c). This capability fits the requirements of tracking multiple objects and detecting concurrent events. Other visual trackers have been explored for unconstrained video surveillance environments, but they require extensive processing for object detection and classification. Markers attached to moving objects can be detected and tracked automatically by computers at significantly reduced labor cost. Existing 2D barcodes, such as QRCode, MaxiCode or DataMatrix, are typically not designed for tracking and are often decoded at a close distance. "Visual Code" and "Spot Code" (Fig. 1b) are designed for interacting with mobile devices,
Fig. 1. Existing 2D-barcodes & MCode
which implicitly require cooperative users to spot the code. "Array Tags" [4] were designed to be located and decoded at a long distance, but require a considerable area to print and attach. The spatial efficiency of a marker is an important factor since an object may offer only a limited area. Codes with higher coding efficiency are often more robust because they contain fewer units (a unit typically refers to a black/white square); correspondingly, a larger unit is more resistant to distortions caused by blurring, noise and/or perspective distortion. It is worth mentioning that colors have also been used to improve the spatial efficiency of markers [5], but color may be affected by environmental lighting and distance and is not a stable feature to recognize. Binary (black and white) markers are more convenient to print and more reliable to read, so we focus on binary markers in this paper. We describe the code design in Section 2 and present the locating and decoding algorithm, including a fast perspective rectification method, in Section 3. We evaluate the spatial efficiency of MCode and the speed and robustness of decoding in Section 4, and draw conclusions in Section 5.
2 Code Design
When an object is being tracked, the image of the attached M-Code may be defocused if the object is not within the best focal range of the camera. Defocusing is usually modeled by Gaussian convolution [6]. We first analyze the relationship between the unit width of the marker and the size of the Gaussian kernel to show the importance of the spatial efficiency of the marker. For a black and white marker, Gaussian convolution may increase the gray scale value of a black cell and decrease the gray scale value of a white cell in the captured image. When a black cell appears to be brighter than a white cell, the captured M-Code image can no longer be read correctly. This situation is most
Fig. 2. Defocus
Fig. 3. MCode design
likely to happen when a black cell is surrounded by white cells, or vice versa, as shown in Figure 2. The gray scale value at the center of a black unit cell with width w, after Gaussian convolution with kernel size σ, may be as high as H:
H=
|x|>w/2 |y|>w/2
=1− =1−
x2 +y2 2σ2 2πσ2
−
e
−
e
dxdy
x2 +y2 2σ2
2πσ2
dxdy
(1)
|x|<w/2 |y|<w/2 Erf ( 2√w2σ )2
In order for the black cell to be distinguishable from white cells, we must have H < 0.5: H < 0.5 √ => Erf ( 2√w2σ ) > 22 (2) √ √ 2 −1 => w > 2 2Erf ( 2 )σ
and therefore the unit width w must be greater than $2\sqrt{2}\,\mathrm{Erf}^{-1}(\sqrt{2}/2)\,\sigma$ to keep the marker readable under Gaussian convolution with kernel size σ. So we should utilize every cell in a limited print area and keep the unit width w as large as possible. Taking this requirement into consideration, we designed a new marker which reflects its position (four corners) with high spatial efficiency. As shown in Fig. 1a, traditional 2D barcodes usually use part of the image as a finding pattern. The finding pattern, however, decreases the coding efficiency. Therefore, we do not use an explicit finding pattern. As shown in Fig. 3, MCode consists of 8×8 black and white cells bounded by a black box with half unit width. The black box plays an important role in facilitating decoding. It is a robust feature that can still be detected even if the code image is blurred or defocused. More importantly, the location of the black box determines the location of every cell in the marker. MCode encodes 28 bits of information with a 4-bit checksum and 32 bits of Reed-Solomon [7] error correction data, which can correct 2 bytes (16 bits) of error at any location in the MCode.
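As a quick numeric illustration of this bound, the sketch below (our own, not the authors' implementation; it only assumes NumPy and SciPy) computes the minimum readable unit width for a given blur scale and checks that the center value H of Eq. (1) equals 0.5 exactly at that width.

```python
# Sketch: numeric check of the unit-width bound w > 2*sqrt(2)*Erf^-1(sqrt(2)/2)*sigma.
import numpy as np
from scipy.special import erf, erfinv

def min_unit_width(sigma):
    """Smallest unit width keeping a lone black cell darker than 0.5 after blur of scale sigma."""
    return 2.0 * np.sqrt(2.0) * erfinv(np.sqrt(2.0) / 2.0) * sigma

def center_value_after_blur(w, sigma):
    """H from Eq. (1): gray value at the center of an isolated black cell of width w."""
    return 1.0 - erf(w / (2.0 * np.sqrt(2.0) * sigma)) ** 2

sigma = 1.5                                   # hypothetical blur scale, for illustration only
w = min_unit_width(sigma)
print(w, center_value_after_blur(w, sigma))   # H is exactly 0.5 at the bound
```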
3 Locating and Decoding
The detection and tracking process is actually performed by a single procedure: detecting and decoding the MCode in the captured frames. The decoding of MCode consists of three steps: code location, corner registration and perspective correction. To track MCodes in real time, code location and perspective correction must be performed efficiently. To ensure successful decoding and find the exact position of an object, registration has to be highly accurate.
3.1 Code Location
Unlike mobile barcode reading, which relies on users to locate the barcode, we have to locate the MCode anywhere in the captured image. The location process must be performed very efficiently to satisfy the real-time performance requirements. We use a multi-resolution approach to accelerate the location algorithm; multi-resolution has been applied to image retrieval [8] and object extraction [9] to improve efficiency. We first down-sample the original image by averaging every 5 × 5 pixel block into one super pixel (Fig. 4b). Meanwhile we compute the histogram of the grayscale values and find the threshold to binarize the image using 2-means classification. The binarization is adaptive to environmental lighting and can extract the bounding box of M-Code under global lighting changes. We then search the super pixel image for connected components using a breadth-first search (BFS). The complexity of BFS is linear in the number of pixels, so running it on the super pixel image rather than on the original image accelerates the connected-component search by a factor of 25. A Hough transform is then performed to locate the four edges of each connected component, and the approximate positions of the four corners are calculated as well.
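The following sketch (our own illustration, not the authors' code; the names and the use of SciPy's connected-component labeling are our assumptions) mirrors this code-location step: 5×5 block averaging into super pixels, a 2-means threshold on the gray levels, and connected-component extraction on the binarized super-pixel image.

```python
# Sketch of the code-location step described above (5x5 super pixels, 2-means threshold,
# connected components). The paper uses BFS for components; we use scipy.ndimage.label instead.
import numpy as np
from scipy import ndimage

def locate_candidates(gray):
    h, w = gray.shape
    h5, w5 = h - h % 5, w - w % 5
    # average every 5x5 block into one super pixel (25x fewer pixels to search)
    sp = gray[:h5, :w5].reshape(h5 // 5, 5, w5 // 5, 5).mean(axis=(1, 3))
    # 2-means (Lloyd) threshold on the gray levels, adaptive to global lighting
    t = sp.mean()
    for _ in range(20):
        lo, hi = sp[sp <= t], sp[sp > t]
        if lo.size == 0 or hi.size == 0:
            break
        t_new = 0.5 * (lo.mean() + hi.mean())
        if abs(t_new - t) < 0.5:
            break
        t = t_new
    dark = sp <= t                       # candidate bounding-box pixels are dark
    labels, n = ndimage.label(dark)      # connected components of the dark mask
    return labels, n, t

# Each component would then be passed to a Hough transform to recover its four edges and corners.
```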
Fig. 4. Code location, the original image and its super-pixel representation
3.2 Corner Registration
After locating the approximate position of an MCode, we compute the exact locations (Fig. 5a) of the four corners, which determine the full geometry (homography). We convolve a template in the neighborhood of the approximate corner position and find the maximum response. An example of the templates is shown in Fig. 5b.
Fig. 5. Corner detection and the template used to exactly locate the corner
3.3 Perspective Correction
MCode may be captured from an arbitrary angle. To decode the data we first need to correct the perspective distortion, i.e., we need to calculate the mapping between an ideal, non-perspective image and a camera-captured image, which can be described as a
Fig. 6. Fast Perspective Correction
plane-to-plane homography. Unlike classic homogeneous estimation methods [10], we use a fast perspective correction which reduces the computation to seven cross products and avoids floating point operations. As shown in Fig. 6, we first perform an affine transformation and then a perspective transformation. Suppose the coordinates of the four corners in the image plane are (P_1, P_2, P_3, P_4), and the top and bottom boundaries of the bounding box intersect at vanishing point A. Under homogeneous coordinates, A = L_1 × L_2 = (P_1 × P_4) × (P_2 × P_3). Similarly, B = L_3 × L_4 = (P_1 × P_2) × (P_3 × P_4). A and B are infinite points in the original plane, so the third element of A and B under homogeneous coordinates should be 0 in the affine image. Any homography

H = \begin{pmatrix} \vec{H}_1 \\ \vec{H}_2 \\ \vec{H}_3 \end{pmatrix}

that maps the perspective image back into an affine image should map A and B to infinity, which implies

H_3 \sim \big((P_1 \times P_4) \times (P_2 \times P_3)\big) \times \big((P_1 \times P_2) \times (P_3 \times P_4)\big)    (3)

This suggests that we can calculate H_3 using seven cross products. Any homography H with the third row H_3 computed by (3) maps the perspective image (III in Fig. 6) to an affine image (II). The next task is to fill in the first and second rows of H. The reason for calculating this homography H is that, given any matrix coordinate, we can quickly tell its pixel coordinate in the image. From the matrix coordinate (I) to the affine image (II), the transformation is linear and can be directly computed by transforming the basis
of the coordinate system. In the final step we transform the affine image (II) to the perspective image (III) by computing H^{-1}. Therefore, we choose the first and second rows of H so that it has a neat inverse:

H^{\pm 1} \sim \begin{pmatrix} h_{33} & 0 & 0 \\ 0 & h_{33} & 0 \\ \pm h_{31} & \pm h_{32} & h_{33} \end{pmatrix}    (4)

This "inverse" only requires reversing two signs in the third row of H. In this way it accelerates the coordinate transformation with numerical stability: the numerical inverse is normally subject to "division by zero" when H is nearly singular, whereas our method is division free. For any entry (i, j) in a w-by-h matrix we compute its affine coordinate $\frac{i}{w}\vec{P_1P_4} + \frac{j}{h}\vec{P_1P_2}$ and use (4) to map this affine coordinate to the image coordinate. All these computations can be performed with integer operations only and are therefore highly efficient.
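A minimal sketch of the seven-cross-product estimate of H_3 in Eq. (3) is given below; it assumes only NumPy, and the corner values are made-up examples.

```python
# Sketch (our notation): third homography row H3 from the four corners, Eq. (3).
import numpy as np

def third_row(P1, P2, P3, P4):
    """P1..P4: corners as homogeneous 3-vectors (x, y, 1). Returns H3 up to scale."""
    A = np.cross(np.cross(P1, P4), np.cross(P2, P3))   # vanishing point A = L1 x L2
    B = np.cross(np.cross(P1, P2), np.cross(P3, P4))   # vanishing point B = L3 x L4
    return np.cross(A, B)                              # seventh cross product: H3 ~ A x B

corners = [np.array([x, y, 1.0]) for x, y in [(10, 12), (118, 15), (121, 120), (8, 116)]]
print(third_row(*corners))   # any H with this third row maps A and B to infinity
```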
4 Evaluation
4.1 Spatial Efficiency Evaluation
As previously addressed, spatial efficiency is directly related to the robustness of the marker: a code that requires more unit cells is more vulnerable to degradation. As shown in Fig. 7, an MCode and a QRCode storing one integer undergo the same level of Gaussian blur; the cells in the QRCode can no longer be separated, while the MCode is still recognizable. We quantitatively compare MCode's spatial efficiency to QR Code and Data Matrix in Table 1. To encode the same amount of data, MCode contains 9×9 = 81 unit cells, while a QR code (version 1) contains at least 21×21 unit cells and a DataMatrix code contains at least 12×12 unit cells, where 44 bits are used for encoding the finding pattern. From this we can see that MCode is the most efficient.
Fig. 7. Robustness under blurriness, MCode vs. QRCode
Table 1. Comparison of 2D barcodes
Code Type         Unit Cells     Bits   Cells/bit
MCode             9×9 = 81       28     2.89
DataMatrix (C80)  12×12 = 144    24     6
QRCode (M)        21×21 = 441    112    3.94
4.2 Processing Speed
As addressed above, object detection and tracking are performed through a decoding process, so the decoding speed is critical for real-time tracking. The MCode decoder is implemented on a P4 3.0 GHz PC with a Panasonic WV-NP224 surveillance camera and a WV-LZ62/8S lens. The camera can capture raw VGA images at 10 fps and runs at 9 fps while detecting M-Codes. The decoding time per frame is therefore about 1/9 − 1/10 ≈ 0.011 second, which is equivalent to a frame rate of 90 fps if we ignore the time spent on image acquisition.
4.3 Scale and View Angle Test
For MCode to be applicable in complex environments, one major concern is its scalability, i.e., at what distance an MCode can be recognized.
Fig. 8. Scale and View Angle Test
Our tests show that an MCode printed on A4 paper can be recognized at 30 meters (Fig. 8a) with VGA resolution, or printed at 0.5 inch × 0.5 inch and recognized at less than 2 inches distance (Fig. 8b). When an object is moving, the attached MCode may not always be captured from an upper-front view angle; when this happens, perspective distortion exists. To quantitatively measure the robustness to perspective distortion, we use the ratio between the longest edge and the shortest edge (Fig. 3) as a criterion:
K=
max(|P1 P2 |, |P2 P3 |, |P3 P4 |, |P4 P1 |) min(|P1 P2 |, |P2 P3 |, |P3 P4 |, |P4 P1 |)
K = 1 indicates no perspective distortion; the larger K is, the stronger the perspective distortion. Fig. 9 shows the histogram of K when we arbitrarily move an MCode in front of the camera. Our decoder can decode when K is as large as 3.45.
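The distortion measure is straightforward to compute from the four registered corners; the snippet below is a hedged illustration with made-up corner coordinates.

```python
# Sketch: distortion ratio K between the longest and shortest edge of the detected quadrilateral.
import numpy as np

def distortion_ratio(P1, P2, P3, P4):
    pts = [P1, P2, P3, P4]
    edges = [np.linalg.norm(pts[i] - pts[(i + 1) % 4]) for i in range(4)]
    return max(edges) / min(edges)       # K = 1 for a fronto-parallel (undistorted) view

print(distortion_ratio(np.array([0., 0.]), np.array([100., 5.]),
                       np.array([95., 98.]), np.array([2., 90.])))
```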
Fig. 9. Histogram of K
Fig. 10. Application Scenario
5 Application Scenario and Future Work
Our system has been applied to detecting door events and tracking multiple moving objects (Fig. 10)1. The experimental results show that MCode and its decoder work robustly for these tasks. In future work, we will use M-Code as a tool for self-calibration. With the calibrated camera, we will fully reconstruct the 3D geometry and track the exact 3D coordinates of the moving object.
References
1. Rekimoto, J., Ayatsuka, Y.: Cybercode: Designing augmented reality environments with visual tags. In: Proceedings of DARE 2000 on Designing Augmented Reality Environments, pp. 1–10 (2000)
2. Rohs, M., Gfeller, B.: Using camera-equipped mobile phones for interacting with real-world objects. In: Ferscha, A., Hoertner, H., Kotsis, G. (eds.) Advances in Pervasive Computing, Vienna, Austria, pp. 265–271. Austrian Computer Society (OCG) (2004)
3. Schiff, J., Meingast, M., Mulligan, D., Sastry, S., Goldberg, K.: Respectful Cameras: Detecting Visual Markers in Real-Time to Address Privacy Concerns. In: International Conference on Intelligent Robots and Systems (IROS) (2007)
4. Little, W., Baker, P.: Data with perimeter identification tag. US Patent 5,202,552 (1993)
5. Cheong, C., Kim, D., Han, T.: Usability Evaluation of Designed Image Code Interface for Mobile Computing Environment. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4551, pp. 241–251. Springer, Heidelberg (2007)
6. Hwang, T.l., Clark, J., Yuille, A.: A depth recovery algorithm using defocus information. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 1989, pp. 476–482 (1989)
7. Wicker, S.B., Bhargava, V.K. (eds.): Reed-Solomon Codes and Their Applications. John Wiley & Sons, Inc., New York (1999)
8. Nikulin, V., Bebis, G.: Multiresolution image retrieval through fusion. In: Proceedings of SPIE, vol. 5307, pp. 377–387 (2004)
9. Porikli, F., Wang, Y.: An unsupervised multi-resolution object extraction algorithm using video-cube. In: Proceedings of 2001 International Conference on Image Processing, vol. 2 (2001)
10. Criminisi, A., Reid, I., Zisserman, A.: A plane measuring device. Image and Vision Computing 17, 625–634 (1999)
1 Video demos of object tracking and event detection using MCode can be found at http://sites.google.com/site/mcodedemos/m-code-demos
Automatic Lung Segmentation of Volumetric Low-Dose CT Scans Using Graph Cuts
Asem M. Ali and Aly A. Farag
Computer Vision and Image Processing Laboratory (CVIP Lab), University of Louisville, Louisville, KY 40292
{asem,farag}@cvip.uofl.edu, www.cvip.uofl.edu
Abstract. We propose a new technique for unsupervised segmentation of the lung region from low-dose computed tomography (LDCT) images. We follow the conventional approach in which initial images and desired maps of regions are described by a joint Markov-Gibbs random field (MGRF) model of independent image signals and interdependent region labels, but our focus is on more accurate identification of the MGRF model and the gray level distribution model. To better specify region borders between lung and chest, each empirical distribution of volume signals is precisely approximated by a linear combination of Gaussians (LCG) with positive and negative components. The LCG model parameters are estimated by a modified EM algorithm. The initial segmentation (labeled volume) based on the LCG models is then iteratively refined using the MGRF with analytically estimated potentials. In this framework, graph cuts are used as a global optimization algorithm to find the segmented (labeled) data that minimize an energy function which integrates the LCG model and the MGRF model. To validate the accuracy of our algorithm, a special 3D geometrical phantom motivated by statistical analysis of the LDCT data is designed. Experiments on both phantom and 3D LDCT data sets show that the proposed segmentation approach is more accurate than other known alternatives.
1 Introduction
Isolating the lung from its surrounding anatomical structures is a crucial step in many studies, such as the detection and quantification of interstitial disease and the detection and/or characterization of lung cancer nodules; for more details, see the survey by Sluimer et al. [1]. However, CT lung density depends on many factors such as the image acquisition protocol, subject tissue volume, air volume, and the physical material properties of the lung parenchyma. These factors make threshold-based lung segmentation difficult. Developing accurate algorithms that require no human interaction and that rely on the gray level difference between the lung and its background is therefore important for precise lung segmentation. The literature is rich with approaches for lung segmentation in CT images. Hu et al. [2] proposed an optimal gray level thresholding technique which is
used to select a threshold value based on the unique characteristics of the data set. In [3], Brown et al. integrated region growing and morphological operations with an anatomical knowledge expert to automatically segment the lung volume. A segmentation-by-registration scheme was proposed by Sluimer et al. [4] for automated segmentation of the pathological lung in CT. In that scheme, a scan with normal lungs is registered to a scan containing pathology; when the resulting transformation is applied to a mask of the normal lungs, a segmentation is found for the pathological lungs. Although shape-based or atlas-based segmentation (e.g., [5]) overcomes the problem of gray level inhomogeneities, creating a 3D shape model of the lung is not an easy task in the 3D case, and these techniques also need a registration step. Conventional methods that perform lung segmentation in CT depend on a large contrast in Hounsfield units between the lung and surrounding tissues. Although these methods accurately segment normal lung tissues from LDCT, they tend to fail in the case of gray level inhomogeneity, which results from abnormal lung tissues. The main advantage of our proposed segmentation approach over conventional techniques is that it models both the intensity distribution and the spatial interaction between the voxels in order to overcome any region inhomogeneity existing in the lung region. Moreover, the proposed segmentation algorithm is fast, which makes it suitable for clinical applications. Recently, graph cuts have been used as an interactive N-D image segmentation tool (for more details see [6]), and many studies have used graph cuts for lung segmentation. Boykov and Jolly [7] introduced an interactive segmentation framework in which the user must identify some voxels as object and others as background seeds; a graph cut approach is then used to find the optimal cut that completely separates the object seeds from the background seeds. To overcome the time complexity and memory overhead of the approach in [6] for high-resolution data, Lombaert et al. [8] performed graph cuts on a low-resolution image/volume and propagated the solution to the next higher resolution level by computing the graph cuts at that level only in a narrow band surrounding the projected foreground/background interface. Although the results of these approaches looked promising, manual interaction was still required. Interactive segmentation imposes some topological constraints reflecting high-level contextual information about the object, but it depends on the user input, which has to be accurately positioned; otherwise the segmentation results change. Chen et al. [9] used morphological operations and graph cuts to segment the lung from radiographic images automatically. In that work, the authors initialized an outer boundary for each lung region by shrinking 10 pixels from the boundaries of both vertical halves of an image. This does not work in axial CT slices such as ours, where there is a lung part in the middle of the image. Inner boundaries were obtained by dilating the "regional minimum"; however, due to the inhomogeneity in the given image, there were many regional minima, so they selected one based on a threshold. Then the authors in [9] used graph cuts to find the boundaries of each lung region between its inner and outer boundaries. The data penalty and
discontinuity penalty were chosen to be inversely proportional to the gray level difference between neighboring pixels. This choice is inappropriate for axial CT lung slices because of their gray level inhomogeneities. In this paper, we propose a novel automatic lung volume segmentation approach that uses graph cuts as a powerful optimization technique to obtain the optimal segmentation. Unlike previous graph cut studies, our segmentation approach needs no user interaction; instead, we use the volume gray levels to initially pre-label the volume. To model the low-level information in the CT lung volume, the gray level distribution of the lung volume is approximated with a new linear combination of Gaussian (LCG) distributions with positive and negative components. Because of the closeness of the gray levels of lung tissues and chest tissues, we do not depend only on the volume gray level; we use the graph cut approach to combine the volume gray level information and the spatial relationships between the region labels in order to preserve details. Often, the potentials of the Potts model, which describes the spatial interaction between neighboring voxels, are estimated using simple functions that are inversely proportional to the gray scale difference between two voxels and their distance. Another contribution of this work is that the potentials of the Potts model are estimated using a new analytical approach. After the lung volume is initially labeled, we formulate a new energy function using both volume appearance models. This function is globally minimized using s/t graph cuts to obtain the final, optimal segmentation of the lung.
2 Proposed Graph Cuts Segmentation Framework
To segment a lung, we initially label the volume based on its gray level probabilistic model. Then we create a weighted undirected graph with vertices corresponding to the set of volume voxels P and a set of edges connecting these vertices. Each edge is assigned a nonnegative weight. The graph also contains two special terminal vertices: s (source), "lung", and t (sink), "chest and other tissues". Consider a neighborhood system in P, represented by a set N of all unordered pairs {p, q} of neighboring voxels in P. Let L = {"0", "1"} be the set of labels corresponding to the lung and its background, respectively. A labeling is a mapping from P to L, and we denote a labeling by f = {f_1, ..., f_p, ..., f_{|P|}}. In other words, the label f_p assigned to the voxel p ∈ P classifies it as lung or background. Our goal is now to find the optimal segmentation, i.e., the best labeling f, by minimizing the following energy function:

E(f) = \sum_{p \in P} D_p(f_p) \;+\; \sum_{\{p,q\} \in N} V(f_p, f_q)    (1)

where D_p(f_p) measures how much assigning a label f_p to voxel p disagrees with the voxel intensity I_p; D_p(f_p) = −ln P(I_p | f_p) is formulated to represent the regional properties of segments (Sec. 2.1). The second term is the pairwise interaction model, which represents the penalty for the discontinuity between voxels p and q (Sec. 2.2).
2.1 Gray Level Probabilistic Model
To initially label the lung volume and to compute the data penalty term D_p(f_p), we use the modified EM [10] to approximate the gray level marginal density of each class f_p (lung and background) using an LCG with C^+_{f_p} positive and C^-_{f_p} negative components:

P(I_p \mid f_p) = \sum_{r=1}^{C^+_{f_p}} w^+_{f_p,r}\,\varphi(I_p \mid \theta^+_{f_p,r}) \;-\; \sum_{l=1}^{C^-_{f_p}} w^-_{f_p,l}\,\varphi(I_p \mid \theta^-_{f_p,l})    (2)

where φ(·|θ) is a Gaussian density with parameter θ ≡ (μ, σ²), with mean μ and variance σ². w^+_{f_p,r} denotes the r-th positive weight in class f_p and w^-_{f_p,l} the l-th negative weight in class f_p. These weights are subject to the restriction

\sum_{r=1}^{C^+_{f_p}} w^+_{f_p,r} \;-\; \sum_{l=1}^{C^-_{f_p}} w^-_{f_p,l} = 1.

Fig. 1 illustrates an example of the lung gray level LCG model and its components.
Fig. 1. Example of a lung gray level LCG Model. (a) The empirical and initial estimated densities and the dominant components. (b) The scaled absolute deviations between the empirical and initial estimated densities. (c) Approximation error for the scaled absolute error as a function of the number of Gaussians, which is used to approximate the scaled absolute error in (b). (d) The components of the final LCG model. (e) The final LCG density approximation. (f) The LCG models of each class with the best separation threshold t = 109.
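For concreteness, the sketch below evaluates the LCG density of Eq. (2) for one class; the component weights, means and variances are arbitrary placeholders, not parameters estimated from LDCT data.

```python
# Sketch: LCG density of Eq. (2) with positive and negative Gaussian components.
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def lcg_density(x, pos, neg):
    """pos/neg: lists of (weight, mean, variance); positive weights minus negative weights sum to 1."""
    d = sum(w * gauss(x, m, v) for w, m, v in pos)
    d -= sum(w * gauss(x, m, v) for w, m, v in neg)
    return np.clip(d, 0.0, None)         # clip small negative ripples of the mixed-sign mixture

x = np.arange(256)
p_lung = lcg_density(x, pos=[(0.7, 60.0, 400.0), (0.5, 90.0, 900.0)],
                        neg=[(0.2, 75.0, 250.0)])   # 0.7 + 0.5 - 0.2 = 1
```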
2.2 Spatial Interaction Model
The homogeneous isotropic pairwise interaction model, which represents the penalty for the discontinuity between voxels p and q, is defined as follows:

V(f_p, f_q) = \begin{cases} \gamma & \text{if } f_p \neq f_q \\ 0 & \text{if } f_p = f_q \end{cases}    (3)

The simplest model of spatial interaction is the Markov Gibbs random field (MGRF) with the nearest 6-neighborhood. Therefore, for this specific model the Gibbs potential can be obtained analytically using the maximum likelihood estimator (MLE) for a generic MGRF [11]. The resulting approximate MLE of γ is:

\gamma^* = K - \frac{K^2}{K-1}\, f_{neq}(f)    (4)

where K = 2 is the number of classes in the volume and f_neq(f) denotes the relative frequency of unequal labels in the voxel pairs, defined as:

f_{neq}(f) = \frac{1}{|T_N|} \sum_{\{p,q\} \in T_N} \delta(f_p \neq f_q)    (5)
where the indicator function δ(A) equals one when the condition A is true and zero otherwise, and T_N = {{p, q} : p, q ∈ P; {p, q} ∈ N} is the family of neighboring voxel pairs supporting the Gibbs potentials.
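The analytical estimate of Eqs. (4)-(5) reduces to counting unequal 6-neighbor label pairs in the current labeled volume. The sketch below is our own illustration of that count (not the authors' code) and uses the expression of Eq. (4) as reconstructed above.

```python
# Sketch: approximate MLE of the Potts potential gamma from a labeled volume, Eqs. (4)-(5).
import numpy as np

def potts_potential(labels, K=2):
    """labels: 3D integer array of region indices; 6-neighborhood = +/-1 along each axis."""
    neq, pairs = 0, 0
    for axis in range(3):
        a = np.swapaxes(labels, 0, axis)
        diff = a[1:] != a[:-1]           # neighboring pairs along this axis with unequal labels
        neq += diff.sum()
        pairs += diff.size
    f_neq = neq / pairs                  # relative frequency of unequal label pairs, Eq. (5)
    return K - (K ** 2 / (K - 1.0)) * f_neq   # Eq. (4)

labels = (np.random.rand(8, 8, 8) > 0.5).astype(int)   # toy volume
print(potts_potential(labels))                          # close to 0 for a random labeling
```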
2.3 Graph Cuts Optimal Segmentation
To segment a lung volume, instead of independently segmenting each 2D slice, we segment the 3D lung using a 3D graph (e.g., Fig. 2), where each vertex represents a voxel in the lung volume. We then define the weight of each edge as shown in Table 1. After that, we obtain the optimal segmentation surface between the lung and its background by finding the minimum cost cut on this graph. The minimum cost cut is computed exactly in polynomial time for two-terminal graph cuts with positive edge weights via the s/t Min-Cut/Max-Flow algorithm [12].
3 Experiments and Discussion
To assess the performance of the proposed segmentation framework, we demonstrate it on axial human chest slices obtained by spiral-scan low-dose computed tomography (LDCT); the 8-mm-thick LDCT slices were reconstructed every 4 mm with a scanning pitch of 1.5 mm. Each volume is initially labeled using the gray level LCG model's threshold. However, due to the gray level inhomogeneities, one cannot precisely segment the lung using only this threshold, as shown in Fig. 3, where the misclassified voxels may include abnormal lung
Table 1. Graph edge weights

Edge     Weight                For
{p, q}   γ                     f_p ≠ f_q
{p, q}   0                     f_p = f_q
{s, p}   −ln[P(I_p | "1")]     p ∈ P
{p, t}   −ln[P(I_p | "0")]     p ∈ P
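A hedged sketch of building this graph is shown below. It assumes the PyMaxflow package (an implementation of the Boykov-Kolmogorov min-cut/max-flow algorithm [12]); the authors' own implementation is in C++ and is not reproduced here.

```python
# Sketch: s/t graph with the edge weights of Table 1, solved with PyMaxflow (assumed installed).
import numpy as np
import maxflow

def graph_cut_segment(p_lung, p_bg, gamma):
    """p_lung, p_bg: 3D arrays of P(I_p|"1") and P(I_p|"0"); returns one side of the minimum cut."""
    g = maxflow.Graph[float]()
    ids = g.add_grid_nodes(p_lung.shape)
    g.add_grid_edges(ids, gamma)              # n-links {p,q}: weight gamma between grid neighbors
    eps = 1e-10
    g.add_grid_tedges(ids,
                      -np.log(p_lung + eps),  # {s,p}: -ln P(I_p | "1")
                      -np.log(p_bg + eps))    # {p,t}: -ln P(I_p | "0")
    g.maxflow()
    return g.get_grid_segments(ids)           # binary labeling induced by the minimum cut
```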
Fig. 2. Example of the graph used in lung volume segmentation. Note: terminals should be connected to all voxels, but for clarity of illustration this is not drawn.
Fig. 3. Samples of segmented lung slices using LCG model’s threshold. (Error shown in red).
tissues. After computing the initially labeled volume, the potential that encodes the spatial interaction between its voxels is computed from this labeled volume by Eq. (4). After that, we construct a graph for the given volume using the 6-neighborhood system (e.g., Fig. 2). The s/t graph cut approach then gives the minimum of the binary energy of Eq. (1), which corresponds to the optimal segmentation. The segmentation errors are evaluated with respect to the ground truth produced by an expert (a radiologist). Fig. 4 shows samples of segmented slices for different subjects as well as their segmented 3D lung volumes.

Table 2. Accuracy and time performance of our segmentation on 7 data sets in comparison to ICM and IT. Average volume: 256×256×77.

Algorithm               Our     ICM       IT
Minimum error, %        1.66    3.31      2.34
Maximum error, %        3.00    9.71      8.78
Mean error, %           2.29    7.08      6.14
Standard deviation, %   0.5     2.4       2.1
Significance, P         -       2×10^-4   5×10^-4
Average time, sec       46.26   55.31     7.06
Fig. 4. Samples of segmented lung slices using the proposed algorithm (errors shown in red), and the corresponding 3D lung volumes (errors shown in green). Per-subject errors: (a) 2.08%, (b) 2.21%, (c) 2.17%, (d) 1.95%.
Evaluation: To evaluate the results we calculate the percentage segmentation error as follows:

\text{error\%} = \frac{100 \cdot \text{number of misclassified voxels}}{\text{number of lung volume voxels}}    (6)

We ran the proposed approach on 23 data sets. The statistical analyses of seven of them, for which we have ground truths (radiologist segmentations),
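A one-liner suffices to compute this measure; in the sketch below we take the denominator to be the number of lung voxels in the ground-truth mask, which is our reading of Eq. (6).

```python
# Sketch: percentage segmentation error of Eq. (6).
import numpy as np

def error_percent(seg, gt):
    """seg, gt: boolean 3D arrays (True = lung voxel); gt is the radiologist's ground truth."""
    misclassified = np.count_nonzero(seg != gt)
    return 100.0 * misclassified / np.count_nonzero(gt)
```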
are shown in Table 2. For comparison, the statistical analyses of the Iterative Conditional Modes (ICM) [13] technique and the Iterative Threshold (IT) [2] approach are also shown. An unpaired t-test shows that the differences in mean error between the proposed segmentation and ICM/IT are statistically significant (the two-tailed P value is less than 0.0006). The main problem with the ICM and IT segmentations is that the misclassified voxels include abnormal lung tissues (lung cancer), bronchi and bronchioles, as shown in Fig. 5. These tissues are important if lung segmentation is a preprocessing step in a lung nodule detection system; the motivation behind our segmentation is to exclude such errors as far as possible. All algorithms were run on a PC with a 3 GHz Pentium 4 and 2 GB RAM, and all implementations are in C++.
Fig. 5. Examples of segmented lung slices from two subjects that contain nodules (bounded by yellow circles). The (a) IT and (b) ICM approaches misclassify these parts as chest tissue (error shown in red), whereas (c) the proposed algorithm correctly classifies them as lung.
4 Validation
Due to hand-shaking errors, it is difficult to get an accurate ground truth from manual segmentation. To assess the robustness of the proposed approach, we have created a 3D geometric lung phantom (256×256×81). To create this phantom, we started with a lung volume segmented by a radiologist (lung regions, arteries, veins, bronchi, and bronchioles). The lung and background signals of the phantom were then generated according to the distributions P(I|0) and P(I|1), respectively, in Fig. 1(f) using the inverse mapping approach [10]. Fig. 6 shows some slices of the lung volume phantom. The error of 0.71% between our results and the ground truth confirms the high accuracy of the proposed segmentation framework. Fig. 7 shows the segmentation of the phantom volume by the proposed approach as well as by ICM. As expected, in our approach the
Fig. 6. Slices from the synthetic volume
Fig. 7. Lung phantom segmentation results: (a) the proposed algorithm, 0.71%, and (b) the ICM technique, 2.31%
misclassified voxels are located at the boundary, whereas the misclassified voxels in the ICM result cause abnormal lung tissues to be lost.
5 Conclusion
In this paper, we have presented a novel framework for automatic lung volume segmentation using the graph cut approach. Our proposed method addresses the limitations of interactive techniques. We initially pre-label the volume using its gray level information; the gray level distribution of the lung volume is approximated with an LCG distribution with positive and negative components. An MGRF model is used to describe the spatial interaction between the lung voxels, and a new analytical approach to estimate the 3D spatial interaction potentials for the MGRF model is presented. Finally, an energy function using the previous models is formulated and globally minimized using graph cuts. Experimental results show that the developed technique gives promising, accurate results compared to other known algorithms.
References
1. Sluimer, I., Schilham, A., Prokop, M., van Ginneken, B.: Computer analysis of computed tomography scans of the lung: a survey. IEEE Transactions on Medical Imaging 25, 385–405 (2006)
2. Hu, S., Hoffman, E.A., Reinhardt, J.M.: Automatic lung segmentation for accurate quantitation of volumetric X-ray CT images. IEEE Transactions on Medical Imaging 20, 490–498 (2001)
3. Brown, M.S., McNitt-Gray, M.F., Mankovich, N.J., Goldin, J., Hiller, J., Wilson, L.S., Aberle, D.R.: Method for segmenting chest CT image data using an anatomical model: Preliminary results. IEEE Transactions on Medical Imaging 16, 828–839 (1997)
4. Sluimer, I., Prokop, M., van Ginneken, B.: Toward automated segmentation of the pathological lung in CT. IEEE Transactions on Medical Imaging 24, 1025–1038 (2005)
5. Zhang, L., Hoffman, E.A., Reinhardt, J.M.: Atlas-driven lung lobe segmentation in volumetric X-ray CT images. In: Proc. of the SPIE, vol. 5031, pp. 306–315 (2003)
6. Boykov, Y., Funka-Lea, G.: Graph cuts and efficient N-D image segmentation. International Journal of Computer Vision 70, 109–131 (2006)
7. Boykov, Y., Jolly, M.P.: Interactive organ segmentation using graph cuts. In: Delp, S.L., DiGoia, A.M., Jaramaz, B. (eds.) MICCAI 2000. LNCS, vol. 1935, pp. 276–286. Springer, Heidelberg (2000)
8. Lombaert, H., Sun, Y., Grady, L., Xu, C.: A multilevel banded graph cuts method for fast image segmentation. In: IEEE Proceedings of International Conference on Computer Vision, vol. I, pp. 259–265 (2005)
9. Chen, S., Cao, L., Liu, J., Tang, X.: Automatic segmentation of lung fields from radiographic images of SARS patients using a new graph cuts algorithm. In: International Conference on Pattern Recognition, vol. 1, pp. 271–274 (2006)
10. Farag, A., El-Baz, A., Gimelfarb, G.: Density estimation using modified expectation maximization for a linear combination of Gaussians. In: IEEE Proceedings of International Conference on Image Processing, vol. 3, pp. 1871–1874 (2004)
11. Gimelfarb, G.L.: Image Textures and Gibbs Random Fields. Kluwer Academic, Dordrecht (1999)
12. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1124–1137 (2004)
13. Besag, J.E.: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B 48, 259–302 (1986)
A Continuous Labeling for Multiphase Graph Cut Image Partitioning
Mohamed Ben Salah1, Amar Mitiche1, and Ismail Ben Ayed2
1 INRS-EMT, Institut National de la Recherche Scientifique, 800, de La Gauchetière West, Bureau 6900, Montréal, H5A 1K6, QC, Canada
{bensalah,mitiche}@emt.inrs.ca
2 General Electric (GE) Canada, 268 Grosvenor, E5-137, London, N6A 4V2, ON, Canada
[email protected]
Abstract. This study investigates a variational multiphase image segmentation method which combines the advantages of graph cut discrete optimization and multiphase piecewise constant image representation. The continuous region parameters serve both image representation and graph cut labeling. The algorithm iterates two consecutive steps: an original closed-form update of the region parameters and partition update by graph cut labeling using the region parameters. The number of regions/labels can decrease from an initial value, thereby relaxing the assumption that the number of regions is known beforehand. The advantages of the method over others are shown in several comparative experiments using synthetic and real images of intensity and motion.
1 Introduction
Image segmentation serves many useful applications in diverse fields such as remote sensing, medicine, robotics, computer libraries, and art, and it is a fundamental problem in computer vision that has been the focus of an impressive number of theoretical, methodological, and practical studies [27]. To segment an image is to divide it into regions, each answering a given description. Image segmentation was first studied along two main veins: edge detection, to find the region boundaries, and region growing, to reach whole regions from seed regions. The most serious problems with these methods are that the computed edges are irregular and do not form closed contours, necessitating ad hoc post-processing. These shortcomings can be avoided altogether by variational formulations of segmentation. Such formulations seek a segmentation which minimizes an objective function or functional containing terms that embed descriptions of the segmentation regions and of their boundaries. Following the seminal work of Mumford and Shah [1], many studies have focused on variational image segmentation. Variational problem statements can be discrete or continuous. In the continuous case, an image is a continuous function defined over a continuous domain [1,5,6]. Continuous formulations which use active curve functionals, minimized via level set evolution, have been the most flexible [6,27,2,24].
However, purely continuous algorithms still suffer from the drawbacks of local optimization techniques such as gradient descent: they may lead to local minima and are consequently sensitive to initialization. High computational complexity is also a significant limitation of continuous methods [4]. Discrete formulations use discrete objective functions from the start, as they take images to be discrete functions defined over positional arrays [7,9,3,10]. Graph cut combinatorial algorithms, which view segmentation as a discrete label assignment problem, have been of intense interest recently, although they were used for binary images much earlier [23]. Several studies have shown that graph cuts can be quite useful in image analysis; for instance, very fast methods have been implemented for image segmentation [13,12,22], motion/stereo segmentation [15], tracking [21], and restoration [3,11]. More importantly, graph cut algorithms make it possible to obtain nearly global optima [11]. However, the huge number of labels as well as discretization artifacts are significant limitations of discrete methods [16], particularly in the context of unsupervised image segmentation. In this connection, segmentation by graph cut optimization of the objective function gives a partition of the image domain described by a subset of the set of all possible labels. In [13,11], the label set is the gray level set, in which case each segmentation region is characterized by a gray level label. The removal of labels from the computed segmentation falls to the regularization term (the prior), because the data term increases with a decreasing number of regions [28]. Therefore, the regularization term must be able to decrease when the number of regions decreases in order to discard some of the initial labels from the computed segmentation. The contrast term used in [11] allows this. However, the result is in general an over-segmentation, because the objective function makes no explicit reference to the number of regions; the use of more labels than necessary also lengthens the execution time. Several recent studies have shown the benefits of combining discrete and continuous optimization in the context of motion estimation [16] and motion/image segmentation [15,17,18,19]. The general motivation of this recent trend is to take advantage of the ability of graph cut algorithms to obtain nearly global optima while avoiding the drawbacks of purely discrete methods, namely the huge number of labels and discretization artifacts [16]. In [15], a clever combination of continuous and discrete processing was possible in the context of motion segmentation by assuming that the number of regions is known and using a parametric representation of motion within each region. The resulting algorithm iterated a gradient descent step to estimate the motion parameters in each region, followed by a segmentation update by graph cut optimization. This scheme essentially replaces the segmentation update by level set evolution in [2] with a graph cut update, producing similar results in less time. Our study is most closely related to the recent two-region segmentation methods in [17,18,19], where a two-region segmentation is sought via graph cut optimization of the continuous Chan-Vese functional [6], which contains two characteristic terms: a data term measuring the conformity of the image data within the two regions (a foreground and a background) to the parametric piecewise constant model and
a smoothness prior related to the length of the boundary between the two regions. In this connection, two points can be brought up. The first is related to the assumption that the number of regions is known and equal to two. This assumption can be relaxed by starting with a larger number of regions and using a prior to decrease it; the length prior in [15,18,19] would not allow this in general, because merging two disjoint regions does not change its value. In the continuous level set framework, several studies embedded region-merging priors in the segmentation functional [20], which allows decreasing the initial number of regions, but the optimization of such priors in the graph cut framework is unfortunately not straightforward. The second point is that, in the context of graph cuts, the expression estimating the length prior contains a Dirac function which calls for a regularization to compute its derivative. This can slow down the graph cut algorithm, when one of its motivations is rapidity of execution. In addition, the segmentation parameters cannot be estimated in closed form. The purpose of our study is to investigate variational multiphase image segmentation into an unknown number of regions by a method which shares the advantages of graph cuts and continuous labeling. The labels are seen both as region indices, as in the discrete framework [11], and as variable parameters which characterize the segmentation regions, as in the continuous framework [6]. Different from the length prior in [15,18,19], our prior is expressed as a function of the regions' continuous parameters, which are updated iteratively along with the segmentation process. Used in conjunction with a data term which measures the conformity of the image data within each region to the piecewise constant model, the proposed prior leads to an original closed-form update of the region parameters. Furthermore, it allows decreasing the number of regions in a more general way than the length prior. Also, our multiphase data term is different from the one in [11] because it is a function of variable continuous labels rather than fixed integer labels. The proposed method iterates two steps: an original closed-form update of the regions' continuous labels and segmentation by graph cut combinatorial optimization. The number of regions/labels can decrease from an initial value, thereby relaxing the need to know the number of regions beforehand. We use, instead, the weaker assumption that the actual or desired number of regions is less than or equal to some value.
2 Segmentation Functional
In this study we will not assume that the desired number of regions is known beforehand. Instead, we will assume that an upper bound, Nreg, on this number is available. We state segmentation as the minimization of a functional which is the sum of two characteristic terms: a data term measuring the conformity of the segmentation to a parametric piecewise constant image model, and a smoothness prior. To write down the functional, we use an indexing function λ : p ∈ Ω → λ(p) ∈ L which assigns each point of the image positional array Ω to a region. L is a finite set of region indices whose cardinality is less than or equal to Nreg. The objective function can then be written as:
F(\lambda) = D(\lambda) + \alpha\, R(\lambda)    (1)

where D and R are, respectively, the data term and the prior, and α is a positive factor. Let μ_l be the piecewise constant model parameter of region R_l. The data term and the prior are respectively given by:

D(\lambda) = \sum_{p\in\Omega} D_p(\lambda(p)) = \sum_{l\in L}\sum_{p\in R_l} (\mu_l - I_p)^2,
\qquad
R(\lambda) = \sum_{\{p,q\}\in N} r_{\{p,q\}}(\lambda(p), \lambda(q))    (2)

where N is the point neighborhood set and r_{p,q}(λ(p), λ(q)) is a smoothness regularization function given by the truncated squared absolute difference [11,13]: r_{p,q}(λ(p), λ(q)) = min(const², |μ_{λ(p)} − μ_{λ(q)}|²). We now notice that the functional can be rewritten in a form amenable to graph cut processing:

F(\{\mu_l\}, \lambda) = \sum_{l\in L}\sum_{p\in R_l} (\mu_l - I_p)^2 + \alpha \sum_{\{p,q\}\in N} r_{\{p,q\}}(\lambda(p), \lambda(q))    (3)
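To make the two terms concrete, the sketch below (our own illustration; the names and toy values are not from the paper) evaluates functional (3) on a 2D label image with the truncated squared-difference prior and a 4-neighborhood.

```python
# Sketch: evaluating F({mu_l}, lambda) of Eq. (3) on a 2D labeling.
import numpy as np

def energy(image, labeling, mu, alpha, const):
    """image, labeling: 2D arrays; mu: 1D array mapping label l to its parameter mu_l."""
    mu_map = mu[labeling]                                 # mu_{lambda(p)} at every pixel
    data = np.sum((mu_map - image) ** 2)                  # data term
    prior = 0.0
    for axis in (0, 1):                                   # 4-neighborhood: vertical and horizontal pairs
        a = np.swapaxes(mu_map, 0, axis)
        d2 = (a[1:] - a[:-1]) ** 2
        prior += np.minimum(const ** 2, d2).sum()         # r_{p,q} = min(const^2, |mu_p - mu_q|^2)
    return data + alpha * prior

mu = np.array([30.0, 120.0, 200.0])                       # toy parameters for 3 labels
img = np.random.rand(64, 64) * 255
lab = np.random.randint(0, 3, (64, 64))
print(energy(img, lab, mu, alpha=1.2, const=50.0))
```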
3 Functional Optimization
Function (3) is minimized by iterating two consecutive steps: a closed-form update of the region parameters (Section 3.1) and segmentation by graph cut combinatorial optimization (Section 3.2).
3.1 Continuous Labels Updating
To optimize (3) for a given segmentation, we differentiate with respect to μ_k, k ∈ L:

\frac{\partial F(\{\mu_l\},\lambda)}{\partial \mu_k} = 2\sum_{p\in R_k}(\mu_k - I_p) + 2\alpha \sum_{\substack{\{p,q\}\in N,\ \lambda(p)=k \\ |\mu_{\lambda(p)}-\mu_{\lambda(q)}|<\mathrm{const}}} (\mu_{\lambda(p)} - \mu_{\lambda(q)})    (4)

We choose a neighborhood system N of size 4. Let N_p be the set of neighbors q of p such that λ(q) ≠ λ(p) and |μ_{λ(p)} − μ_{λ(q)}| < const. Function (4) can then be simplified to:

\frac{\partial F(\{\mu_l\},\lambda)}{\partial \mu_k} = 2\sum_{p\in R_k}(\mu_k - I_p) + 2\alpha \sum_{p\in C_k}\sum_{q\in N_p} (\mu_k - \mu_{\lambda(q)})    (5)

where C_k is the boundary of region R_k. The parameter optimizing the functional, given a segmentation, is thus deduced from the last equation and expressed in closed form as:

\mu_k = \frac{\sum_{p\in R_k} I_p \;+\; \alpha \sum_{p\in C_k}\sum_{q\in N_p} \mu_{\lambda(q)}}{\|R_k\| \;+\; \alpha \sum_{p\in C_k} \|N_p\|}    (6)

where ‖·‖ designates set cardinality.
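The closed-form update (6) can be implemented with a few array operations; the sketch below (ours, 2D with a 4-neighborhood, wrap-around at the image border ignored for brevity) updates the parameter of one region k.

```python
# Sketch: closed-form update of mu_k per Eq. (6) on a 2D labeling.
import numpy as np

def update_mu(image, labeling, mu, k, alpha, const):
    in_k = labeling == k
    num = image[in_k].sum()
    den = float(np.count_nonzero(in_k))
    mu_map = mu[labeling]
    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):     # the 4 neighbors of each pixel
        q_lab = np.roll(labeling, (dy, dx), axis=(0, 1))  # note: np.roll wraps at the border
        q_mu = np.roll(mu_map, (dy, dx), axis=(0, 1))
        # q in N_p: different label and parameter difference below the truncation constant
        mask = in_k & (q_lab != k) & (np.abs(mu[k] - q_mu) < const)
        num += alpha * q_mu[mask].sum()
        den += alpha * np.count_nonzero(mask)
    return num / den
```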
3.2 Graph Cuts Segmentation Updating
In this section we describe the graph cut algorithm for the segmentation step. Let G = ⟨V, E⟩ be a weighted graph, where V is the set of vertices (nodes) and E the set of edges. V contains a node for each pixel in the image and two additional nodes called terminals; commonly, one is called the source and the other the sink. There is an edge {p, q} between any two distinct nodes p and q. A cut C ⊂ E is a set of edges verifying:
– the terminals are separated in the graph G(C) = ⟨V, E \ C⟩;
– no subset of C separates the terminals in G(C).
This means that a cut is a set of edges whose removal separates the terminals into two graphs, and the cut is minimal in the sense that none of its subsets separates the terminals into the same two graphs. The minimum cut problem consists of finding the cut C, in a given graph, with the lowest cost. The cost of a cut, |C|, is the sum of its edge weights. The study in [11] gives two algorithms based on graph cuts which find a local minimum of an objective function of the type of equation (3) using two kinds of large moves: expansion moves and swap moves. Large moves proceed by changing the labels of a large set of pixels simultaneously to decrease the objective function iteratively. In this work, we use the algorithm based on swap moves because it handles more general energy functions. For more details, we refer the reader to [11].
4 Experimental Results
To validate the proposed method, we ran experiments with grey-level images and motion maps. Initializations consist of arbitrary partitions and region parameters. All experiments were run in MATLAB 7.0.1 on a Pentium(R) 1.73 GHz computer with 504 MB of RAM.
4.1 Grey Level Images
Synthetic data. Figure 1 shows simple synthetic images of geometric shapes segmented with our method. In this first experiment, each shape corresponds to a region whose intensity is a constant plus Gaussian noise with different parameters. Thus, the number of regions in each image is known in advance (Figure 1; (a): 2 regions, (b): 3 regions, (c): 4 regions and (d): 5 regions). Different from the method in [11], which uses a set of fixed labels corresponding to all possible grey levels, we can initialize the segmentation with a smaller number, 5 in this experiment, which corresponds to the maximum number of regions over the different examples. The results are exactly the same if we initialize with the actual number of regions. The method in [11] allows the number of labels to decrease; however, it uses the set of fixed labels corresponding to all possible grey levels, i.e., 256 grey levels. Thus, the final segmentation is described by a subset of the initial set of labels and is an implicit over-segmentation, since no explicit number of regions guides the segmentation. As shown in the last row of Table 1, the final segmentation
Fig. 1. Synthetic images with Gaussian noise: (a) two regions; (b) three regions; (c) four regions; (d) five regions; (a)-(d): segmentations with this method using a 5-region initialization. Image size: 128 × 128. α = 1.2.
is described, in all cases, by a number of regions which is much larger than the right number of regions. Different from [11], our method allows the region parameters to vary jointly with the segmentation; the initial partition and set of parameters are thus arbitrary, with the number of regions set to the chosen maximum. Figures 1(a)-(d) show that our method segments the images into the right number of regions (see also Table 1).

Table 1. Running time (in seconds) and the number of final remaining regions for synthetic grey level images

Number of regions    This method   Final regions   Boykov et al. [11]   Final regions
2                    0.12          2               124.9                53
3                    0.28          3               132.4                58
4                    0.31          4               110.01               51
5                    1.57          5               136.12               98
In addition, our method significantly decreases the running time because we deal with a number of regions much smaller than with the method in [11]; the region parameters are calculated iteratively and jointly with the segmentation process, which removes the need to use the set of all possible grey levels. Table 1 reports the running times of our method and of the discrete optimization method in [11] and shows that, for segmentation, ours leads to an improvement in speed and accuracy over the method in [11].

Real data. In general, the number of regions in an image is not known and, consequently, cannot be fixed in advance. However, in most applications, a maximum number of regions is given, i.e., we know that the actual number of regions is less than a maximum number. This maximum number is, in most cases, much smaller than the number of all possible grey levels, which was commonly used [11] to guide graph cut segmentation without an explicit reference to the number of regions. In this study, we set the initial number of regions to such
a maximum number. In the following, we give a representative sample of tests with real images. When applied to segmentation, the Boykov algorithm [11] may not be able to segment these images into the desired number of segments; it generally gives an over-segmentation, which results in a much longer convergence time.
Fig. 2. Segmentation results for real images: (a) images; (b) final segmentations; α = 2.2
Figure 2 shows five different real images with different numbers of regions. As our method iteratively computes the region parameters jointly with the segmentation process, it is possible to assign a maximum value to the initial number of regions; this number is then decreased by the regularization term, which allows region merging. With an arbitrary initialization for each image in Figure 2 and using the same regularization penalty (α = 2.2), we display in the second row the final segmentation of each image, in which each region is represented by its corresponding parameter (equation (6)) at convergence.

Table 2. Running time (in seconds) for real grey level images with different methods

Image   Chan & Vese [6]   Mansouri et al. [24]   This method
(1)     30.72             0.17                   0.15
(2)     54.84             0.07                   0.38
(3)     —                 3.96                   0.40
(4)     —                 324.5                  25.50
(5)     —                 87.1                   5.50
Table 2 reports the running time of our method and other methods from the variational optimization framework based on active contours and level sets [6], [24]. In the first column, we give the running time of the well known Chan and Vese method [6] applied to images of two regions (images (1) and (2)). The running time with the method by Mansouri et al. [24] is reported in the second column. For multiregion image segmentation with more than one active
Fig. 3. Segmentation of the Road and Marmor sequences with a 5-region initialization: (a) the motion map of the Road sequence; (b) remaining regions; (c) the motion map of the Marmor sequence; (d) remaining regions
contour, this recent method [24] guarantees a partition using a costly correspondence between the interior of active contours and the regions of segmentation. The current study removes this partition problem with the combinatorial optimization framework. Consequently, when the number of regions increases, our method becomes increasingly more rapid than the two level set methods [24,8] (refer to Table 2 for images (3) to (5)). For images with only two regions (images (1) and (2)), the Chan and Vese method requires more convergence time than both other methods. For a large number of regions, this method becomes time-consuming and difficult to implement [8]. The method by Mansouri et al. [24] handles the partition problem better but, as reported in Table 2, its running time is significantly higher than ours.
4.2 Motion
In this section, we run experiments on vectorial images to demonstrate the effectiveness of our method for more complex data. We show a representative sample of tests with motion sequences. We use image sequences composed of two regions in Figure 3(a), the moving truck and the background, and three regions in Figure 3(c), the background and two objects moving in different directions. The purpose is to segment the optical flow image into motion regions. The optical flow field is calculated using the method in [26] and is represented, at each pixel, by a two-dimensional vector, as shown in Figures 3(a) and 3(c). We initialized with 5 regions and fixed α = 2. Figures 3(b,d) display the remaining regions after convergence: only two regions remain at convergence for the Road sequence and three for the Marmor sequence. The colored curves shown in Figures 3(b,d) separate the moving objects from the background.
5 Conclusion
The subject of our study was variational multiphase image segmentation. We investigated a method which combined the advantages of graph cut combinatorial
optimization and continuous labeling. In this combination, the labels were seen both as region indices and as parameters which characterize the segmentation regions according to a piecewise constant image model. The algorithm iterated two consecutive steps: an original closed-form update of the regions' continuous labels, and segmentation by graph cut combinatorial optimization. The formulation used an upper bound on the number of regions rather than the actual number of regions. The advantages of the method over others were shown in several experiments with synthetic and real images of intensity, color and motion.
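For readers who want the overall structure rather than the details, a minimal sketch of the alternation summarized above; the region-mean update and the placeholder graph_cut_step routine are assumptions standing in for the paper's closed-form update (equation (6)) and its graph cut solver:

```python
import numpy as np

def alternate_labels_and_cuts(image, n_max_regions, alpha, n_iters=20,
                              graph_cut_step=None):
    """Illustrative alternation of (i) a parameter update for each region label
    and (ii) a combinatorial relabeling step.  The update shown is the region
    mean under a piecewise-constant model (an assumption, not the paper's
    closed form); graph_cut_step is a placeholder for an alpha-expansion
    style solver."""
    labels = np.random.randint(0, n_max_regions, size=image.shape)
    params = np.zeros(n_max_regions)
    for _ in range(n_iters):
        for k in range(n_max_regions):              # step 1: label/parameter update
            mask = labels == k
            if mask.any():
                params[k] = image[mask].mean()
        # step 2: relabeling; data cost (I(p) - mu_k)^2, smoothness weight alpha
        data_cost = (image[..., None] - params[None, None, :]) ** 2
        if graph_cut_step is not None:
            labels = graph_cut_step(data_cost, alpha)
        else:
            labels = data_cost.argmin(axis=-1)      # degenerate fallback, no smoothness
    return labels, params
```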
References

1. Mumford, D., Shah, J.: Optimal approximation by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math. 42, 577–685 (1989)
2. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape. IJCV 72(2) (2007)
3. Boykov, Y., Funka-Lea, G.: Graph Cuts and Efficient N-D Image Segmentation. IJCV 70(2) (2006)
4. Ben Ayed, I., Mitiche, A., Belhadj, Z.: Polarimetric Image Segmentation via Maximum Likelihood Approximation and Efficient Multiphase Level Sets. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9), 1493–1500 (2006)
5. Zhu, S.C., Yuille, A.: Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans. on PAMI 18(6), 884–900 (1996)
6. Chan, T.F., Vese, L.A.: Active Contours Without Edges. IEEE Trans. on Image Processing 10(2), 266–277 (2001)
7. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. on PAMI 22(8), 888–905 (2000)
8. Vese, L.A., Chan, T.F.: A multiphase level set framework for image segmentation using the Mumford and Shah model. Int. J. Comput. Vis. 50(3), 271–293 (2002)
9. Leclerc, Y.G.: Constructing Simple Stable Descriptions for Image Partitioning. International Journal of Computer Vision 3(1), 73–102 (1989)
10. Mignotte, M., Collet, C., Pérez, P., Bouthemy, P.: Sonar image segmentation using a hierarchical MRF model. IEEE Transactions on Image Processing IP-9(7), 1216–1231 (2000)
11. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. on Patt. Anal. and Mach. Intell. 23(11), 1222–1239 (2001)
12. Boykov, Y., Kolmogorov, V.: An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision. IEEE Trans. on PAMI 26(9), 1124–1137 (2004)
13. Veksler, O.: Efficient Graph-Based Energy Minimization Methods in Computer Vision. PhD Thesis, Cornell Univ. (July 1999)
14. Bagon, S.: Matlab Wrapper for Graph Cut (December 2006), www.wisdom.weizmann.ac.il/~bagon
15. Schoenemann, T., Cremers, D.: Near Real-Time Motion Segmentation Using Graph Cuts. In: DAGM-Symposium, pp. 455–464 (2006)
16. Lempitsky, V., Roth, S., Rother, C.: FusionFlow: Discrete-Continuous Optimization for Optical Flow Estimation. In: CVPR 2008, Alaska (June 2008)
17. Chang, H., Yang, Q., Auer, M., Parvin, B.: Modeling of Front Evolution with Graph Cut Optimization. In: IEEE International Conference on Image Processing, vol. 1, pp. 241–244 (2007)
18. Zeng, X., Chen, W., Peng, Q.: Efficient solving of the piecewise constant Mumford-Shah model using graph cuts. Technical report, Dept. of Computer Science, Zhejiang University, P.R. China (2006)
19. El-Zehiry, N., Xu, S., Sahoo, P., Elmaghraby, A.: Graph cut optimization for the Mumford-Shah model. In: Proc. of the Int. Conf. on Visualization, Imaging, and Image Processing, Palma de Mallorca, Spain (August 2007)
20. Brox, T., Weickert, J.: Level Set Segmentation With Multiple Regions. IEEE Transactions on Image Processing 15(10), 3213–3218 (2006)
21. Xiao, J., Shah, M.: Motion Layer Extraction in the Presence of Occlusion Using Graph Cuts. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1644–1659 (2005)
22. Vicente, S., Kolmogorov, V., Rother, C.: Graph cut based image segmentation with connectivity priors. In: CVPR 2008, Alaska (June 2008)
23. Greig, D., Porteous, B., Seheult, A.: Exact maximum a posteriori estimation for binary images. Jour. of the Roy. Stat. Soc. Series B 51(2), 271–279 (1989)
24. Mansouri, A.-R., Mitiche, A., Vazquez, C.: Multiregion competition: A Level Set Extension of Region Competition to Multiple Region Image Partitioning. Computer Vision and Image Understanding 101(3), 137–150 (2006)
25. Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.M.: The Amsterdam Library of Object Images. International Journal of Computer Vision 61(1), 103–122 (2005)
26. Vazquez, C., Mitiche, A., Laganiere, R.: Joint Multiregion Segmentation and Parametric Estimation of Image Motion by Basis Function Representation and Level Set Evolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 782–793 (2006)
27. Aubert, G., Kornprobst, P.: Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations. Springer, New York (2006)
28. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley Interscience, Hoboken (2000)
A Graph-Based Approach for Image Segmentation

Thang V. Le1, Casimir A. Kulikowski1, and Ilya B. Muchnik2

1 Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA
2 DIMACS, Rutgers University, Piscataway, NJ 08854, USA

Abstract. We present a novel graph-based approach to image segmentation. The objective is to partition images such that nearby pixels with similar colors or greyscale intensities belong to the same segment. A graph representing an image is derived from the similarity between the pixels and partitioned by a computationally efficient graph clustering method, which identifies representative nodes for each cluster and then expands them to obtain complete clusters of the graph. Experiments with synthetic and natural images are presented. A comparison with the well-known graph clustering method of normalized cuts shows that our approach is faster and produces segmentations that are in better agreement with visual assessment on original images.
1 Introduction

Image segmentation is a challenging problem in computer vision. Depending on user objectives, different definitions and criteria have been proposed and employed for image segmentation. In this paper, we address the problem of segmenting a greyscale or color image into a set of disjoint regions such that each region is composed of nearby pixels with similar intensities or colors. We represent an image by a proximity graph in which nodes correspond to image pixels, and edges reflect pairwise similarities between the pixels. Weights of edges are computed by a similarity function based on properties of the corresponding pixels such as their location, brightness and color. With this representation, image segmentation can be solved by graph clustering methods. Let us consider an undirected graph G = (V, E, W), where the set of nodes V represents a set of data objects, the set of edges E represents the relationships between data objects, and W is a symmetric matrix whose entry wij ∈ [0, 1] is the weight of the edge between nodes i and j. If G is a proximity graph, then the edge weight wij represents the degree of similarity between the objects corresponding to i and j; a higher value of wij implies a higher degree of similarity between i and j. In the graph clustering problem, a graph is partitioned into subgraphs such that nodes of a subgraph are strongly or densely connected, while nodes belonging to different subgraphs are weakly or sparsely connected. In other words, we discover subgraphs such that the sum of weights of edges inside a subgraph is high, while the sum of weights of edges connecting different subgraphs is low. Applying a graph clustering algorithm to a proximity graph will partition it into subgraphs, such that each subgraph corresponds to a group of similar objects, which are dissimilar to objects of groups corresponding to other subgraphs. Thus, the set of nodes of each cluster of the proximity graph representing an image corresponds to the set of pixels of a segment of that image.
In this paper, we extend a new graph clustering technique [1] to tackle the problem of segmenting greyscale and color images. The clustering approach used here is called the coring method and it works very well if each cluster of the data consists of a region of high density surrounded by a region of low density. We describe the graph representation for images in Section 2. The segmentation algorithm is discussed together with a calibration example for parameter determination in Section 3. Section 4 presents our comparisons with the normalized cut approach, tested for efficiency and segmentation performance on images taken from the Berkeley dataset [5].
2 Graph-Based Representation for Images

Based on a digital image, we construct a proximity graph G = (V, E, W). Each node of V represents a pixel, and the weight wij of the edge between two nodes corresponding to pixels i and j reflects the likelihood that i and j belong to the same segment in the image. Since we want to group nearby pixels that have a similar intensity/color, the weights of graph edges are computed by a likelihood function based on the location and intensity/color of neighboring pixels. For greyscale images, we use the weight function described in [2]:
$$
w_{ij} =
\begin{cases}
\exp\!\left(-\left(\dfrac{I(i)-I(j)}{\sigma_I}\right)^{2}-\left(\dfrac{\mathrm{dist}(i,j)}{\sigma_d}\right)^{2}\right) & \text{if } \mathrm{dist}(i,j) < r,\\
0 & \text{otherwise},
\end{cases}
\qquad (1)
$$
where I(i) ∈ [0,1] is the intensity of pixel i, dist(i, j) is the Euclidean distance in pixels between i and j. For color images, we replace the difference of intensity in (1) by the normalized Euclidean distance between color components of the pixels:
$$
w_{ij} =
\begin{cases}
\exp\!\left(-\left(\dfrac{\lVert C(i)-C(j)\rVert_{2}}{\sqrt{3}\,\sigma_I}\right)^{2}-\left(\dfrac{\mathrm{dist}(i,j)}{\sigma_d}\right)^{2}\right) & \text{if } \mathrm{dist}(i,j) < r,\\
0 & \text{otherwise},
\end{cases}
\qquad (2)
$$
where C(i) is a vector of three features, namely the red, green, and blue color components of pixel i, so ‖C(i) − C(j)‖₂ ∈ [0, √3]. There is an edge linking two nodes only if the distance between the corresponding pixels is less than r pixels, so each node is connected to approximately πr² other nodes. The proximity graph is very sparse as πr² ≪ |V|. By using the above weight functions, strong edges exist between nodes whose corresponding pixels are close to each other and have similar intensity/color. Therefore, pixels inside a segment with homogeneous intensity/color (i.e., the inner or core region of a segment) have their nodes in the proximity graph strongly connected. On the other hand, pixels at boundaries of segments often have neighbor pixels with dissimilar intensity/color, so the corresponding nodes are usually weakly connected to one another. Note that by using the weight functions (1) and (2), we can easily evaluate and validate segmentation results by visual inspection and assessment, as pixels with similar intensity or color should be in the same segment. A good result for a graph-based segmentation method obviously depends on the condition that segments of the image translate into well-separated clusters of the proximity
graph. The settings for parameters σI, σd, and r are therefore important as they determine how images are transformed into proximity graphs. The value of σI is a trade-off between the similarity of intensity/color of pixels in the same segment and the dissimilarity of intensity/color of pixels belonging to different segments. A higher value for σI would allow a higher tolerance for differences of pixel intensity/color within each segment; however, it would then be harder to distinguish two segments that have a similar average intensity/color. The setting of σI should depend on the contrast of the image: low-contrast images may use a smaller value while high-contrast images may use a larger value of σI. Parameters σd and r specify how spatial information is incorporated into the weight function. They determine the likelihood that neighboring pixels belong to the same segment. Higher values for σd and r make a segment span greater distances over regions of heterogeneous intensity/color. This is a trade-off between detecting weakly separated segments and not breaking a large segment that has some heterogeneous regions inside it into smaller parts. We may need to increase the values of σd and r for larger images because of the likelihood that large images contain larger segments. We typically set σI = 0.07, σd = 8, and r = 11 for images of size less than 200×300. In our experiments, these settings are robust as they worked well across a wide range of image types.
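A minimal sketch of this proximity-graph construction for a greyscale image, following weight function (1) with the default parameters quoted above; the sparse-matrix representation is an implementation choice, not something prescribed by the paper:

```python
import numpy as np
from scipy.sparse import lil_matrix

def proximity_graph(image, sigma_i=0.07, sigma_d=8.0, r=11):
    """Build the sparse symmetric weight matrix W of the proximity graph for a
    greyscale image with intensities in [0, 1], using weight function (1)."""
    h, w = image.shape
    n = h * w
    W = lil_matrix((n, n))
    rad = int(np.ceil(r))
    for y in range(h):
        for x in range(w):
            i = y * w + x
            # visit each neighbouring pair once (dy > 0, or dy == 0 and dx > 0)
            for dy in range(0, rad + 1):
                for dx in range(-rad, rad + 1):
                    if dy == 0 and dx <= 0:
                        continue
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue
                    dist = np.hypot(dy, dx)
                    if dist >= r:
                        continue
                    j = yy * w + xx
                    dI = image[y, x] - image[yy, xx]
                    wij = np.exp(-(dI / sigma_i) ** 2 - (dist / sigma_d) ** 2)
                    W[i, j] = wij
                    W[j, i] = wij
    return W.tocsr()
```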
3 Image Segmentation Algorithm

As shown in Section 2, an image can be represented by a proximity graph such that segments of the image correspond to clusters of the graph and each cluster has a region of high density (at the core of the corresponding segment) surrounded by a region of low density (at the boundaries of the corresponding segment). To find clusters in proximity graphs of images, we use a new graph clustering method described in [1]. The key idea is that cores of clusters are identifiable by analyzing neighborhood relationships between objects in a particular space which reflects object similarities. In most practical problems, direct analysis is unrealistic due to the high dimensionality of the space; a heuristic reduces this to a one-dimensional ordinal sequence of density variation for the set of objects contained in a graph G. The local density at node i of a subgraph H ⊆ V is measured by a function d(i, H):

$$d(i, H) = \sum_{j \in H} w_{ij}. \qquad (3)$$

Based on d(i, H), we define the minimum density of H as follows:

$$D(H) = \min_{i \in H} d(i, H). \qquad (4)$$
The node m = argmin_{i∈H} d(i, H) is called the weakest node of H as it has the minimum local density. The coring method computes the minimum density value D while the weakest node is iteratively removed from the graph. If clusters of the graph have a dense core, we can identify nodes belonging to cluster cores by analyzing the variation of D values. Specifically, if there is a significant drop in the D value after the removal of a node, this node is highly connected with other nodes in a dense region and it is potentially a core node, because its elimination drastically reduces the density of the region around it.
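A minimal sketch of this density-variation computation, assuming the weight matrix W is available as a dense symmetric NumPy array; the authors use a Fibonacci heap over the sparse graph, whereas this illustrative version simply rescans for the minimum at each step:

```python
import numpy as np

def density_variation_sequence(W):
    """Iteratively remove the weakest node and record the minimum density.
    W: symmetric (n, n) array of edge weights.  Returns (D, M) where
    D[t] is the minimum density and M[t] the weakest node at iteration t."""
    n = W.shape[0]
    alive = np.ones(n, dtype=bool)
    d = W.sum(axis=1).astype(float)     # local density d(i, H) with H = V
    D, M = [], []
    for _ in range(n):
        live = np.where(alive)[0]
        m = live[np.argmin(d[live])]    # weakest node of the current subgraph
        D.append(d[m])
        M.append(m)
        alive[m] = False
        d[alive] -= W[alive, m]         # update densities of the remaining nodes
    return np.array(D), np.array(M)
```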
3.1 Image Segmentation Algorithm

In this section, we describe the segmentation algorithm, which partitions an image by building and clustering its proximity graph. The algorithm consists of 5 steps.

Input: An image I.
Output: Segmentation of the image I.
1. Build a proximity graph G for the input image I.
2. Compute the sequence of density variation for G.
3. Identify a set of core pixels based on the sequence of density variation.
4. Partition the set of core pixels into groups.
5. Expand the groups of core pixels to get the image segmentation.

3.1.1 Build a Proximity Graph G for the Input Image

Inputs: Image I; parameters σI, σd, and r.
Output: Proximity graph G.
  Create a node in G for each pixel of I.
  for each node v of G
    Create edges in G connecting v with the nodes corresponding to pixels located within a distance r from the pixel of v; edge weights are computed by (1) if I is a greyscale image, and by (2) if I is a color image.
  return

Each pixel is connected to its neighbors within a radius of r pixels. So roughly each node of the proximity graph has about πr² incident edges, and therefore we can estimate 2|E| ≈ πr²|V|. The time complexity to build this graph is O(|E|), which is equivalent to O(|V|) because of the sparseness of the proximity graph.

3.1.2 Compute the Sequence of Density Variation for G

The sequence of density variation for a graph G is constructed iteratively. At each iteration t, we determine the minimum density Dt by (3) and (4) and the weakest pixel Mt. The procedure produces the sequences of Dt values and Mt pixels.

Input: Proximity graph G = (V, E, W).
Output: Sequences of Dt and Mt.
  n ← |V|
  H ← V
  for t = < 1, 2, ..., n−1, n >
    Dt ← min_{i∈H} Σ_{j∈H} wij
    Mt ← argmin_{i∈H} Σ_{j∈H} wij
    H ← H − Mt
  return

The procedure computes the local density at every node and then iterates |V| times. At each iteration, we find the pixel with the minimum density, remove it, and update the local densities of its neighbors on the graph. Using a Fibonacci heap, we can
quickly extract the minimum value and decrease key values for neighboring pixels. The amortized cost of this step is O(|E| + |V| log |V|).

3.1.3 Identify a Set of Core Pixels

Based on the sequences of Dt and Mt, core pixels are identified using two parameters δ ∈ [0,1) and β ∈ N. The role of δ is to specify the minimum rate of decrease for the D values of the core pixels, and β controls the minimum size of a group of successive core pixels in the Mt sequence.

Inputs: Sequences of Dt and Mt; parameters δ, β.
Output: A set of core pixels.
  Compute the local rates of decrease Rt on the Dt sequence: Rt = (Dt − Dt+1)/Dt.
  Sort the list of positive Rt in ascending order.
  α ← the value of Rt at the position δ relative to the beginning of the sorted list of positive Rt.
  for t = < 1, 2, ..., n−1, n >
    Mt is extracted as a core pixel if it is among a set of at least β successive Mt that have the corresponding Rt > α.
  return

The procedure scans the sequences of Dt and Mt to find core pixels satisfying the conditions, so it runs in O(|V| + |Vp| log |Vp|), where Vp is the set of pixels that have a positive rate of decrease; note that |Vp| ≪ |V|.

3.1.4 Partition the Set of Core Pixels

The set of core pixels produced by the last step is partitioned into groups; each group is considered the core of a segment.

Input: The set of core pixels.
Output: Cores of segments.
  In the graph induced by the core pixels:
    Remove edges whose weights are smaller than θ.
    Find connected components of the core graph; each component represents the core of a segment.
  return

Weak edges are removed to disconnect neighboring core groups which happen to reside within r pixels of each other. We fix the threshold θ at a small value of 0.1. Breadth-first or depth-first search is used to find connected components in the graph of the core pixels. Its time complexity is O(|Vc| + |Ec|), where Vc and Ec are the sets of core pixels and incident edges, respectively. Note that |Vc| ≪ |V| and |Ec| ≪ |E|.

3.1.5 Expand Core Groups to Find Image Segmentation

The sequence of density variation progresses from low- to high-density regions on the proximity graph. Therefore, in this expanding step, we scan the Mt sequence in the backwards direction to go from dense to sparse regions of the graph. While scanning the sequence, we assign pixels to the most similar segment.
Inputs: Proximity graph G = (V, E, W); Mt sequence; cores of segments.
Output: Segmentation S.
  n ← |V|
  S ← {cores of segments}
  L ← {}
  for t = < n, n−1, ..., 2, 1 >
    if Mt is not a core pixel then
      m1 ← max_{C∈S} average_{i∈C, i∉L, w_{Mt,i}>0} w_{Mt,i}
      s ← argmax_{C∈S} average_{i∈C, i∉L, w_{Mt,i}>0} w_{Mt,i}
      m2 ← max2_{C∈S} average_{i∈C, i∉L, w_{Mt,i}>0} w_{Mt,i}
      if m1 > 0 then
        Add Mt to segment s of S.
        if m2 > λ*m1 then L ← L ∪ {Mt} end if
      else
        Add a new segment containing Mt to S.
      end if
    end if
  return

In the above procedure, max2 is a function for determining the second maximum value of a set. The similarity between a pixel Mt and a segment C ∈ S is computed by average_{i∈C, i∉L, w_{Mt,i}>0} w_{Mt,i}, where L is the set of low-confidence pixels created with the condition m2 > λ*m1. We typically fix the λ value at 0.5, so a pixel belongs to L if its similarity with the second-nearest segment is greater than half of its similarity with the nearest segment. Note that if a pixel is not connected to any known segment, a new segment will be created to accommodate it. This situation may occur if the values of parameters δ and β are so high that step 3 excludes all the representatives of some strong segments from the set of core pixels. With the graph adjacency-list representation, the time complexity of this step is O(|E|).

3.2 Experimental Study for Determining Parameter Settings

Fig. 1 (a) is a greyscale image consisting of 7 regions of nearby pixels with similar intensities. To illustrate the role of core pixels for finding a segmentation of the image, Fig. 1 (b) shows the location of core pixels with β = 3, δ = 90%. The set of core pixels consists of 7 connected components representing the cores of 7 segments. So this number of cores matches the number of regions in the original image. Expanding these cores gives us the expected segmentation as in Fig. 1 (e), where different segments are shown by different intensities. It is interesting that the same parameter settings work well for segmenting a wide range of more complex natural images, such as those from the widely referenced Berkeley dataset [4] as in Fig. 5. Here, along with the original images, we show the segmentations and the derived boundaries. By visual assessment, these results satisfy our original objective of grouping nearby pixels with similar color into the same segment.
Fig. 1. (a) is a greyscale image. (b), (c), and (d) show core pixels when β = 3, δ = 90%, 96%, and 99%, respectively. The segmentation result (e) is the same for any settings.
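A minimal sketch of how δ and β act on the (Dt, Mt) sequences to select core pixels, under the assumption that δ is interpreted as a position (quantile) in the sorted list of positive rates of decrease, as described in Sect. 3.1.3:

```python
import numpy as np

def core_pixels(D, M, delta=0.97, beta=3):
    """Select core pixels from the density-variation sequence.
    D, M: arrays produced while removing the weakest node at each step.
    delta: position in the sorted positive rates of decrease giving threshold alpha.
    beta: minimum length of a run of successive qualifying pixels."""
    R = np.zeros(len(D))
    R[:-1] = (D[:-1] - D[1:]) / np.maximum(D[:-1], 1e-12)  # local rates of decrease
    positive = np.sort(R[R > 0])
    alpha = positive[int(delta * (len(positive) - 1))] if len(positive) else np.inf
    qualifies = R > alpha
    cores, t = [], 0
    while t < len(M):                      # keep only runs of length >= beta
        if qualifies[t]:
            start = t
            while t < len(M) and qualifies[t]:
                t += 1
            if t - start >= beta:
                cores.extend(M[start:t])
        else:
            t += 1
    return np.array(cores, dtype=int)
```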
Since the number of nodes is finite, there are a limited number of possible settings for β and δ. Parameter β is an integer whose role is to eliminate noisy pixels from the set of pixels with high rates of decrease in their D values. Usually we set the β value to 3 or 4. The main parameter of the clustering method is δ; changing δ can result in increasing or decreasing the number of segments. In (b), (c), and (d) of Fig. 1, we illustrate the effect of δ on the set of core pixels. The cores of segments shrink if we increase δ, and they are enlarged if δ is decreased. In general, we set δ to a value greater than 90% because, for noisy images, setting δ too low may produce core pixels which are not very reliable. For images that contain a mixture of segments with different sizes, by increasing or decreasing δ, we can obtain a coarse or fine-grained segmentation. In Fig. 2, (a) shows a greyscale image, and (b), (c), and (d) show segmentation results of (a) with different values of δ. Higher δ produces coarser segmentation while lower δ yields finer segmentation. This demonstrates the point that core pixels are arranged in an order such that the pixels having higher rates of decrease in their D values belong to the cores of stronger segments.
Fig. 2. (a) is a greyscale image. (b), (c), and (d) show segmentations of the image with δ = 97%, 98%, and 99%, respectively.
4 Comparisons with the Normalized Cut Method [2]

Spectral clustering in general and the normalized cut method in particular are well-known graph clustering approaches [2, 6, 7]. Application of normalized cut to image segmentation has been shown in [2] and an implementation of this method developed by its authors is available on the web at www.cis.upenn.edu/~jshi/software. The method's basic idea is to partition a graph G = (V, E, W) into k subgraphs A1, A2, ..., Ak based on the minimum k-way normalized cut, which is defined by:

$$
N\mathrm{cut}_k = \sum_{i=1}^{k} \frac{\sum_{u \in A_i,\, v \in V - A_i} w_{uv}}{\sum_{u \in A_i,\, v \in V} w_{uv}}. \qquad (5)
$$
Finding the exact minimum normalized cut is NP-hard, so an approximate solution is estimated using eigenvectors of the normalized Laplacian matrix L = I − D⁻¹W, where D is the diagonal matrix of vertex degrees. The complexity of solving this approximation is relatively high, about O(|V|^2.5) for sparse graphs [2]. As is true of many clustering methods, the number of clusters k has to be pre-specified. This is problematic for image segmentation because the number of segments is highly variable depending on the scene in images. The number of segments of an image is not something that we would think about in the first place, but rather is a natural result of the perception process. Yet a more significant problem with minimizing normalized cuts is that the normalizing factors in the criterion function (5) make the method favor cutting the graph into clusters of similar sizes. As a result, small clusters are omitted very easily while large clusters are often split up into smaller parts. In Fig. 3, (b) shows the segmentation of the image in (a) by a two-way normalized cut. The cut partitions the image into two segments similar in size. It fails to find the central oval as one segment because of the disparity between the number of pixels inside and outside the oval. In Fig. 4, (g), (h), (i), (j), (k), and (l) show the segmentations for the image in (a) by normalized cuts with k = 2, 3, 4, 5, 6 and 7, respectively. Clearly, for any k, the image is partitioned into segments of roughly similar sizes. Moreover, varying parameter k dramatically changes the segmentation result.
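A sketch of this spectral relaxation, assuming SciPy and scikit-learn are available; it clusters the rows of the k smallest eigenvectors of a normalized Laplacian with k-means, which is one common way of rounding the relaxed solution and not necessarily the discretization used in [2]:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def normalized_cut_labels(W, k):
    """Approximate k-way normalized cut of a sparse symmetric weight matrix W."""
    d = np.asarray(W.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    # symmetric normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = sp.identity(W.shape[0]) - d_inv_sqrt @ W @ d_inv_sqrt
    # k eigenvectors with smallest eigenvalues
    vals, vecs = eigsh(L_sym, k=k, which='SA')
    # row-normalize the spectral embedding and cluster it
    rows = vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(rows)
```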
Fig. 3. (b) shows the segmentation result by the normalized cut method on the image in (a). Images (c) and (e) are created by adding different kinds of noise to (a). (d) and (f) show the segmentations by our method on (c) and (e), respectively.
Fig. 4. Effects of parameters on segmentation results. (a) is a greyscale image from the Berkeley dataset. (b), (c), (d), (e), and (f) show segmentations by our method with δ = 99%, 97%, 95%, 93%, and 90%, respectively. (g), (h), (i), (j), (k), and (l) show segmentations of the normalized cut method on the same proximity graph with k = 2, 3, 4, 5, 6 and 7, respectively.
Fig. 5. Segmentations and derived boundaries on images from the Berkeley dataset. The same parameter settings β = 3, δ = 97% are used for all the images.

Table 1. Execution times of the normalized cut and coring methods for clustering graphs

#nodes      #edges      Normalized cut (seconds)   Coring method (seconds)
30×10³      4.7×10⁶     9                          0.5
45×10³      7.1×10⁶     16                         0.7
60×10³      10.7×10⁶    30                         1.0
75×10³      12.2×10⁶    60                         1.2
90×10³      14.8×10⁶    106                        1.4
105×10³     17.6×10⁶    146                        1.6
120×10³     18.7×10⁶    n/a                        1.8
500×10³     50×10⁶      n/a                        5.2
1000×10³    70×10⁶      n/a                        7.5
Based on the same proximity graphs, our algorithm produces better results. In Fig. 4, (b), (c), (d), (e), and (f) are our segmentations for the image in (a) with β = 3 and δ = 99%, 97%, 95%, 93%, and 90%, respectively. Note that lowering δ yields finer segmentation and the results are very consistent in their sensitivity to perturbations of the different parameter settings. In terms of speed, our implementation has a total time complexity of O(|E| + |V| log |V|). Experiments show that it runs much faster than the normalized cut method on the same proximity graphs. In Tab. 1, we show the execution times of the two methods on different proximity graphs using a PC with a Core 2 Duo 2.4 GHz CPU and 2 GB RAM. It can be seen that the running time of the coring method is roughly linear in the number of edges of the proximity graphs.
The running times of the normalized cut are the average time for partitioning the graph into 2, 3, and 4 segments. For the cases where the graphs contain more than 105×10³ nodes and 18×10⁶ edges, the normalized cut method fails to execute because of an 'Out of memory' error. In addition, an advantage of the coring method is that we can change β or δ and quickly obtain a new result by re-executing the fast steps 3, 4 and 5 of the method. In contrast, for the normalized cut method, changing k will result in re-computing the cut from scratch.
5 Conclusion

We have developed and evaluated a graph-based method for image segmentation. Based on proximity graphs, we partition images into segments of nearby pixels with similar intensities or colors. The method is simple and very fast. Because it exploits the pixels in the cores of segments, which are generally stable and reliable, the method is robust to noise. As in Fig. 3, by adding noise to (a) we obtain images (c) and (e); the segmentations of these noisy images, shown in (d) and (f), remain stable. In addition to speed and robustness, an advantage is that parameters can be adjusted to yield a range of results from a coarse to a fine-grained segmentation. The current weight functions (1) and (2) do not take into account texture information. It is possible to apply the method to texture segmentation by additionally incorporating statistical texture information around each image pixel into the weight functions, which should help refine the segmentation for images in which texture is an important feature.
References

1. Le, T., Kulikowski, C., Muchnik, I.: Coring method for clustering a graph, DIMACS Technical Report (2008)
2. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 888–905 (2000)
3. Charikar, M.: Greedy approximation algorithms for finding dense components in a graph. In: Jansen, K., Khuller, S. (eds.) APPROX 2000. LNCS, vol. 1913, pp. 84–95. Springer, Heidelberg (2000)
4. Berkeley segmentation dataset, http://www.cs.berkeley.edu/projects/vision/bsds/
5. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. 8th International Conference on Computer Vision, pp. 416–423 (2001)
6. Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing, 395–416 (2007)
7. Kannan, R., Vempala, S., Vetta, A.: On Clusterings: Good, Bad and Spectral. In: Proc. 41st Annual Symposium on the Foundation of Computer Science, pp. 367–380 (2000)
Active Contours Driven by Supervised Binary Classifiers for Texture Segmentation

Julien Olivier, Romuald Boné, Jean-Jacques Rousselle, and Hubert Cardot

Université François Rabelais de Tours, Laboratoire Informatique, 64 Avenue Jean Portalis, 37200 Tours, France
{julien.olivier,romuald.bone,rousselle,hubert.cardot}@univ-tours.fr

Abstract. In this paper, we propose a new active contour model for supervised texture segmentation driven by a binary classifier instead of a standard motion equation. A recent level set implementation developed by Shi et al. in [1] is employed in an original way to introduce the classifier in the active contour. Carried out on a learning image, an expert segmentation is used to build the learning dataset composed of samples defined by their Haralick texture features. Then, the pre-learned classifier is used to drive the active contour among several test images. Results of three active contours driven by binary classifiers are presented: a k-nearest-neighbors model, a support vector machine model and a neural network model. Results are presented on medical echographic images and remote sensing images and compared to the Chan-Vese region-based active contour in terms of accuracy, bringing out the high performances of the proposed models.
1 Introduction
Initially developed by Kass et al. in [2], active contours or snakes are powerful segmentation tools widely used in the segmentation field. Snakes are defined as parameterized curves C mapping a parameter s to a point (x, y) in the image Ω. The curve is initialized in the image Ω and evolves, under some constraints, according to its normal and tangential directions until it stops on the boundary of the object. As the curve evolves in time, C(s, t) represents the family of curves obtained during the evolution. An energy attached to the model is defined to control its evolution as:

$$F(C(s,t)) = \int f(s,t)\,ds. \qquad (1)$$

The model evolves by minimizing F(C(s, t)), so that the final curve is on a local minimum of F. Originally, the energy attached to the model was obtained by integrating f only on the curve C. These models are called boundary-based, but their applications are restricted to objects whose boundaries are defined by gradient, and they are not very efficient on textured images. Later, region-based models emerged by integrating f inside the curve or over the entire domain of
Ω. Several region-based active contours have been developed over the past ten years [3,4] and a review of the main models can be found in [5]. Originally, Kass et al. chose to represent the curve explicitly (the curve is sampled and its evolution is guided by several control points), but this makes it impossible for the model to handle topology changes without additional implementation. Yet, an implicit representation for active contours called the level set implementation emerged from the works of Osher and Sethian [6]. The main advantage of the level-set implementation is to allow active contours to handle topology changes automatically. The curve can naturally split or merge with others without any additional implementation. Let us consider that the family of parameterized curves C(s, t) evolves according to the following partial differential equation:

$$\frac{\partial C(s,t)}{\partial t} = F\,\mathbf{N}. \qquad (2)$$
where F is the speed function of the model and N its inward normal vector. The main idea of the level set implementation is that the curve C(s, t) can be seen as the zero level of a Lipschitz function φ(x, t), x being a pixel of the image. The evolution of the active contour (i.e., the zero level of φ(x, t)) will be implicitly deduced from the entire evolution of φ(x, t), given by:

$$\frac{\partial \phi(x,t)}{\partial t} = F(x,t)\,\lVert\nabla \phi(x,t)\rVert, \qquad \forall x \in \Omega,\ \forall t \ge 0. \qquad (3)$$
PDEs of models implicitly represented with level sets are usually deduced from the energy of the curve using Euler-Lagrange equations (with a prior use of the Green-Riemann theorem for region-based active contours) [7,4], even though the first models implemented with level sets were geometric ones, directly defined by their PDEs [8,9]. Unfortunately, the level-set method requires significant computational time. Several methods have been developed in order to accelerate it, such as the narrow band implementation [10]. Shi et al. have recently developed in [1] an acceleration method based on the narrow band implementation. The authors use two lists to represent the active contour: the list of outside boundary points Lout and the list of inside boundary points Lin:

$$L_{out} = \{x \mid \phi(x) > 0,\ \exists\, y \in N(x),\ \phi(y) < 0\}. \qquad (4)$$

$$L_{in} = \{x \mid \phi(x) < 0,\ \exists\, y \in N(x),\ \phi(y) > 0\}. \qquad (5)$$
with N(x) the discrete neighborhood of x composed of the eight nearest pixels. The authors assume φ(x) to take only four integer values, according to the position of x:

$$
\phi(x) =
\begin{cases}
1 & \text{if } x \in L_{out}\\
-1 & \text{if } x \in L_{in}\\
3 & \text{if } x \text{ is outside } C \text{ and } x \notin L_{out}\\
-3 & \text{if } x \text{ is inside } C \text{ and } x \notin L_{in}.
\end{cases}
\qquad (6)
$$
Then, the discrete optimality condition for the curve is defined as: The curve C with boundary points Lin and Lout is optimal if the speed function F satisfies: F (x) < 0 ∀x ∈ Lout and F (x) > 0 ∀x ∈ Lin . (7) Until this optimality condition is reached, the speed function F is calculated for every point in Lin and Lout . If F (x) > 0 at a point in Lout the curve is moved outward. If F (x) < 0 at a point in Lin the curve is moved inward. Once the curve has moved, the two lists are updated. Thus, the main idea of this method is to make the curve evolve without solving the complete PDE of the model (which requires significant computational time) but only by computing the speed function F to determine its sign. In other words, the role of the speed function is only to determine if each pixel of the curve belongs to the exterior or interior of the object to be segmented, which makes it act exactly as a binary classifier. This paper proposes to replace the decision made with the speed function of the active contour by a supervised binary classifier. Using Shi et al ’s level set implementation, three models of active contours driven by a classifier are compared: a k-nearest-neighbors model, a support vector machine model and a neural network model. In order to handle complex images, it was decided to characterize pixels with texture features. Among the several approaches widely used in the texture characterization domain (wavelet transform [11], Markov Random Fields [12], Gabor filters [13]), it was decided to use Haralick features from the cooccurrence matrix [14] as several works have brought out their efficiency [15]. The proposed model evolves in two steps as a supervised process. First, the use of supervised classifiers induces to develop a learning phase using classified samples. During an interactive step, an expert segmentation is carried out on a learning image. Then, local Haralick texture features from each pixel of this segmentation are used to create the samples of the learning dataset, allowing the classifier to carry out its learning task. In a second step, the active contour driven by the pre-learned classifier is launched on several test images to carry out the desired segmentations. Fig. (1) depicts the complete segmentation process.
Fig. 1. The complete segmentation process: the expert segmentation is used to determine the learning dataset and carry out the learning task of the classifier; the segmentation is then launched on several test images
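A heavily simplified sketch of the two-list evolution in which the sign of the speed is supplied by a pre-trained classifier, as proposed above; speed_sign is an assumed callable, and the list clean-up steps of the full method of [1] are omitted:

```python
def evolve(phi, speed_sign, n_iters=100):
    """phi: integer map initialized to the values {-3, -1, 1, 3} as in (6).
    speed_sign(y, x): sign of F at pixel (y, x), e.g. the output of a
    pre-trained binary classifier applied to that pixel's texture features."""
    h, w = phi.shape

    def neighbors(y, x):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy, dx) != (0, 0) and 0 <= y + dy < h and 0 <= x + dx < w:
                    yield y + dy, x + dx

    for _ in range(n_iters):
        lout = [(y, x) for y in range(h) for x in range(w) if phi[y, x] == 1]
        lin = [(y, x) for y in range(h) for x in range(w) if phi[y, x] == -1]
        moved = False
        for (y, x) in lout:                 # F > 0: move the curve outward
            if speed_sign(y, x) > 0:
                phi[y, x] = -1
                for yy, xx in neighbors(y, x):
                    if phi[yy, xx] == 3:
                        phi[yy, xx] = 1
                moved = True
        for (y, x) in lin:                  # F < 0: move the curve inward
            if speed_sign(y, x) < 0:
                phi[y, x] = 1
                for yy, xx in neighbors(y, x):
                    if phi[yy, xx] == -3:
                        phi[yy, xx] = -1
                moved = True
        if not moved:                        # discrete optimality condition (7)
            break
    return phi
```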
This article is organized as follows: the next section describes the creation of the learning dataset used by the supervised active contours, and section 3 describes the binary classifiers introduced in the active contours. Section 4 compares the performances of the models with the Chan-Vese active contour on echographic and remote sensing images, and finally, section 5 concludes our work and describes future developments.
2 Extraction of the Learning Dataset
Using supervised classifiers requires the definition of a learning dataset composed of classified samples. In order to determine the samples defining the learning dataset X, during an interactive step, in a learning image Ω*, the user is asked to manually create an expert segmentation giving two ideal regions C^in and C^out. C^in will contain pixels giving information about the object to be segmented, while C^out will be composed of pixels belonging to the background. C^in and C^out may be composed of several sub-regions. For each pixel belonging to C^in and C^out, using a neighborhood window (with a size defined by the user), a co-occurrence matrix is determined by observing the horizontal transition with a distance of one pixel (the transition is considered as symmetric). Then, m Haralick features [14] are computed for each co-occurrence matrix (and therefore for each pixel of C^in and C^out). Thus, a learning dataset X composed of n = n1 + n2 samples, n1 belonging to C^in and n2 belonging to C^out, is determined. A sample s(i) will be described as:

$$
s(i) = \{k_1(i), k_2(i), \ldots, k_m(i), l(i)\}, \qquad
l(i) =
\begin{cases}
-1 & \text{if } s(i) \text{ belongs to } C^{in}\\
1 & \text{if } s(i) \text{ belongs to } C^{out},
\end{cases}
\qquad (8)
$$

with k_1(i), k_2(i), ..., k_m(i) the m Haralick coefficients of sample i and l(i) a label representing its region membership. Fig. (2) shows an example of the interactive step of the process. The classifier used in the segmentation process may have several parameters. For them to be set, the learning dataset is divided into two subsets Xl and Xe such that X = Xl ∪ Xe with Card(Xl) = 2/3 Card(X) and Card(Xe) = 1/3 Card(X).
Fig. 2. Learning image for a series of echographic images: (a) input image, (b) texture features of pixels from C^in (white) and C^out (gray) will compose the learning dataset
By making its parameters vary regularly, the classifier is learned with different configurations on the learning set Xl and tested on the evaluation set Xe . Then, the parameter configuration leading to the best classification rate on Xe is kept.
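A sketch of the per-pixel feature computation, assuming scikit-image is available (graycomatrix/graycoprops, spelled greycomatrix/greycoprops in older releases) and using a handful of co-occurrence statistics as a stand-in for the m Haralick coefficients:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

PROPS = ('contrast', 'dissimilarity', 'homogeneity', 'energy', 'correlation', 'ASM')

def pixel_features(image_u8, y, x, win=11, levels=32):
    """Texture features of the window centred on (y, x) of a uint8 image,
    from the symmetric horizontal co-occurrence matrix at distance 1."""
    r = win // 2
    patch = image_u8[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
    patch = (patch.astype(np.uint16) * levels // 256).astype(np.uint8)  # quantize
    glcm = graycomatrix(patch, distances=[1], angles=[0],
                        levels=levels, symmetric=True, normed=True)
    return np.array([graycoprops(glcm, p)[0, 0] for p in PROPS])
```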
3 Active Contours Driven by Classifiers
It was first chosen to guide the active contour with the k-nearest-neighbors algorithm (KNN), one of the simplest supervised classification algorithms in machine learning. The learning task only consists of storing samples of the learning dataset X and their corresponding labels L. To classify a test sample, elements of the learning dataset are sorted with respect to their distance to the sample. Then, a majority vote among the k neighbors with the minimal distance is carried out and the test sample is assigned to the most represented class. It is usual to use the Euclidean distance to sort the learning samples, but the use of Manhattan or Hamming distances has also been seen in the literature. The choice of parameter k influences the method. Basically, high values of k will prevent the influence of noise on the classification but will blur class boundaries. When dealing with binary classifications, it is recommended to choose odd values for k to avoid tied votes. Subsets Xl and Xe are used to determine the best value of k. Then, the speed function of the active contour applied on test images is given by:

$$F(x) = l_k, \qquad (9)$$
with l_k being the majority label among the k nearest neighbors of the sample represented by the m local Haralick texture features of pixel x. The second classifier we chose to introduce in the evolution of the active contour is a support vector machine. The SVM algorithm, introduced by Vapnik in [16], is the most recent classifier in machine learning. Its efficiency is due to its high generalization performance and its ability to solve classification problems where the learning dataset X does not necessarily possess a linear separation in the feature space. Indeed, the SVM reconsiders the problem by projecting it, using a kernel function, into a space of higher dimension (potentially of infinite dimension) where a linear separation will appear. The SVM incorporated in the active contour evolves with soft margins (which allows some learning samples to be inside the margin or even misclassified) and is driven by the Gaussian kernel. The learning task is achieved using the Sequential Minimal Optimization algorithm (SMO), which determines the support vectors and their associated Lagrange multipliers α*_i (see [17] for details). The two subsets of X are used to determine the best values for C (maximum value of the α*_i) and σ (kernel parameter). Once the SVM is learned, the expression of the classification of a test sample x ∈ IR^m, x ∉ X, is introduced in the speed function of the active contour as:

$$F(x) = \sum_{i=1}^{n} \alpha_i^{*}\, l_i\, K(x_i, x) + w_0. \qquad (10)$$

$$K(x_i, x) = \exp\!\left(-\frac{\lVert x_i - x\rVert^{2}}{2\sigma^{2}}\right). \quad \text{(Gaussian kernel)} \qquad (11)$$
Introduced in 1957 with the works of Rosenblatt, neural networks are biologically based computational models using a connectionist approach to interconnect artificial neurons. One of the most popular models is the multi-layer perceptron (MLP) developed by Rumelhart et al. in [18], as it is known as a universal function approximator. The MLP is a multi-layer feed-forward network composed of one input layer, one hidden layer and one output layer. The MLP used in the active contour model is composed of twelve input neurons (the texture features), n hidden neurons and two output neurons. The synaptic weights are randomly initialized and the learning task is achieved using the error gradient back-propagation algorithm. The two subsets of X are used to determine the best value for the number of hidden neurons. Once the synaptic weights of the network are determined, the value of the output neurons for a test sample feature vector p is given by the output matrix a using:

$$a = f(\mathbf{HW}\, f(\mathbf{IW}\, p + b_1) + b_2), \qquad (12)$$

where IW and HW are respectively the input and hidden layer weight matrices, b_1 and b_2 are the bias matrices introduced respectively in the input and output neurons, and f is the transfer function (log-sigmoid). The classification of the test sample will be given by the output neuron with the highest potential. In order to introduce the classification rule of the neural network in the active contour, we define the speed function to be used as:

$$F(x) = a(0) - a(1). \qquad (13)$$
The output neuron with the higher potential will determine the sign of F . Thus it will guide the evolution of the curve.
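To make the classifier-as-speed-function idea concrete, a sketch using scikit-learn's RBF-kernel SVC (an assumption; the paper trains its own SMO-based SVM), where the signed decision value plays the role of F(x), up to the sign convention chosen for C^in and C^out:

```python
import numpy as np
from sklearn.svm import SVC

def train_speed_function(X_learn, labels):
    """X_learn: (n, m) Haralick feature vectors; labels: +1 / -1 as in (8)."""
    clf = SVC(kernel='rbf', C=10.0, gamma='scale')   # C and gamma are chosen via Xl / Xe in the paper
    clf.fit(X_learn, labels)

    def speed(features_of_pixel):
        # signed decision value, whose sign drives the curve as in (10)
        return float(clf.decision_function(features_of_pixel.reshape(1, -1))[0])

    return speed
```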
4 Experimental Results
Due to the interactive step, the presented models are independent from the image type. As a consequence, it has been decided to test the proposed models on two different image types. Because medical imaging represents an important field in computer vision, a series of echographic images is first presented. This type of image is known as very noisy and the object’s boundaries are not clearly defined. In the second series of images, we present results of segmentation of urban zones in remote sensing images obtained from the ASTER image database [19,20]. The accuracy of our models is evaluated using the generic discrepancy measure based on the partition distance [21]. The partition distance is defined by: Given two partitions P and Q of S, the partition distance is the minimum number of elements that must be deleted from S, so that the two induced partitions (P and Q restricted to the remaining elements) are identical. The generic discrepancy measure is defined as the partition distance between the reference segmentation R and the segmentation under study S, normalized by the number of pixels in the image. Thus, the better the segmentation, the nearer
Table 1. Error comparison using the generic discrepancy measure (×10³) on the echographic image series (Echo) and remote sensing image series (RS)

Method      Echo 1   Echo 2 (fig. 3)   Echo 3   Echo 4 (fig. 4)   Mean value   RS 1    RS 2 (fig. 6)   RS 3 (fig. 7)   RS 4    Mean value
Chan-Vese   5.6      19.6              18.8     10.9              13.7         108.8   507.3           122.9           183.3   230.6
KNN         5.0      15.7              21.9     4.8               11.8         68.6    139.4           104.9           133.8   111.7
SVM         3.7      11.7              16.9     4.0               9.1          112.5   132.5           98.9            108.0   113.0
Neur. Net.  3.5      12.0              16.0     3.7               8.8          67.9    129.3           94.1            100.1   98.0
Fig. 3. Segmentation result comparison for the first echographic image: (a) input image, (b) ground truth, (c) Chan-Vese, (d) KNN model, (e) SVM model, (f) neural network

Fig. 4. Segmentation result comparison for the second echographic image: (a) input image, (b) ground truth, (c) Chan-Vese, (d) KNN model, (e) SVM model, (f) neural network
to zero the evaluation will be. In order to situate our work we also compared our models to the region-based active contour developed by Chan and Vese in [22], which tries to determine the two most homogeneous regions in the image with regard to pixel intensity means. Table (1) shows the high performances of the proposed models by comparing their segmentation error, using the generic discrepancy measure, to that of the Chan-Vese active contour on the two series of test images. Haralick features have been calculated using a neighborhood window with a width of eleven pixels. Figures (3) and (4) show the visual segmentation results of our models on two images of the echographic series, compared to the Chan-Vese model and an expert segmentation representing a ground truth. The models were learned on figure (2). Figure (5) illustrates the learning image of the remote sensing series, and figures (6) and (7) show segmentation results on two images from that series. The presented results bring out the high performances of the active contours driven by classifiers, especially for the neural network and SVM-based models.
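A sketch of the discrepancy computation, assuming SciPy is available; it relies on the standard reduction of the partition distance to an optimal one-to-one matching of segment labels over the contingency table:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def discrepancy(reference, segmentation):
    """Generic discrepancy: partition distance between two label images,
    normalized by the number of pixels."""
    r = reference.ravel()
    s = segmentation.ravel()
    r_ids, r_inv = np.unique(r, return_inverse=True)
    s_ids, s_inv = np.unique(s, return_inverse=True)
    # contingency table: pixel counts shared by each pair of segments
    table = np.zeros((len(r_ids), len(s_ids)), dtype=np.int64)
    np.add.at(table, (r_inv, s_inv), 1)
    # partition distance = n - best one-to-one matching of segments
    row, col = linear_sum_assignment(-table)
    matched = table[row, col].sum()
    return (r.size - matched) / r.size
```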
Fig. 5. Learning image for the remote sensing image series: (a) input image, (b) manual segmentation
Fig. 6. Segmentation result comparison for the first remote sensing image: (a) input image, (b) ground truth, (c) Chan-Vese, (d) KNN model, (e) SVM model, (f) neural network

Fig. 7. Segmentation result comparison for the second remote sensing image: (a) input image, (b) ground truth, (c) Chan-Vese, (d) KNN model, (e) SVM model, (f) neural network

Fig. 8. Comparison of the influence of initialization: (a) Chan-Vese model, (b) classifier-based model (KNN)
Regarding computational times, the segmentation process of these two models is as fast as with the Chan-Vese model, but the segmentation process of the KNN-based model is slower. Because computational time of the learning phase
is higher for the SVM (about half an hour) than for the neural network (about five minutes), the neural network model can be considered the more efficient. Figure (8) illustrates the low dependence of our models on initialization by comparing initializations of the Chan-Vese and KNN-based models. The initial curve of the classifier-based model requires only a few pixels inside the object to ensure a good segmentation. The part of the active contour placed outside the object will collapse and stop on the boundary (or totally disappear) whereas the inside part will automatically grow to reach it. On the contrary, the Chan-Vese model remains strongly dependent on the initialization, as the latter determines the initial values for the region intensity means.
5 Conclusion and Discussion
In this paper we have proposed three models of active contours driven by classifiers, based on a recent fast level set implementation. The learning task of the classifiers is carried out using an expert segmentation; the models are then introduced in active contours and launched on several test images. Experimental results have brought out the high accuracy of the proposed models, especially when using the neural network and the SVM. Furthermore, the three models have shown a great independence from initialization. Because of its high accuracy and fast segmentation process, the neural network model is considered the most efficient. These results show that, although the presented models no longer have terms controlling the geometry of the curve in their motion equation, using only a classifier to guide the evolution is efficient and provides smooth curves. Regarding the learning step of the models, it would be interesting to evaluate how much improvements in the classification rate of each classifier influence the accuracy of the final segmentation. In order to increase the accuracy of the model by taking advantage of several classifiers, a multi-classifier model is under development. The accuracy of the classifiers depends on the features used to determine the learning samples. In this article Haralick texture features were chosen, but in order to take advantage of many other widely used texture features (Gabor filters, wavelet transform, ...), we intend to incorporate more texture descriptors and a feature selection process during the learning phase. Because our models have shown their efficiency on 2D echographic images, we plan to extend the principle of active contours driven by classifiers to 3D echographic images.
References 1. Shi, Y., Karl, W.: A fast level set method without solving PDEs. In: IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, USA, vol. 2, pp. 97–100 (2005) 2. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. Int. Jour. of Comp. Vis. 1, 321–331 (1988)
3. Paragios, N., Deriche, R.: Geodesic active regions and level set methods for supervised texture segmentation. Int. Jour. of Comp. Vis. 46, 223–247 (2002) 4. Zhu, S., Yuille, A.: Region competition: unifying snake/balloon, region growing, and Bayes/MDL/energy for multi-band image segmentation. IEEE trans. on Pat. Anal. and Mach. Int. 18, 884–900 (1996) 5. Jehan-Besson, S., Barlaud, M., Aubert, G.: Dream2s: Deformable regions driven by an eulerian accurate minimization method for image and video segmentation. Int. Jour. of Comp. Vis. 53, 45–70 (2003) 6. Osher, S., Sethian, J.: Fronts propagation with curvature-dependent speed: algorithms based on Hamilton-Jacobi formulations. Jour. of Comp. Phys. 79, 12–49 (1988) 7. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. Int. Jour. of Comp. Vis. 22, 61–79 (1997) 8. Malladi, R., Sethian, J., Vemuri, B.: Shape modeling with front propagation: a level set approach. IEEE trans. on Pat. Anal. and Mach. Int. 17, 158–175 (1995) 9. Caselles, V., Catte, F., Coll, T., Dibos, F.: A geometric model for active contours in image processing. Numerische Mathematik 66, 1–31 (1993) 10. Adalsteinsson, D., Sethian, J.: A fast level set method for propagating interfaces. Jour. of Comp. Phys. 118, 269–277 (1995) 11. Mallat, S.G.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE trans. on Pat. Anal. and Mach. Int. 11, 674–693 (1989) 12. Li, S.Z.: Markov Random Field Modeling in Image Analysis. Springer, Heidelberg (2001) 13. Gabor, D.: Theory of communication. Jour. of the Institute of Electrical Engineers (London) 93, 429–457 (1946) 14. Haralick, R.M.: Textural features for image classification. IEEE Trans. on Systems, Man, and Cybernetics 3, 610–621 (1973) 15. Tesar, L., Smutek, D., Shimizu, A., Kobatake, H.: Medical image segmentation using cooccurrence matrix based texture features calculated on weighted region. In: Proc. of Advances in Computer Science and Technology (2007) 16. Vapnik, V.N.: The nature of statistical learning theory. Springer, New York (1995) 17. Platt, J., Scholkopk, C., Smola, A.: Fast training of support vector machines using sequential minimal optimisation. MIT Press, Cambridge (1999) 18. Rumelhart, D., Hinton, G., Williams, R.: Parallel Distributed Processes, vol. 1. MIT Press, Cambridge (1986) 19. Lin, J.C., Tsai, W.H.: Feature-preserving clustering of 2-d data for two-class problems using analytical formulas: An automatic and fast approach. IEEE trans. on Pat. Anal. and Mach. Int. 16, 554–560 (1994) 20. Jimenez-Munoz, J.C., Sobrino, J.A.: Feasibility of retrieving land-surface temperature from ASTER TIR bands using two-channel algorithms: A case study of agricultural areas. IEEE Geoscience and Remote Sensing Letters 4, 60–64 (2007) 21. Cardoso, J.S., Corte-Real, L.: Toward a generic evaluation of image segmentation. IEEE Trans. on Image Processing 14, 1773–1782 (2005) 22. Chan, T., Vese, L.: Active contours without edges. IEEE Trans. on Image Processing 10, 266–277 (2001)
Proximity Graphs Based Multi-scale Image Segmentation

Alexei N. Skurikhin

MS D436, Space and Remote Sensing Group, Los Alamos National Laboratory, Los Alamos, NM, 87545, USA
[email protected]
Abstract. We present a novel multi-scale image segmentation approach based on irregular triangular and polygonal tessellations produced by proximity graphs. Our approach consists of two separate stages: polygonal seeds generation followed by an iterative bottom-up polygon agglomeration. We employ constrained Delaunay triangulation combined with the principles known from visual perception to extract an initial irregular polygonal tessellation of the image. These initial polygons are built upon a triangular mesh composed of irregular sized triangles, whose spatial arrangement is adapted to the image content. We represent the image as a graph with vertices corresponding to the built polygons and edges reflecting polygon relations. The segmentation problem is then formulated as Minimum Spanning Tree (MST) construction. We build a successive fine-to-coarse hierarchy of irregular polygonal partitions by an iterative graph contraction. It uses local information and merges the polygons bottom-up based on local region- and edge- based characteristics.
1 Introduction

The problem of image segmentation remains one of the greatest challenges for computer vision. Many image segmentation algorithms have been developed. Early in the 20th century the Gestalt psychology showed the importance of perceptual organization for image segmentation and interpretation. Wertheimer approached the problem by postulating principles that affect perceptual grouping, such as proximity, similarity, good continuation, and symmetry, that can be used for image segmentation [18]. Inspired by visual psychology, a great number of image segmentation methods have been developed, e.g. [1, 2, 8, 11]. Many of them try to partition the image by optimizing a suitable cost function that combines different criteria of image element grouping. Recently, progress has been achieved in the area of the graph-theoretic approach to image segmentation and perceptual grouping. According to this approach, image structures such as pixels and edges are described using a graph, and the image segmentation is formulated as a graph-partitioning problem. Typically segmentation is achieved by minimizing a criterion that takes into account the similarity between possible partitions relative to the similarity within each partition [3, 17, 20]. Building a global cost criterion, capturing salient relationships among the image elements, and making its optimization computationally tractable continues to be a difficult problem.
Another category of algorithms seeks optimal hierarchic image partitioning through a sequence of local computations. We emphasize approaches that are based on the use of proximity graphs, specifically the MST. In contrast with global optimization approaches, MST-based image segmentation seeks an image partitioning by iteratively linking image elements through the lowest cost tree edges, which represent the similarity of neighboring elements. One of the earliest applications of tree-based data clustering to visual point data sets analyzed the histogram of MST edges and investigated tree characteristics such as MST "relative compactness", tree diameter, and point densities [21]. In [9] the tree-based concept was applied to image segmentation. It was suggested to use a global homogeneity criterion to control construction of an irregular pyramid starting from a regularly sampled pixel grid. [12, 10] use irregular tessellations to generate an adaptive multi-scale image representation. The approach employs an irregular sampling of the pixel grid to build the initial (lower scale) image representations. The irregular sampling hierarchy is then recursively built from the lower scales. The result depends on the stochastic nature of the sampling procedure. [19] uses Kruskal's algorithm to construct an MST of the image from a regularly sampled pixel grid. The tree is then partitioned by an optimization algorithm into subtrees based on the subtrees' spectral similarities. The set of produced subtrees represents a sought image partition. Similar to [19], [5] starts from a regular pixel grid and uses Kruskal's algorithm to construct an MST of the image. However, MST construction is based on thresholding a ratio of the variation between neighboring pixel patches and the variation within the patches. To avoid over-fragmentation (generation of too many small regions), the approach adjusts the measure of variation using the sizes of patches. The extent of this adjustment controls how easily small patches are agglomerated in comparison with the larger ones. While it works in many situations, nontrivial optimization of this size-based term is required for applications where salient image elements are of small size and may not stand out strongly from the image background. The approach of [6, 7, 11] is similar to [5] in how it controls the grouping of pixels into patches based on image variation, but it uses Boruvka's MST construction algorithm instead of Kruskal's algorithm. Our framework belongs to the image segmentation approaches that produce irregular image pyramids. However, in contrast with the stated approaches, which start from regular or irregular pixel grids, we build an irregular hierarchy of image partitions starting from triangular and polygonal tessellations adapted to the image content. The hierarchy is built based on the spectral discontinuities detected in the original image. To produce initial triangular and polygonal tessellations of the image, we combine constrained Delaunay triangulation with the Gestalt principles of visual perception, such as proximity and closure, and exploit spatial relations between detected image edges. The extracted small polygonal patches, which are built up using this combination, are then iteratively grouped bottom-up using the polygons' pairwise spectral and spatial relations. The polygon agglomeration is based on Boruvka's MST algorithm.
This adaptive polygon-based hierarchic image segmentation distinguishes our method from the previous tree-based image segmentation approaches.
2 Polygon-Based Multi-scale Image Segmentation In a polygon-based pyramid, each level represents a polygonal tessellation of the image. The pyramid is built iteratively from the bottom up using only local interactions of the neighboring polygons. On the lowest level (l=0, fine level of detail) of the pyramid the polygons are constructed from an irregular triangular tessellation of the image; they are unions of triangles. On higher levels (l>0, coarser levels of detail) of the pyramid the polygons are unions of neighboring polygons on the lower, finer level (l-1). The polygons on level l of the pyramid are considered as the vertices of an undirected graph G_l. The edges of the graph describe the adjacency relations between the polygons on level l. Thus G_l = (V_l, E_l), where V_l is the set of vertices and E_l is the set of edges. The derivation of G_{l+1} from G_l is formulated as construction of an MST of G_l. The built pyramid P is described as a set of graphs G_l representing the image in a fine-to-coarse hierarchy. 2.1 Construction of Fine Level of Detail Polygons on the lowest level of a pyramid are built upon the triangular tessellation of the image. We employ the image vectorization approach [13, 14] to process the generated triangle grid. First, we detect edges in the image, e.g., using the Canny edge detector [4] (Figs. 1a, 1b). This is followed by constrained Delaunay triangulation (CDT) [16], where the detected edges are used as constraints for the triangulation (Fig. 1c). Thus, the CDT tessellation grid is adapted to the image content, since triangle vertices and edges reflect the structure and spatial adjacency of the detected edges, such as Canny edges. The CDT-generated triangular mesh is then processed by edge filtering. The filtering keeps constraints (such as detected Canny edges) and selectively deletes generated triangle edges. Triangle edge filtering is based on prespecified rules inspired by the principles of visual perception, such as proximity and closure. Proximity filters out triangle edges based on their length (Fig. 1e). As a result, detected edges that are spatially close to each other are linked by the kept triangle edges; otherwise, the detected edges remain separated. The closure rule is responsible for filtering out triangle edges which are bounded by the same detected edge (e.g., a "U" shape) or the same pair of detected edges (e.g., a "| |" configuration) (Fig. 1e). This results in a set of closed contours consisting of a combination of the generated triangle edges and spectrally detected edges. Finally, a graph traversal algorithm (e.g., depth-first search or breadth-first search) groups triangles within the constructed closed contours into polygons (Fig. 1d). These polygons are assigned a median color based on a sampling of pixels covered by the grouped triangles. Thus the image is segmented into a set of spectrally attributed polygonal patches (Fig. 2). The constructed polygon boundaries are built upon the detected edges, and thus reflect important spectral discontinuities in the image. This produces visually appealing results and reduces the amount of data (from the number of pixels to the number of generated polygons) by 20-80 times, depending on the image content. However, this algorithm [13, 14] does not produce a triangle grouping that can be directly utilized for object recognition or for interactive image segmentation; the image is still over-fragmented.
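As a concrete illustration of the final step of Sect. 2.1, the sketch below (not the authors' code; the data structures are hypothetical) groups triangles into polygons by breadth-first search. Two triangles are connected only if the edge they share was filtered out, i.e., it is neither a detected Canny edge nor a kept gap-closing triangle edge.

```python
from collections import deque, defaultdict

def group_triangles(num_triangles, shared_edges, kept_edges):
    """shared_edges: dict mapping an edge id -> (tri_a, tri_b);
    kept_edges: set of edge ids that survive the proximity/closure
    filtering (constraints and gap-closing triangle edges).
    Returns one polygon label per triangle."""
    adjacency = defaultdict(list)
    for edge_id, (a, b) in shared_edges.items():
        if edge_id not in kept_edges:      # crossing allowed only over deleted edges
            adjacency[a].append(b)
            adjacency[b].append(a)

    labels = [-1] * num_triangles
    current = 0
    for seed in range(num_triangles):
        if labels[seed] != -1:
            continue
        queue = deque([seed])
        labels[seed] = current
        while queue:                        # BFS floods one closed contour
            t = queue.popleft()
            for nb in adjacency[t]:
                if labels[nb] == -1:
                    labels[nb] = current
                    queue.append(nb)
        current += 1
    return labels

# toy example: four triangles in a strip; edge "e2" is a kept constraint edge
print(group_triangles(4, {"e0": (0, 1), "e1": (1, 2), "e2": (2, 3)}, {"e2"}))
```

The median color of each polygon would then be estimated from the pixels covered by its triangles, as described above.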
Fig. 1. Construction of a polygon-based image representation. We show a sequence of steps, which produce image segmentation on fine level of detail. (a) an original image. (b) edge detection using Canny edge detector. (c) generation of an irregular triangular mesh based on Constrained Delaunay triangulation (Canny edges are shown in black color, triangle edges are shown in gray color). (d) polygon creation by filtering the triangle edges and creating closed contours (contours are shown in black color, deleted triangle edges are shown in gray color). (e) an example of triangle edge filtering. Canny edges are shown in black color, deleted triangle edges are shown in light gray color, and the kept triangle edges (closing the gaps between Canny edges) are shown in darker gray color.
2.2 Pyramid Construction While the fine level-of-detail polygon-based image representation is over-fragmented, building larger polygons on top of it has the following distinct advantage: the agglomeration of polygons will be implicitly directed, in the sense that the boundaries of agglomerated polygons will also be authentic to the image spectral discontinuities. The latter distinguishes our approach from other approaches, where the selection of good seed pixels or pixel patches is challenging; it is challenging because a pixel itself does not carry any object-oriented information.
Fig. 2. The result of segmentation of the image shown in Fig. 1(a): (a) Contours of the created polygons are superimposed on the original image and shown in gray. (b) Created polygons on a fine level-of-detail are shown with their colors estimated during the segmentation process.
Once the polygon-based image representation on the lowest level of a pyramid is produced, we iteratively group polygons that share contour fragments on level l into larger polygonal chunks, producing level (l+1) of the image pyramid. Polygon agglomeration is based on Boruvka's algorithm for MST construction. Boruvka's algorithm proceeds in a sequence of stages; in each stage it identifies a forest F consisting of the minimum-weight edge incident to each vertex in the graph G, then forms the graph G' = G\F as the input to the next stage, where G\F denotes the graph derived from G by contracting the edges in F. Boruvka's algorithm takes O(E log V) time, where E is the number of edges and V is the number of vertices. The overall quality of segmentation depends on the pairwise polygon adjacency matrix, containing E_l. The attributes of edges are defined using two features: the color difference, ΔC_ij, and the strength of the contour segment separating the polygons, P_w. We evaluate the affinity w_ij between two neighboring polygons i and j:
w_{ij} = k \cdot \Delta C_{ij} \cdot \exp\!\left(\frac{P_w}{\sigma}\right)    (1)

P_w = \sum_{k=1}^{N} \frac{s_k \cdot m_k}{S}    (2)
Fig. 3. An example of the contour fragment shared by two neighboring polygons A and B. The shared contour fragment consists of five edge fragments of corresponding lengths s1 through s5.
Fig. 4. Result of the multi-scale segmentation of the image shown in Fig. 1(a). Contours of the created polygons are shown in white color. Created polygons are shown with their estimated colors. (a) level of detail # 4, 39 polygons. (b) level of detail # 6, 22 polygons.
where the contour segment shared by two neighboring polygons consists of N edge fragments (Fig. 3), S is the length of the shared contour segment, s_k is the length of the k-th shared edge fragment belonging to the contour fragment, k and σ control the scale of polygon similarity, and m_k is the magnitude of the k-th shared edge fragment. m_k is 0 for a triangle edge and non-zero for a spectrally detected edge (such as a Canny edge). Thus the cost of merging two polygons separated only by triangle edges is less than the cost of merging polygons separated by spectrally detected edges. The algorithm constructs level (l+1) of a pyramid, containing a coarser image partitioning, by running one Boruvka stage on level l using the evaluated color and contour relations between the polygons. Once a coarser level is constructed, the color characterization of the agglomerated polygons on level (l+1) is evaluated:
C_i^{l+1} = \sum_{k=1}^{M} \frac{A_k^l \cdot C_k^l}{A_i^{l+1}}    (3)

where M is the number of polygons merged into polygon i on level (l+1), C_i^l is the color of a polygon i on level l, and A_i^l is the area of a polygon i on level l.
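To make the role of Eqs. (1)-(2) concrete, the following sketch (hypothetical helper functions, not the authors' implementation) evaluates the affinity between two neighboring polygons from their color difference and the fragments of the contour they share; the defaults k = σ = 0.5 follow the values reported in Sect. 3.

```python
import math

def shared_contour_strength(fragments):
    """Eq. (2). fragments: list of (s_k, m_k) pairs, where m_k is 0 for a
    triangle edge and non-zero for a spectrally detected (e.g. Canny) edge."""
    S = sum(s for s, _ in fragments)
    return sum(s * m for s, m in fragments) / S

def affinity(delta_c, fragments, k=0.5, sigma=0.5):
    """Eq. (1): merging cost between polygons i and j, given their color
    difference delta_c and the shared contour fragments."""
    Pw = shared_contour_strength(fragments)
    return k * delta_c * math.exp(Pw / sigma)

# a boundary made only of triangle edges is cheaper to merge across than
# one made of detected edges of the same total length
print(affinity(10.0, [(3.0, 0.0), (2.0, 0.0)]))
print(affinity(10.0, [(3.0, 1.0), (2.0, 1.0)]))
```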
The spatial layout of the polygons changes as well. As a result, the adjacency matrix corresponding to level (l+1) is re-evaluated. This generation of a coarser level from the finer level of the pyramid proceeds iteratively (Fig. 4) while the spectral dissimilarity between polygons is less than a predefined dissimilarity threshold.
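The sketch below (an illustrative implementation, not the authors' C++ code) shows one Boruvka contraction stage as used to build level (l+1) from level l: every polygon selects its minimum-affinity neighbor, subject to the dissimilarity threshold, and the selected edges are contracted with a union-find structure.

```python
def boruvka_stage(num_polygons, edges, threshold):
    """edges: list of (w_ij, i, j) with w_ij the affinity of Eq. (1).
    Returns a new label per polygon after one contraction stage."""
    parent = list(range(num_polygons))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    cheapest = {}
    for w, i, j in edges:
        if w >= threshold:
            continue                        # do not merge across strong boundaries
        for v in (i, j):
            if v not in cheapest or w < cheapest[v][0]:
                cheapest[v] = (w, i, j)

    for w, i, j in cheapest.values():       # contract the selected forest F
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    roots = sorted({find(v) for v in range(num_polygons)})
    relabel = {r: n for n, r in enumerate(roots)}
    return [relabel[find(v)] for v in range(num_polygons)]

# four polygons in a chain: the middle edge is too dissimilar to merge
print(boruvka_stage(4, [(1.0, 0, 1), (5.0, 1, 2), (0.5, 2, 3)], threshold=2.0))
```

After contraction, the color of each new polygon would be updated with the area-weighted average of Eq. (3) and the adjacency matrix rebuilt for the next stage.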
3 Experimental Results For experiments we have used the Berkeley segmentation dataset1. Figures 5 and 6 show the results of image segmentation based on our method. For comparison we have included the results produced by normalized cuts (NC) based segmentation to highlight the differences due to the use of global (NC) and local (MST) based approaches. We used publicly available normalized cuts software2. We have also
1 http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
2 http://www.cis.upenn.edu/~jshi/software/
Fig. 5. Segmentation results (image sizes 321×481 and 481×321; NC: 24 and 12 regions; our approach: 125 polygons at the 8th level of detail in 2.13 secs and 55 polygons at the 10th level of detail in 2.5 secs). 1st row: input images (originally color images). 2nd row: results of the normalized cuts segmentation. 3rd row: results of our approach. 4th row: human sketches. The MST-based segmentation was prototyped in C++ and the Standard Template Library and ran on a 2.60 GHz machine operating under 64-bit Windows Vista. The shown time is the total segmentation time that includes extraction of the fine LOD and building an irregular image pyramid on top of it. Because the NC-based segmentation software is implemented in Matlab, we do not show its running time.
Fig. 6. Segmentation results (cont.) (image sizes 481×321 and 481×321; NC: 12 and 24 regions; our approach: 15 polygons at the 6th level of detail in 1.8 secs and 27 polygons at the 9th level of detail in 1.9 secs). 1st row: input images (originally color images). 2nd row: results of the normalized cuts segmentation. 3rd row: results of our approach. 4th row: human sketches.
included an integrated representation of human sketches as a reference, since different subjects produced different segmentations for the same image. It can be argued that to produce a meaningful segmentation it is necessary to utilize object-based knowledge in addition to bottom-up information agglomeration. The necessity for an object-oriented knowledge component is obvious when looking at the human sketches of the multi-story building and marina in Figure 5. Though in these two examples the exploitation of multi-scale texture analysis could improve the quality of segmentation, in many others more elaborate analysis would be required. In this paper we focus on the bottom-up approach to image segmentation. The fine level of detail of a pyramid is constructed using Canny edge detection with σ_Canny = 1.0, hysteresis low threshold = 2.5, and hysteresis high threshold = 5. We
use the Triangle code3 to generate the triangular tessellation over the detected edge map. Color images were processed using either RGB space or by converting them into gray-scale or HSI representations. The results are shown for RGB space. The global threshold was set to 30. The scale parameters k and σ for adjacency relations were both set to 0.5. On average the produced hierarchies contained 6-11 levels of detail. For some of the images NC produced five times fewer regions than our approach. This is mostly due to the exploitation of texture, which is a component of the NC approach but is not yet included in our framework. We have also experimented with a generalization of our approach by investigating a convexity measure of the groupings of neighboring polygons to give preference to the construction of more convex objects; however, it did not produce consistent results at the current stage.
4 Conclusions We have introduced a polygon-based method to construct a fine-to-coarse hierarchy of irregular image partitions. Experimental results support the validity of the proposed method. It uses spectral and contour relations between polygons as the criteria for their agglomeration. The computational complexity of the algorithm makes it possible to use it for processing large images. Our method is different from other tree-based image segmentation approaches because it builds an irregular hierarchy of image partitions based on triangular and polygonal tessellations adapted to the image content. Additional research is required to generalize the method to process texture. The problem to address is the evaluation of local texture properties that could be associated with polygons. Another extension will be the integration of the proposed method with top-down analysis to cue analysis on polygonal patches meeting a prespecified set of criteria (e.g., on shape).
References 1. Bhandarkar, S.M., Zeng, X.: Evolutionary approaches to figure-ground separation. Applied Intelligence 11, 187–212 (1999) 2. Boyer, K.L., Sarkar, S. (eds.): Perceptual Organization for Artificial Vision Systems. Kluwer Acad. Publ., Dordrecht (2000) 3. Boykov, Y., Veksler, O., Zabin, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. On Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001) 4. Canny, J.: A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence 8(6), 679–698 (1986) 5. Felzenszwalb, P.F., Huttenlocher, D.P.: Image segmentation using local variation. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 98–104 (1998) 6. Haxhimusa, Y., Kropatsch, W.G.: Segmentation graph hierarchies. In: Fred, A., Caelli, T.M., Duin, R.P.W., Campilho, A.C., de Ridder, D. (eds.) SSPR&SPR 2004. LNCS, vol. 3138, pp. 343–351. Springer, Heidelberg (2004) 7. Haxhimusa, Y., Ion, A., Kropatsch, W.G.: Comparing hierarchies of segmentations: Humans, normalized cut, and minimum spanning tree. In: Digital Imaging and Pattern Recognition, pp. 95–103 (2006) 3
http://www.cs.cmu.edu/~quake/triangle.html
8. Herault, L., Horaud, R.: Figure-ground discrimination: a combinatorial optimization approach. IEEE Trans. On Pattern Analysis and Machine Intelligence 15(9), 899–914 (1993) 9. Horowitz, S.L., Pavlidis, T.: Picture segmentation by a tree traversal algorithm. Journal of the Association for Computing Machinery 23(2), 368–388 (1976) 10. Jolion, J.-M., Montanvert, A.: The adaptive pyramid, a framework for 2D image analysis. CVGIP: Image Understanding 55(3), 339–348 (1992) 11. Kropatsch, W.G., Haxhimusa, Y., Ion, A.: Multiresolution image segmentation in graph pyramids. In: Kandel, A., Bunke, H.H., Last, M. (eds.) Applied Graph Theory in Computer Vision and Pattern Recognition, vol. 52, pp. 3–42 (2007) 12. Montanvert, A., Meer, P., Rosenfeld, A.: Hierarchical image analysis using irregular tesselations. IEEE Trans. on Pattern Analysis and Machine Intelligence 13(4), 307–316 (1991) 13. Prasad, L., Skourikhine, A.N.: Vectorized image segmentation via trixel agglomeration. Pattern Recognition 39(4), 501–514 (2006) 14. Prasad, L., Skourikhine, A.N.: Vectorized image segmentation via trixel agglomeration. Pattern Recognition, U.S. Patent No. 7127104 (2006) 15. Sarkar, S., Soundararajan, P.: Supervised learning of large perceptual organization: graph spectral partitioning and learning automata. IEEE Trans. On Pattern Analysis and Machine Intelligence 22(5), 504–525 (2000) 16. Shewchuk, J.R.: Triangle: engineering a 2D quality mesh generator and Delaunay triangulator. LNCS, vol. 1148, pp. 203–222. Springer, Heidelberg (1996) 17. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000) 18. Wertheimer, M.: Principles of perceptual organization. In: Beardslee, D., Wertheimer, M. (eds.) Readings in Perception. Van Nostrand, D. Princeton, NJ, pp. 115–135 (1958) 19. Xu, Y., Uberbacher, E.C.: 2D image segmentation using minimum spanning trees. Image and Vision Computing 15, 47–57 (1997) 20. Yu, S.X.: Segmentation using multiscale cues. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 247–254 (2004) 21. Zahn, C.T.: Graph-theoretic methods for detecting and describing gestalt clusters. IEEE Transactions on Computers 20(1), 68–86 (1971)
Improved Adaptive Spatial Information Clustering for Image Segmentation Zhi Min Wang, Qing Song, Yeng Chai Soh, and Kang Sim Department of Electrical and Electronic Engineering Nanyang Technological University, Singapore {wang0062,eqsong,eycsoh}@ntu.edu.sg Institute of Mental Health/Woodbridge Hospital, Singapore kang
[email protected]
Abstract. In this paper, we propose a different framework for incorporating spatial information, called the improved adaptive spatial information clustering (IASIC) algorithm, with the aim of achieving robust and accurate segmentation in the case of mixed noise without using experimentally set parameters. The proposed objective function has a new dissimilarity measure, and the weighting factor for the neighborhood effect is fully adaptive to the image content. It enhances the smoothness towards piecewise-homogeneous segmentation and reduces the edge-blurring effect. Furthermore, a unique characteristic of the new information segmentation algorithm is that it has the capability to eliminate outliers at different stages of the IASIC algorithm. This results in improved segmentation by identifying and relabeling the outliers in a relatively strong noisy environment. The experimental results with both synthetic and real images demonstrate that the proposed method is effective and robust to mixed noise and that the algorithm outperforms other popular spatial clustering variants.
1 Introduction
Segmentation is an essential preprocessing step for many applications in the field of image processing. Its main goal is to divide an image into parts that have a strong correlation with objects or areas of the real world contained in the image. Clustering-based image segmentation methods have been widely used in many applications [1,2,3,4,5]. However, pixel-based clustering algorithms rely only on the intensity distribution of the pixels and disregard their geometric information, making them very sensitive to noise and other artifacts introduced during the imaging process. Recently, the issue of incorporating spatial smoothness into clustering techniques has attracted much interest, and several successes have been reported [4,5,6,7,8]. A popular method to incorporate the spatial context is to modify the objective function. Ahmed et al. [6] modified the dissimilarity measure in a Bias-Corrected Fuzzy C-Means (BCFCM) algorithm for bias field estimation and segmentation, and achieved a better performance when an additional term is
added to the new objective function. The influence of the neighborhood effect can be adjusted with a constant α, without influencing the center pixel in the objective function. It is not effective if the center pixel is highly different from its neighbors, for example in the case of salt and pepper noise. Another variant of the FCM algorithm called the Robust Fuzzy C-means algorithm (RFCM) was proposed by Dzung [4]. A penalty term is added to the objective function to constrain the membership values. It favors configurations where the central pixel has a high membership value in one class while adjacent pixels have low membership values in other classes. Similar to other spatial models, this RFCM algorithm still suffers from the over-smoothed edges problem. Moreover, the cross-validation scheme for choosing the smoothing parameter β will not always find the appropriate β. Recently, Szilágyi et al. [7] proposed an Enhanced Fuzzy C-Means (EnFCM) clustering algorithm to segment the image more effectively while using the spatial information. However, this method still depends on a fixed spatial parameter α which needs to be adjusted. The shortcoming of using a fixed spatial parameter is evident. Another disadvantage emerging from this problem is that it will also blur image features like sharp edges while smoothing out noise. In order to overcome these problems, Cai et al. [8] proposed a Fast Generalized Fuzzy C-Means (FGFCM) algorithm by introducing a new localized similarity measure. This new similarity measure makes use of local and spatial intensity information, and so it performs better than the EnFCM algorithm. It has a lower misclassification rate by reducing the blurring effect. However, the need for an experimentally adjusted parameter and the blurring problems still exist in FGFCM. In this paper, we propose an Improved Adaptive Spatial Information-theoretic Clustering (IASIC) algorithm for image segmentation. The IASIC algorithm is an extension of the ASIC algorithm proposed in [9]. It is still a two-step image segmentation algorithm. In the mutual information (MI) minimization step, our adaptive spatial clustering algorithm is based on a 3×3 window and has a specific dissimilarity measure that incorporates the spatial information to eliminate low level outliers. In the MI maximization step, the remaining outliers (high level outliers after the spatial information clustering), i.e., the remaining unreliable pixel assignments, are identified from an information-theoretic point of view and reassigned based on an adaptive window. This feature provides IASIC with the capability to eliminate outliers in a relatively strong noisy environment as compared to other conventional algorithms with spatial variants.
2 Improved Adaptive Spatial Information Clustering (IASIC) Algorithm
In general, there are two types of misclassified pixels, called outliers, in image segmentation: (1) noise pixels caused by intensity shift, for example, salt and pepper noise. (2) border pixels, in other words, the data points with equal similarity distances to their adjacent cluster centers. The existence of outliers may shift the true cluster center into wrong positions or may cause difficulty to assign
310
Z.M. Wang et al.
the data label using correct cluster numbers [10]. We propose the IASIC algorithm with a two-step solution, i.e., the MI minimization and MI maximization, to eliminate the two kinds of outliers, respectively.
2.1 Improved Dissimilarity Measure and MI Minimization for Basic Clustering and Low Level Outlier Elimination
We propose a new dissimilarity measure d_s and express the distortion measure D_s as follows:

D_s = \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{K} p(x_{i,j}) \, p(w_k|x_{i,j}) \, d_s(x_{i,j}, w_k)    (1)

where

d_s(x_{i,j}, w_k) = \lambda_{i,j} \|x_{i,j} - w_k\|^2 + \frac{1-\lambda_{i,j}}{N_R} \sum_{x_{r,c} \in N_{i,j}} \|x_{r,c} - w_k\|^2    (2)
with the image patterns x_i ∈ X converted into an M × N matrix, i.e., x_{i,j} ∈ I^{M×N}. The location of each pixel is denoted by (i, j), where i ∈ [1, M] and j ∈ [1, N]. The parameter λ_{i,j} is a spatial factor which can be computed locally in advance. N_{i,j} is the subset of neighborhood pixels of (i, j) in a 3 × 3 window, and N_R is the window size. p(x_{i,j}) is a fixed arbitrary value in the MI minimization step. p(w_k|x_{i,j}) is the conditional probability of assigning observation x_{i,j} to center w_k. \|x_{i,j} − w_k\|^2 is usually represented by the Euclidean distance. In order to make the smoothing parameter {λ_{i,j}} adaptive to the image content, the value of {λ_{i,j}} should be adjusted according to the homogeneousness of the neighboring window. Considering a 3 × 3 window, for the pixel at location (i, j) we can calculate the dispersion of intensity differences between its intensity value and its neighbors':

d(x_{i,j}, x_{i+r,j+c}) = \|x_{i,j} - x_{i+r,j+c}\|^2    (3)

with r ∈ {−1, 0, 1} and c ∈ {−1, 0, 1}, but (r, c) ≠ (0, 0). Thus, the homogeneousness of this 3 × 3 neighboring window can be measured by the standard deviation of the intensity difference between the center pixel and its neighboring pixels, leading to

\eta_{i,j} = \left[ \frac{1}{N_R} \sum_{r=-1}^{1} \sum_{c=-1}^{1} \big( d(x_{i,j}, x_{i+r,j+c}) - \mu \big)^2 \right]^{1/2},
\qquad
\mu = \frac{1}{N_R} \sum_{r=-1}^{1} \sum_{c=-1}^{1} d(x_{i,j}, x_{i+r,j+c})    (4)
In order to take into account the magnitude of intensity transition in the edge region, we further divide η_{i,j} by the local variance of intensity of all the pixels over the 3 × 3 window, i.e.

\hat{\lambda}_{i,j} = \eta_{i,j} / \zeta_{i,j}    (5)

where \zeta_{i,j} = \left[ \frac{1}{9} \sum_{r=-1}^{1} \sum_{c=-1}^{1} (x_{i+r,j+c} - \tau)^2 \right]^{1/2}, \quad \tau = \frac{1}{9} \sum_{r=-1}^{1} \sum_{c=-1}^{1} x_{i+r,j+c}.
Finally, we normalize \hat{\lambda}_{i,j} and obtain λ_{i,j}:

\lambda_{i,j} = \hat{\lambda}_{i,j} / \max(\hat{\lambda}_{i,j})    (6)
Obviously, this approach is totally adaptive to the local image content, and the factors can be precomputed before the clustering operation. There are no experimentally adjusted parameters in the whole process. Fig. 1 shows four examples of the factor {λ_{i,j}} under different window features. Example (i) shows the case when the center pixel is located along an image edge, and example (ii) is for a less inhomogeneous region. As can be seen from Fig. 1, both values of λ_{i,j} are large in order to preserve the fine details of the image. On the other hand, examples (iii) and (iv) show cases where the center pixel is in a homogeneous region. It is clearly shown in Fig. 1 that λ_{i,j} is very small in such situations.
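As an illustration of Eqs. (3)-(6), the sketch below (an assumption-laden toy, not the authors' code) computes the unnormalized factor for a single 3×3 window of scalar intensities; normalization by the image-wide maximum, Eq. (6), would be applied by the caller once every window has been processed.

```python
import numpy as np

def lambda_hat(window):
    """window: 3x3 array of intensities with the center pixel at [1, 1]."""
    center = window[1, 1]
    neighbors = np.delete(window.ravel(), 4)        # the 8 surrounding pixels
    d = (neighbors - center) ** 2                   # Eq. (3)
    mu = d.mean()                                   # Eq. (4), N_R = 8
    eta = np.sqrt(((d - mu) ** 2).mean())
    tau = window.mean()
    zeta = np.sqrt(((window - tau) ** 2).mean())    # local spread over the window
    return eta / zeta if zeta > 0 else 0.0          # Eq. (5)

edge_window = np.array([[10, 10, 200], [10, 10, 200], [10, 10, 200]], float)
flat_window = np.full((3, 3), 10.0)
print(lambda_hat(edge_window), lambda_hat(flat_window))  # large vs. zero
```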
Fig. 1. The adaptive spatial parameter {λi,j } under homogeneous regions and inhomogeneous regions
According to the rate-distortion theory [10],[11], any data compression scheme must preserve the average amount of information of the input data such that we can reproduce the input data from the compressed data with an average distortion that is less than or equal to some specified distortion \bar{D}_s (to be defined later). With this in mind, the objective function of the MI minimization of the IASIC algorithm can be defined as

R = \min_{\{p(w_k|x_{i,j})\} \in Q_{\bar{D}_s}} I(X; W)    (7)

where

Q_{\bar{D}_s} = \Big\{ p(w_k|x_{i,j}) \in R^K \times R^{M \times N} : \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{K} p(x_{i,j}) \, p(w_k|x_{i,j}) \, d_s(x_{i,j}, w_k) \le \bar{D}_s; \; \sum_{k} p(w_k|x_{i,j}) = 1, \; p(w_k|x_{i,j}) \ge 0 \Big\}    (8)
I(X; W) is the mutual information (MI) between {p(x_{i,j})} and {p(w_k|x_{i,j})}, which can be expressed as

I(X; W) = \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{K} p(x_{i,j}) \, p(w_k|x_{i,j}) \log \frac{p(w_k|x_{i,j})}{p(w_k)}    (9)
Minimization of R in (7) with respect to {p(w_k|x_{i,j}), w_k} would lead to an iterative scheme

p(w_k|x_{i,j}) = \frac{p(w_k) \, e^{-d_s(x_{i,j}, w_k)/T}}{\sum_{k=1}^{K} p(w_k) \, e^{-d_s(x_{i,j}, w_k)/T}}    (10)

w_k = \frac{1}{p(w_k)} \sum_{i=1}^{M} \sum_{j=1}^{N} p(x_{i,j}) \, p(w_k|x_{i,j}) \left[ \lambda_{i,j} x_{i,j} + \frac{1-\lambda_{i,j}}{N_R} \sum_{x_{r,c} \in N_{i,j}} x_{r,c} \right]    (11)
By iteratively minimizing (7) via (10) and (11), we can deal with the first type of outliers and cope with the problems of the conventional spatial algorithms by using the proposed new objective function.
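As a small illustration of the minimization step, the sketch below (not the authors' code) computes the membership update of Eq. (10) for one pixel, given the cluster priors p(w_k), the dissimilarities d_s(x_{i,j}, w_k) and the temperature T; the Gibbs form with a negative exponent is assumed here. The centers would then be re-estimated with the λ-weighted average of Eq. (11) over all pixels.

```python
import numpy as np

def membership(priors, distortions, T):
    """Eq. (10) for a single pixel: p(w_k | x_{i,j}) from priors p(w_k)
    and dissimilarities d_s(x_{i,j}, w_k)."""
    logits = np.log(priors) - np.asarray(distortions) / T
    logits -= logits.max()            # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# a pixel far closer to the second center receives most of its membership
print(membership(np.array([0.5, 0.5]), [4.0, 1.0], T=1.0))
```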
2.2 MI Maximization and Adaptive Window for High Level Outlier Elimination
In the last section, by minimizing the MI, we obtained the sub-optimal membership values {\bar{p}(w_k|x_{i,j})} without considering the high level outliers, when the probability distribution of the input data {p(x_{i,j})} is fixed. We now want to reassess the reliability of the input data under the obtained sub-optimal membership values {\bar{p}(w_k|x_{i,j})}. Some of the input data are more desirable than others when we know the sub-optimal membership values. For example, if an input data point is closer to its adjacent cluster center than the rest, this data point is more likely to belong to that cluster, i.e., it has high reliability. The MI maximization method in information theory provides us a way to evaluate the reliability of the input data. It can be formulated by maximizing the MI against the input probability distribution p(x_{i,j}):

C = \max_{\{p(x_{i,j})\}} I(X; W) = \max_{\{p(x_{i,j})\}} \sum_{i=1}^{M} \sum_{j=1}^{N} \sum_{k=1}^{K} p(x_{i,j}) \, \bar{p}(w_k|x_{i,j}) \log \frac{p(x_{i,j}|w_k)}{p(x_{i,j})}    (12)

where C represents the channel (cluster) capacity. According to information theory [11],[10], for fixed \bar{p}(w_k|x_{i,j}), the MI in (12) is maximized by

p(x_{i,j}) = \frac{\exp\!\big( \sum_{k=1}^{K} \bar{p}(w_k|x_{i,j}) \log p(x_{i,j}|w_k) \big)}{\sum_{i=1}^{M} \sum_{j=1}^{N} \exp\!\big( \sum_{k=1}^{K} \bar{p}(w_k|x_{i,j}) \log p(x_{i,j}|w_k) \big)}    (13)

where

p(x_{i,j}|w_k) = \frac{p(x_{i,j}) \, \bar{p}(w_k|x_{i,j})}{\sum_{i=1}^{M} \sum_{j=1}^{N} p(x_{i,j}) \, \bar{p}(w_k|x_{i,j})}
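A sketch of the maximization update (not the authors' code; array shapes and the zero test are assumptions of this toy) is given below: the fixed memberships p_bar (one row per pixel, one column per cluster) and the current p(x) are used to re-estimate p(x | w_k) and then p(x) via Eq. (13), and pixels whose updated p(x) is numerically zero are flagged as high level outliers, as discussed next.

```python
import numpy as np

def mi_maximization_step(p_x, p_bar, eps=1e-12):
    """p_x: current p(x_{i,j}) flattened over pixels; p_bar: fixed memberships."""
    joint = p_x[:, None] * p_bar                       # p(x) * p_bar(w | x)
    p_x_given_w = joint / joint.sum(axis=0, keepdims=True)
    log_term = (p_bar * np.log(p_x_given_w + eps)).sum(axis=1)
    unnorm = np.exp(log_term - log_term.max())         # stabilized exponent
    p_x_new = unnorm / unnorm.sum()
    outliers = p_x_new < eps                           # (numerically) zero entries
    return p_x_new, outliers

p_bar = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
p_x = np.full(3, 1 / 3)
print(mi_maximization_step(p_x, p_bar))
```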
For any value of (i, j), if p(x_{i,j}) in (13) has a zero value, the corresponding input data point x_{i,j} has no contribution to the MI maximization and
should be removed from the sum in the MI maximization procedure and treated as an outlier (unreliable data point). This forms the basis of our IASIC algorithm for high level outlier elimination. Note also that these unreliable data points are not to be deleted; they are labeled as outliers and are going to be clustered again with a bigger neighboring window. In this paper, we introduce a simple and fast reassignment method to label the ambiguous pixels for the high level outlier elimination. This method is based on the concept that the center pixel should be assigned to the cluster that most of its neighbors belong to. This has been widely used in a number of existing spatial clustering algorithms [12,6,7,5]. The only difference is that we do it within an adaptive window for the reassignment of outliers based on the MI maximization. The implementation of high level outlier elimination and reassignment is given as follows (a small sketch of this procedure is shown after the list):
1. Consider an initial 3×3 window centered at the outlier and obtain the clustering labels for all neighboring pixels within this neighborhood window from the minimization step.
2. Obtain all label counts for each cluster, and store the count values into ξ_k, k = 1, 2, ..., K.
3. Find the maximum value of ξ_k and its corresponding cluster index m, i.e., ξ_m = max(ξ_k).
4. If ξ_1 = ξ_2 = ... = ξ_K, increase the current window edge by 2 more pixels per iteration (for example, from 3 × 3 to 5 × 5, or even larger if there are still equal values) and go to 1); otherwise, update the outlier label with m.
5. Iterate over all outlier pixels until there is no further change, then stop.
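A minimal sketch of this adaptive-window relabeling (an illustration under simplifying assumptions: scalar labels in a 2-D list, image borders handled by clipping; not the authors' code):

```python
from collections import Counter

def relabel_outlier(labels, i, j):
    """Majority vote among the neighbors of outlier pixel (i, j); the window
    grows from 3x3 to 5x5 and beyond whenever the vote is tied."""
    rows, cols = len(labels), len(labels[0])
    half = 1
    while True:
        votes = Counter()
        for r in range(max(0, i - half), min(rows, i + half + 1)):
            for c in range(max(0, j - half), min(cols, j + half + 1)):
                if (r, c) != (i, j):
                    votes[labels[r][c]] += 1
        ranked = votes.most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0]             # clear majority: adopt that label
        if half > max(rows, cols):
            return ranked[0][0]             # safety stop once the window covers the image
        half += 1                           # tie: enlarge the window and retry

grid = [[0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 1, 2, 1, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0]]
print(relabel_outlier(grid, 2, 2))   # the 3x3 vote ties 4-4, the 5x5 vote resolves to 0
```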
3 Experimental Results
In this section, we will examine the performance of our algorithm through a variety of simulated images and real images, and compare it with the other three algorithms, BCFCM, EnFCM, and FGFCM. In all examples, the cooling factor α = 0.95 was used unless otherwise stated. The fuzziness control parameter q has been set to 2, and N_R = 8 (a 3 × 3 neighbor window is centered around each pixel). The spatial parameter α in the BCFCM algorithm is obtained by searching the interval [0.5, 0.9] with a step size of 0.05 in terms of the optimal SA value. Similarly, the α in EnFCM is obtained by searching the interval [0.2, 8] as suggested in [8]. The λ_g in FGFCM is selected from 0.5 to 6 with an increment of 0.5.
3.1 Simulated Images
We use the synthetic test image as shown in Fig. 2(a), and the resolution is 256 × 256 pixels and the grey-scale range is 0 to 255. To validate the segmentation results, we use the segmentation accuracy (SA) which is computed by

SA = \frac{\text{number of misclassified pixels}}{\text{total number of pixels}}    (14)
Fig. 2. Synthetic Image. (a) Original image. (b) BCFCM algorithm. (c) EnFCM algorithm. (d) FGFCM algorithm. (e) IASIC algorithm.

Table 1. SA values on synthetic images for BCFCM, EnFCM, FGFCM and IASIC methods

Noise levels                       BCFCM    EnFCM    FGFCM    IASIC
salt&pepper 3%                     2.38%    2.37%    0.59%    0.34%
salt&pepper 5%                     3.93%    3.93%    0.98%    0.50%
salt&pepper 7%                     5.49%    5.49%    1.37%    0.69%
salt&pepper 15%                    11.45%   11.45%   4.40%    1.99%
Gaussian 4%                        0.16%    0.16%    0.02%    0.13%
Gaussian 8%                        0.69%    0.67%    0.05%    0.17%
Gaussian 12%                       2.08%    1.98%    0.53%    0.24%
Gaussian 16%                       4.43%    3.02%    1.76%    0.40%
salt&pepper 5% + Gaussian 4%       4.01%    3.99%    0.72%    0.52%
salt&pepper 5% + Gaussian 12%      6.27%    6.20%    1.17%    0.77%
Segmentation Results on a Synthetic Image Corrupted by Mixed Noise. Fig. 2 (e) shows the segmentation result obtained by using our algorithm. Compared to the clustering results of the BCFCM algorithm in Fig. 2 (b) and the EnFCM algorithm in Fig. 2 (c), the superiority of our IASIC algorithm is apparent. More detailed comparisons for other types of noise and the edge blurring effect will be given in the next subsections. Quantified Comparison of SA Results on Synthetic Images of Different Noise Types and Levels. Table 1 shows the SA values obtained when applying the four different algorithms to Fig. 2 for different levels and types of noise. Three types of noise are used in this experiment: salt & pepper noise, Gaussian noise, and mixed noise. From Table 1, we can see that our proposed algorithm has the lowest misclassification rates in most cases. When the noise level is low, for example 4% and 8% Gaussian noise, the FGFCM algorithm shows slightly better performance than our algorithm. However, under mixed noise and medium-to-high noise levels, our IASIC algorithm is much more robust to the presence of noise as compared to the other algorithms, and it is easier to implement in real applications because there is no experimentally adjusted parameter. Outlier Identification by MI Maximization with High Level Outlier Elimination. It is interesting to illustrate how the robust estimation step plays a key role in our IASIC algorithm.
Fig. 3. High level outlier elimination. (a) original image; (b) outlier identified by MI maximization; (c) outlier pixels which have been corrected during the high level outlier elimination process; (d) Segmentation result by step1 of IASIC algorithm; (e) Segmentation result by IASIC algorithm.
Fig. 4. Example B. (a) Original image. (b) Noisy image corrupted by 3% salt & pepper + 4% Gaussian mixed noise. (c) Segmentation results using BCFCM. (d) Segmentation result using EnFCM. (e) Segmentation result using FGFCM. (f) Segmentation result using IASIC.
A synthetic image with mixed noise (5% salt & pepper + 4% Gaussian) is shown in Fig. 3 (a). With the IASIC algorithm, the outlier pixels are obtained and shown in Fig. 3 (b). The result is as expected from the description in subsection 2.1. Besides the pixels along the borders, some other pixels located exactly at or around the contaminated positions have been detected as outliers. By relabeling these outliers with a bigger neighborhood window, the segmentation result in Fig. 3 (e) visually shows a better performance as compared to the one shown in Fig. 3 (d). In order to appropriately illustrate the effectiveness of our high level outlier elimination method, we plot the corrected outlier pixels in Fig. 3 (c). From the figure, we find that not only the outlier pixels in homogeneous regions have
been corrected, but also the ones along the borders and edges. This experimentally shows that this simple but fast high level outlier elimination method is efficient for our purpose.
3.2 Real Images
In this section, we will examine the performance of our algorithm on real images. In order to investigate the performance (in terms of noise removal and edge blurring effect handling) of the four segmentation algorithms, mixed noise (3% salt & pepper + 4% Gaussian) is artificially added to these real images unless specified otherwise. The original image is shown in Fig. 4 (a). The resolution is 512 × 512 and the gray scale range is 0 to 255. The contaminated image is shown in Fig. 4 (b). For the class number, we choose c = 3 because it produces a better representation of the image. From Figs. 4 (c)-(f), we can see that our adaptive algorithm gives a much better segmentation, even though a small portion of the wing of the small plane is not well recovered. The mixed noise effect has been greatly reduced by our IASIC algorithm.
4 Conclusion
We have presented a new adaptive spatial robust information clustering algorithm for automatic segmentation of images corrupted with noise. The objective function was defined by incorporating the spatial information to allow the clustering of a pixel to be influenced by its neighborhood. Our algorithm is effective in that it is adaptive to the image content, favoring a piecewise-homogeneous labeling without affecting the edges, thus preserving most of the details of the image. The amount of spatial constraint is controlled by the image itself without the need for user intervention. Furthermore, the two-step approach of IASIC provides the capability to eliminate outliers in a relatively noisy environment. A comparison of our IASIC algorithm with other variants of spatial clustering algorithms, like the BCFCM, EnFCM and FGFCM algorithms, illustrates that while these algorithms are effective in eliminating noise or small isolated regions, our algorithm has a better performance in terms of preserving the details and reducing the edge blurring artifact in a relatively noisy environment.
References 1. Pappas, T.: An adaptive clustering algorithm for image segmentation. IEEE Trans. Signal Processing 40, 901–914 (1992) 2. Clarke, L.P., Velthuizen, R.P., Camacho, M.A., Heine, J.J., Vaidyanathan, M., Hall, L.O., Thatcher, R.W., Silbiger, M.L.: MRI segmentation: Methods and applications. Magn. Res. Imag. 13, 343–368 (1995) 3. Caillol, H., Pieczynski, W., Hillion, A.: Estimation of fuzzy gaussian mixture and unsupervised statistical image segmentation. IEEE Trans. Image Processing 6, 425– 440 (1997)
4. Pham, D.L.: Spatial models for fuzzy clustering. Computer Vision and Image Understanding 84, 285–297 (2001) 5. Makrogiannis, S., Economou, G., Fotopoulos, S.: A region dissimilarity relation that combines feature-space and spatial information for color image segmentation. IEEE Trans. Syst. Man, Cybern. B 6. Ahmed, M.N., Yamany, S.M., Mohamed, N., Farag, A.A., Moriarty, T.: A modified fuzzy c-means algorithm for bias field estimation and segmentation of MRI data. IEEE Trans. Med. Imag. 21, 193–199 (2002) 7. Szilágyi, L., Benyó, Z., Szilágyi, S.M., Adam, H.S.: MR brain image segmentation using an enhanced fuzzy C-means algorithm. In: Proc. of the 25th Annual International Conference of the IEEE EMBS, Cancun, Mexico. Engineering in Medicine and Biology Society, pp. 724–726 (2003) 8. Cai, W., Chen, S., Zhang, D.: Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recognition 40, 835–838 (2007) 9. Wang, Z.M., Song, Q., Soh, Y.C., Yang, X.L., Sim, K.: Adaptive spatial information clustering for image segmentation. In: International Joint Conference on Neural Networks (2006) 10. Song, Q.: A robust information clustering algorithm. Neural Computation 17, 2672–2698 (2005) 11. Blahut, R.E.: Principle and Practice of Information Theory. Addison-Wesley, Reading (1988) 12. Liew, A.W.C., Hong, Y.: An adaptive spatial fuzzy clustering algorithm for 3D MR image segmentation. IEEE Trans. Med. Imag. 22, 1063–1075 (2003)
Stable Image Descriptions Using Gestalt Principles Yi-Zhe Song and Peter M. Hall Media Technology Research Centre Department of Computer Science University of Bath
Abstract. This paper addresses the problem of grouping image primitives; its principal contribution is an explicit definition of the Gestalt principle of Prägnanz, which organizes primitives into descriptions of images that are both simple and stable. Our definition of Prägnanz assumes just two things: that a vector of free variables controls some general grouping algorithm, and a scalar function measures the information in a grouping. Stable descriptions exist where the gradient of the function is zero, and these can be ordered by information content (simplicity) to create a "grouping" or "Gestalt" scale description. We provide a simple measure for information in a grouping based on its structure alone, leaving our grouper free to exploit other Gestalt principles as we see fit. We demonstrate the value of our definition of Prägnanz on several real-world images.
1 Introduction
Partitioning an image into meaningful structures is a central problem of Computer Vision. The "top-down" solutions fit a model of some kind; they can perform well but restrict image content. In contrast, "bottom-up" solutions aggregate image primitives; they impose less restriction on content but performance can be questionable. The essential problem with bottom-up methods is deciding which groupings to use from amongst a vast possible number. This paper addresses the problem of bottom-up grouping by appeal to Gestalt principles. In the early 1920s, psychologists proposed that Gestalt principles play an important role in human perceptual organization, including proximity, continuity, similarity, closure, and symmetry; later common-region and connectedness were added [1,2,3]. We call these "simple" principles because they act on a few primitives at any one time. Many of these principles have been used in the computational literature. Lowe [4] uses proximity; Carreira et al. use parallelism [5]. Others have sought to use more than one principle at once. Dolan and Weiss [6] use proximity and continuity, a pairing also used by Parent and Zucker [7], and also Feldman [8]. Elder and James [9] aimed for contour completion by studying the mutual relationships amongst proximity, continuity and similarity in the task of contour grouping, and concluded that proximity is the most important among
those studies. Despite this work, a full computational account of how Gestalt principles interact is yet to be given. Although the above work groups all primitives, and some of the methods operate hierarchically so can handle more than one primitive at once, it remains true that none of them make use of any principle of global organization. It is a common observation that context, by which we mean the presence (or absence) of structures in an image, affects the outcome of grouping. Therefore, we argue, some notion of global organization must be included in any account that seeks to form groupings, and such an account should be integral to the way in which Gestalt principles are combined. Prägnanz is the Gestalt principle that seeks to organize all primitives at once, and so acts in a "global" sense. Introduced by Wertheimer [3], Prägnanz was developed by Koffka [1], who advocated that "of several geometrically possible organizations that one will actually occur which possesses the best, simplest and most stable shape." Kanizsa [10] too suggested Prägnanz implies an orderly, rule-based, non-random and stable organization of primitives. In a computational setting, the notion of non-randomness influenced Marr [11], and was explicitly proposed as "common cause" (amongst other names) independently by Witkin and Tenenbaum [12], and Lowe [4]. Lowe argued that it is highly unlikely for organized image structures to occur by chance; hence they are salient. Yet common cause can be used without reference to any principle of global organization; for example, a local curve may be the common cause of a set of points: common cause is not Prägnanz. Techniques that aim to find best groupings in a global sense do exist. Guy and Medioni [13] combine proximity and continuation to find contours, which differs from those previously cited in that a global voting scheme was introduced, where each pixel gets votes from all other ones. Probably the most recognized global grouping technique, certainly one that gained considerable popularity, is the normalized cut technique from Shi and Malik [14]. They use normalized cut over an affinity matrix built from proximal distances and pixel similarities to generate an adjacency matrix, forests of which are groups. Normalized cut has also been used by Stella and Shi [15], who advocated the use of prior knowledge (such as the position of a foreground object, often input interactively for a particular image). Sarkar and Soundararajan [16] used a modified version of normalized cut, and applied it to the adjacency graphs obtained using a small set of Bayesian Networks, each of which corresponds to a certain Gestalt principle and is trained by a set of mutually competing automata. Despite the popularity of normalized cut, it does not provide an explicit definition of Prägnanz; rather, its advocates suggest the method adheres to that principle because it operates globally. Of the literature we read, only Ommer and Buhmann [17] explicitly define Prägnanz, in their case as the minimum of an entropy function based on the probability that pairs of primitives should be grouped. By minimizing entropy, these authors ensure simple groupings, as do the other global methods. But simplicity is just one half of Koffka's requirement that global organization should be both simple and stable; there is no guarantee that the organization is stable.
Fig. 1. A picture is processed into edges and areas. Lines are fitted to edges which reference back to areas. Our Gestalt grouper optimally connects lines into groups. In this and all Figures in this paper, groups are color coded; singleton groups are in grey. Please refer to the electronic version for best viewing results.
The contributions this paper makes are, in order of importance: 1. We are the first to explicitly define Prägnanz in a way that accounts for both the stability and the simplicity of groupings, and which integrates simpler Gestalt principles. 2. We uniquely allow multiple possible groupings over one image; these groupings naturally form a subset/superset structure that leads to the notion of "grouping scale". In addition, we provide a flexible system that is easy to experiment with, which is based on a rich description of images including both edge and area primitives. In the remainder of this paper, we describe our approach (Section 2), including our definition of Prägnanz in subsection 2.1. In the results, Section 3, we show our method produces multiple solutions that appeal to intuition. Finally, a conclusion and discussion of future work is offered in Section 4.
2 A Description of Our Grouper
The input to our grouping algorithm is a graph based description of an image in which a node is one of two different types; one type corresponds to straight line primitives over edges, the other image areas. The graph is such that given a line the areas it separates can be determined, and given an area its lines can be identified. The input graph is bipartite — nodes of the same type are not connected. The output of our grouper is a graph embellished with arcs that link nodes of the same type. In common with other literature [14,15,16] forests in the embellishment delimit groups, but unlike others we recognize there is no “absolute best” solution and so produce more than one plausible grouping.
The input graph is not formed using any Gestalt principle, but does aggregate pixels into convenient image primitives. Straight line primitives are fitted to an edge map obtained by thresholding an Elder-Zucker edge detector [18]; this requires monochrome images. Area primitives are connected pixel sets, determined by a color-image segmentation from Sinclair [19]. To appease any controversy regarding the grouping these processes enact, we have tuned both of them to operate conservatively: fitting produces many straight lines and the image is typically over-segmented. This graph is conveniently mid-level in the sense that its primitives are super-pixel, but it is nonetheless a weak description. It is the task of our grouper to strengthen this description. Figure 1 shows a typical example of the full process. As a note, we reason in favor of independently detected lines and areas on the grounds that area boundaries do not necessarily correspond to object edges, and that the presence of an edge does not necessarily imply an area boundary; see Figure 1. Independent line and area processes raise the information content of the input graph, and the competition (disagreement) and cooperation (agreement) between them is implicitly resolved by our grouper.
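To make the bipartite input description concrete, the toy sketch below (a hypothetical data layout, not the authors' implementation) stores, for each line, the identifiers of the areas it borders, and derives the reverse index so that both queries of Section 2 are cheap.

```python
from collections import defaultdict

# each line primitive lists the area primitives it separates (assumed input)
line_to_areas = {
    "line0": {"areaA", "areaB"},
    "line1": {"areaB", "areaC"},
    "line2": {"areaB", "areaC"},
}

# derive the reverse index: given an area, which lines border it?
area_to_lines = defaultdict(set)
for line, areas in line_to_areas.items():
    for area in areas:
        area_to_lines[area].add(line)

print(sorted(area_to_lines["areaB"]))                   # lines bordering areaB
print(line_to_areas["line1"] & line_to_areas["line2"])  # areas two lines share
```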
2.1 Prägnanz: Stable and Simple Groupings
We provide a computational definition of Prägnanz that takes into account both stability and simplicity, as required by Koffka [1]. Our definition is sufficiently general to be used as a controlling principle for any grouping algorithm whose output depends upon a set of control variables, not just our own grouper. Let P be a set of image primitives, straight line segments in our case, and let Q be a partition of P; Prägnanz selects between the many possible partitions. Suppose f(Q) is a scalar-valued function that measures some property of a given partition, typically information content. In practice, any partitioning will depend on a vector of free variables, x ∈ R^k, threshold variables for example, hence f(.) also depends upon x. We define the stability of a partition as the magnitude of the gradient of f(.) with respect to the control vector:

s(Q(x)) = \left[ \sum_{i=1}^{k} \left( \frac{\partial f(Q(x))}{\partial x_i} \right)^2 \right]^{1/2}
We define a partition to be stable if s(Q(x)) = 0. In practice, the discrete nature of the control variables means zeros are rarely observed, so we seek local minima of s(.). We define a partition to be (consistent with) Prägnanz if it is stable and can be justified by appeal to simple Gestalt principles. In the next subsection we will explain how we generate partitions, but here we continue discussing our general approach to Prägnanz. As mentioned, we wish f(.) to somehow measure the information content of a given partition. Any partition can be represented by a graph of nodes and arcs; forests in the graph delimit a particular group (see [15,14,16]). Of the many alternatives we have found the most satisfactory results are produced by a simple
analysis of these graphs; no reference to the "affinity" between primitive pairs is required. Let Mg be the maximum number of groups in any observed partition, and Ma be the maximum number of arcs in any observed partition. Let Ng and Na be the number of groups and arcs in a particular partition. We define f(.) as

f(Q(x)) = -G(x) \log \frac{A(x)}{G(x)}

in which G = Ng/Mg is the normalized number of groups, and A = Na/Ma is the normalized number of arcs. This function is monotonic but, importantly, contains saddle points corresponding to stable groups. Intuitively, f(.) relates to the number of binary digits required to encode the grouping, when compared to the most complex alternative. Also, the number of arcs in a group relates to group stability: a change in the control vector may add new arcs to a stable group, but may merge unstable groups. By construction we find groupings that are stable at local minima of s(.), which depends on the differential of f(.). But Koffka's definition [1] requires simplicity too. Because our measure f(.) is related to entropy it can be used for simplicity; we need only favor stable groupings with smaller f(.).
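The sketch below (an illustrative toy, not the authors' code) evaluates f over a discrete grid of control vectors and approximates s by finite differences, so that stable partitions show up as local minima of the gradient magnitude; the group and arc counts per control setting are assumed to come from some grouper.

```python
import math

def f_measure(num_groups, num_arcs, max_groups, max_arcs):
    """f(Q(x)) = -G log(A / G) with normalized group and arc counts."""
    G = num_groups / max_groups
    A = max(num_arcs, 1) / max_arcs          # avoid log of zero for empty graphs
    return -G * math.log(A / G)

def stability_grid(counts, max_groups, max_arcs, dx=1.0):
    """counts: 2-D grid of (num_groups, num_arcs) per control vector (x1, x2).
    Returns |grad f| at interior grid points via central differences."""
    F = [[f_measure(g, a, max_groups, max_arcs) for (g, a) in row] for row in counts]
    s = {}
    for i in range(1, len(F) - 1):
        for j in range(1, len(F[0]) - 1):
            d1 = (F[i + 1][j] - F[i - 1][j]) / (2 * dx)
            d2 = (F[i][j + 1] - F[i][j - 1]) / (2 * dx)
            s[(i, j)] = math.hypot(d1, d2)
    return s

# the central control setting sits on a plateau of f, so its gradient is zero
counts = [[(9, 8), (5, 10), (3, 12)],
          [(5, 10), (5, 10), (5, 10)],
          [(3, 12), (5, 10), (9, 8)]]
s = stability_grid(counts, max_groups=9, max_arcs=12)
print(min(s, key=s.get), min(s.values()))
```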
2.2 Grouping Image Primitives by Combining Gestalt Principles
To manufacture a partition of image primitives, we combine two simple Gestalt principles, proximity and common-region, each of which depends upon its own control parameter. We adopt a simple approach that is strongly influenced by the work of Feldman [8], who argued in favor of combining Gestalt principles via logical propositions. In our case, image primitives are line segments, and the proximal distance between any pair is just the smallest distance between their end points. For line segments i and j, we define a "proximity proposition": p(i, j|x1) = 1 if d(i, j) < x1, and 0 otherwise, where d(., .) is the smallest distance between the ends of the line segments and x1 is a threshold. Similarly, we define a "common-region proposition": c(i, j|x2) = 1 if r(i, j) > x2, and 0 otherwise, where r(i, j) counts the number of areas the line primitives have in common and x2 is a threshold. The number of common regions is readily determined by simply intersecting the lists of area identifiers associated with each line segment (recall our embellished image description in Section 2). The outcome of the combination determines the value in the adjacency matrix used to create a partition. A simple "or" combination was found to be effective: a(i, j|[x1, x2]) = p(i, j|x1) ∨ c(i, j|x2). The control vector x = [x1, x2] therefore determines the adjacency matrix, hence the partitioning (grouping).
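A compact sketch of this combination (illustrative only; the line-segment geometry and area lists are assumed inputs) builds the adjacency matrix from the two propositions:

```python
import math

def proximity(seg_i, seg_j, x1):
    """p(i, j | x1): 1 if the closest pair of endpoints is nearer than x1."""
    d = min(math.dist(a, b) for a in seg_i for b in seg_j)
    return 1 if d < x1 else 0

def common_region(areas_i, areas_j, x2):
    """c(i, j | x2): 1 if the segments share more than x2 areas."""
    return 1 if len(areas_i & areas_j) > x2 else 0

def adjacency(segments, areas, x):
    """a(i, j | [x1, x2]) = p OR c, for every pair of line segments."""
    x1, x2 = x
    n = len(segments)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = proximity(segments[i], segments[j], x1) or \
                                common_region(areas[i], areas[j], x2)
    return A

segments = [((0, 0), (1, 0)), ((1.2, 0), (2, 0)), ((5, 5), (6, 5))]
areas = [{"a1", "a2"}, {"a2", "a3"}, {"a4"}]
print(adjacency(segments, areas, x=(0.5, 0)))
```

Forests in the graph defined by this matrix delimit the groups; sweeping x over a grid and applying the stability test above yields the multiple groupings reported in Section 3.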
3 Experiments and Results
In this section, we demonstrate the value of our definition of Prägnanz using several real-world images. Please note that for all Figures in this section,
Fig. 2. Should we be interested in two distinct eagles (left), or one fighting pair (right)? Our grouper finds both interpretations
Fig. 3. Three groups for “musician”, Top-right, Grouping I, bottom-left Grouping II, bottom-right Grouping III, representing different grouping scales from “fine” to “coarse”. As scale rises, the number of groups increases and more primitives are aggregated together. Salient structures tend to persist across scales.
Fig. 4. Three groupings for “bus”, increasing in scale top-right to bottom-left. As with other examples, each grouping is plausible, albeit subject to some clutter, and salient structures persist over scale.
primitives of different groups have been uniquely color coded and singleton groups are drawn in gray lines of width 1. Due to space restrictions, only three representative groupings for each image are shown. The first test image is a simple image of two eagles fighting in the sky. The image is relatively simple in that the background is of a rather flat color. The main purpose of showing it is to demonstrate that our technique is capable of finding several plausible groupings; plausibility is a qualitative measure by which we mean the grouping can be readily understood and described by a human. In this particular case, humans tend to perceive either two individual eagles or one pair of fighting birds, the latter being at a more "coarse scale" than the former. Both these interpretations are found by our grouper, as shown in Figure 2. Figure 3 shows three groupings of "musician", ordered by simplicity. The first and most favored, Grouping I, is at the "finest" scale; windows are differentiated, for example. Grouping II is at the "middle" scale, at which the musician and trees become visible (subject to clutter). Grouping III is at the "coarse" scale, which
Stable Image Descriptions Using Gestalt Principles
325
Fig. 5. The “alpine” again shows groups as increasing scale, top-right over which structures merge in a plausible way, eventually relating to fore-ground, middle-ground, and back-ground. The hut is visible over all scales, and the car over two, again subject to clutter. The clutter may be removable by invocation of a Gestalt principle such as closure.
tends to discriminate yet larger objects. Interestingly, the musician grouping survives this scale change. If we made the scale large enough, then all primitives would be grouping into one. “Bus”, in Figure 4, again shows three stable groupings ordered in terms of their simplicity. As before Grouping I, the first stable grouping and the simplest of all, demonstrates local structures such as windows. In grouping II these groupings are given context as the aggregate into larger scale structures. In Grouping III, there are two large groups, one corresponds to the buildings, which can be treated as background, and another one corresponds to the bus in front, which is the foreground. Following the pattern of previous examples, grouping results in “alpine”, Figure 5, are ordered by simplicity. Our explanation follows the same pattern too: fine-scale structures are preferred by our grouper, then mid-scale structures,
326
Y.-Z. Song and P.M. Hall
finally coarse-scale. Across scales the house group barely changed. Clutter might be removed by appeal to the Gestalt principle of closure, for example: notice that much of the clutter is in the form of “spurs”, and this principle could be invoked to better differentiate the car from the road. Grouping III offers a more global grouping, comparing to the previous two, and plausibly renders the image into fore-ground, mid-ground, and back-ground.
4
Discussion and Conclusions
We have explicitly defined Pr¨ agnanz, for the first time, requiring groupings to be both stable and simple. Furthermore, our definition can be used as a controlling mechanism for any potential grouping algorithm, it is not restricting to our own. In addition, we introduced a rich underlying image description as a graph of nodes and arcs, with notes representing image primitives, in this case, line segments, edge pixels and areas, and used our grouper to embellish this description. In principle we could continue this trend toward a hierarchy in which a node at any “level” is semantically more meaningful that its ancestors. An important benefit of our definition of Pr¨ agnanz is that it yields more than one grouping, ordered by simplicity, and that interesting subset/superset structure can be observed in this ordering. In particular, the notion of scale is seen to play an important role, with fine scales corresponding to simpler image elements, while coarse scales can be associated with broad descriptive terms. Mid-range scales tend to correspond to semantic objects, whether this is an artefact of the scale at which they appear in images is an open question. The ordered groupings we produce and the relation between them might be employed by applications, matching for example. We believe that “grouping scale” or “Gestalt scale” deserves further investigation. Other immediate further direction would be incorporating more Gestalt principles, in the hope that they would “clean up” some of the clutter; such research would be of interest for it would have to address the issue of combining Gestalt principles for which there is no clear solution at present. We conclude that: (i) Koffka’s observation on Pr¨ agnanz is a useful one, to which our definition adheres. (ii) Groupers should be able to return more than one solution at different scales; investigating Gestalt scale is likely to prove rewarding, as might investigations into biasing the grouper with a prior to favor a description at a specific scale. (iii) Proximity and common-region produce clutter which might be reduced using other simple Gestalt principles in addition.
References 1. Koffka, K.: Principles of Gestalt Psychology. Harcourt, New York (1935) 2. Kohler, W.: Gestalt Psychology. Liveright, New York (1929) 3. Wertheimer, M.: Laws of Organization in Perceptual Forms, vol. 4. Psycologische Forschung (1923) Translation published in Ellis, W. (1938) 4. Lowe, D.G.: Perceptual Organization and Visual Recognition. Kluwer, Boston (1985)
Stable Image Descriptions Using Gestalt Principles
327
5. Carreira, M., Mirmehdi, M., Thomas, B., Penas, M.: Perceptual primitives from an extended 4D Hough transform. Image and Vision Computing 20, 969–980 (2002) 6. Dolan, J., Weiss, R.: Perceptual grouping of curved lines. In: Proceedings of a workshop on Image understanding workshop, pp. 1135–1145. Morgan Kaufmann Publishers Inc., San Francisco (1989) 7. Parent, P., Zucker, S.W.: Trace inference, curvature consistency, and curve detection. IEEE Trans. Pattern Anal. Mach. Intell. 11, 823–839 (1989) 8. Feldman, J.: Perceptual grouping by selection of a logically minimal model. Int. J. Comput. Vision 55, 5–25 (2003) 9. Elder, J.H., Goldberg, R.M.: Ecological statistics of Gestalt laws for the perceptual organization of contours. J. Vis. 2, 324–353 (2002) 10. Kanizsa, G.: Organization in Vision. Praeger, New York (1979) 11. Marr, D.: VISION: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, San Francisco (1981) 12. Witkin, A., Tenenbaum, J.: On the role of structure in vision. In: Beck, J., Hope, B., Rosenfeld, A. (eds.) Human and Machine Vision, pp. 481–543. Academic, New York (1983) 13. Guy, G., Medioni, G.: Inferring global perceptual contours from local features. Int. J. Comput. Vision 20, 113–133 (1996) 14. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22, 888–905 (2000) 15. Yu, S.X., Shi, J.: Segmentation given partial grouping constraints. IEEE Trans. Pattern Anal. Mach. Intell. 26, 173–183 (2004) 16. Sarkar, S., Soundararajan, P.: Supervised learning of large perceptual organization: Graph spectral partitioning and learning automata. IEEE Trans. Pattern Anal. Mach. Intell. 22, 504–525 (2000) 17. Ommer, B., Buhmann, J.M.: A compositionality architecture for perceptual feature grouping. Energy Minimization Methods in Computer Vision and Pattern Recognition, 275–290 (2003) 18. Elder, J.H., Zucker, S.W.: Local scale control for edge detection and blur estimation. IEEE Trans. Pattern Anal. Mach. Intell. 20, 699–716 (1998) 19. Sinclair, D.: Voronoi seeded colour image segmentation. Technical Report TR99-04, AT&T Laboratories Cambridge (1999)
A Fast and Effective Dichotomy Based Hash Algorithm for Image Matching Zhoucan He and Qing Wang School of Computer Science and Engineering Northwestern Polytechnical University, Xi’an 710072, P. R. China
[email protected]
Abstract. Multi-view correspondence of wide-baseline image matching is still a challenge task in computer vision. There are two main steps in dealing with correspondence issue: feature description and similarity search. The wellknown SIFT descriptor is shown to be a-state-of-art descriptor which could keep distinctive invariant under transformation, large scale changes, noises and even small view point changes. This paper uses the SIFT as feature descriptor, and proposes a new search algorithm for similarity search. The proposed dichotomy based hash (DBH) method performs better than the widely used BBF (Best Bin First) algorithm, and also better than LSH (Local Sensitive Hash). DBH algorithm can obtain much higher (1-precision)-recall ratio in different kinds of image pairs with rotation, scale, noises and weak affine changes. Experimental results show that DBH can obviously improve the search accuracy in a shorter time, and achieve a better coarse match result.
1 Introduction Image matching is a fundamental task in computer vision, such as panorama building[1], object or scene recognition[2], image retrieval[3], 3D scene reconstruction based on images[4], and so on. Usually, image matching method integrated two main components: descriptor and k nearest neighbor (KNN) search. And the descriptor step should include a scale-and-rotation-invariant detector, such as DOG[2], HesssianLaplace detector[5], to detect interest points. The DOG detector is shown to be a-stateof-art one which is fast enough and invariant under large scales and rotations. While getting an interest point, a descriptor should be established to describe the point distinctively. Then, a 2NN search is carried out between two image feature sets, where nearest neighbor is selected as a determinate match only if it is much closer than the second-nearest one[2,6]. Thus, coarse match is completed. In order to get further accurate matching result, one could use RANSAC to exclude wrong matches. 1.1 Descriptors A well-performed descriptor should keep scale and rotation even affine transform invariant. There are several local invariant descriptors, such as SIFT[2], PCA-SIFT[9], GLOH[7], MOPs[1], spin-image and RIFT[7], and so on. Mikolajczyk[7,8] gave a survey on different kinds of descriptors, and presented detailed comparisons on those descriptors G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 328–337, 2008. © Springer-Verlag Berlin Heidelberg 2008
A Fast and Effective Dichotomy Based Hash Algorithm for Image Matching
329
to evaluate their capabilities. The conductive experimental results have shown that SIFT descriptor performs better than other descriptors. Due to the good performance of SIFT, it is treated as a milestone. Yan ke[9] proposed a low dimensional PCA-SIFT descriptor by projecting the SIFT descriptor onto the principal components of the normalized gradient field. GLOH, proposed by Mikolajczyk[7], expands the SIFT descriptor by changing the gradient location-orientation grid and the quantization parameters of histogram. In a word, all these attempts are reported to be few improvements, and they maybe improve SIFT on one aspect but perform weak on other aspects. Consequently, we adopt SIFT descriptor in this paper. 1.2 Search Algorithm High dimensional data search is quite an interesting and meaningful problem in pattern recognition[2], information retrieval[3], database querying[10], data analysis[10], and image matching[1-7]. A lot of researches on high dimensional data search has been done, where different kinds of data structures have been built according to data or spatial distributions, such as kd-tree[11], M-tree[10], SP-tree[14], R-tree and BBF[11], iDistance[10] and so on. While most people employed tree structures for high dimensional data searching, Indyk[12] brought a novel approach by establishing a hash index with random hash algorithm in high dimensional data search, which drawn an argument whether tree-based algorithms or hash based algorithms perform well[14]. This paper proposes a novel hash algorithm which could get better results in image matching than the traditional algorithms. The proposed method is named dichotomy based hash (DBH) algorithm. The core idea of DBH is based on the assumption that if two high dimensional vectors are matched or similar, they may have the same or similar attributes on some specific dimensions. First we select a key value for each dimension, and then randomly choose a given number of dimensions. Second we hash features into different buckets. For each query point, it would be hashed to the corresponding bucket by the same way on those chosen dimensions. The reminder of the paper is organized as follows. In section 2, we reviewed the traditional search algorithms for high dimensional vector, and then present the proposed DBH method in section 3. Section 4 shows the experimental results on extensive image matching. Finally the conclusions are drawn in Section 5.
2 Previous Work When referring to high dimensional data search, the issue of “curse of dimensionality” is often brought forward, since a lot of algorithms that did well in low dimensional situations can not process data whose dimension is over 15, such as kdtree and R-tree[11]. In fact, 15-D is quite low in some applications, however, SIFT descriptor reaches 128-dimension, and even some data surpass 1000-dimension [14]. k-d tree[11] is a k-dimensional binary search tree and the k-d tree index structure is established as follows. Firstly, add a set of N points in data space Rd to the root, and then find a key dimension and make a cut at the median value on the key dimension, so that the points are added to the left and right sub-trees equally. The process iterates
330
Z. He and Q. Wang
until the child node has only one feature data, as shown in Figure 1(a). When looking up the NN point to a query point q, it is first to traverse the tree to get the bin which contains the query point. Then following a backtrack stage. The search ends until all unexplored branches have been checked or reaches the preset iteration times. Nene[15] proposed a simple algorithm for nearest neighbor search, which is, for a ddimension query data , 2*d hyper planes are considered with the distance of ε to the query point on each dimension, and execute exhaustive search on those points that fell into the hyper cube, as shown in Figure 1(b). The key point of this method is the value of parameter ε . X1
X2
Z2 Z1
y+ε 2ε
Y1
P ( x , y , z)
y −ε
Y2
Y z +ε
Z X
x −ε
(a)
2ε
x +ε
z −ε
2ε
(b) Fig. 1. (a) kd-Tree (b) Hyper cube
More over, iDistance[10] (indexing the distance)partitioned the high dimensional feature set into n classes by using K-mean clustering algorithm, and then established a B+ tree for each class according to the distance to the central point p (reference point), Each class could be treated as hyper sphere centred at p. Given a query point q, and search radius r, we just calculate those spheres which include or insect with the query sphere. iDistance has been used in database search successfully. However, it is quite time-consuming to establish the index and lack of efficiency in image matching. BBF[11] (best bin first) is an expansion of kd-tree. Comparing to the standard kdtree, it is not necessary for BBF to explore every branch in the backtracking stage. It establishes a priority queue to record the distance of those traversed nodes on the key dimension. When execute backtracking, BBF selects the branch which has a minimum distance to the query point to search. By using this optimal strategy, BBF can massively reduce search times and could get more accurate result than kd-tree. Apart from tree structure based algorithms, LSH[13](local sensitive hash) is an alternative method for high dimensional data searching, which hashes similar features into the same bucket with high probability. There are three main steps in LSH search algorithm. Firstly, project feature vectors into Hamming space; Secondly, find K hash functions, for each hash function, randomly select L Hamming dimensions, and features are hashed to different buckets according to the values on those L dimensions. Finally, in query process, the query feature q is hashed to the corresponding bucket based on the
A Fast and Effective Dichotomy Based Hash Algorithm for Image Matching
331
same L Hamming dimensions, so what we do is just to calculate the points in those corresponding buckets.
3 Dichotomy Based Hash Based Search Algorithm In this section, we give a detailed description of the proposed DBH algorithm to facilitate the 2NN search for image matching. From aforementioned analyses, all those distance-based KNN search algorithms are attempt to find out the local or global distance attributes which can produce an effective description of NN. Thus, how to describe the attributes of the nearest neighbor effectively is key problem for KNN search. 3.1 The Basic Idea of DBH Algorithm Given two high dimensional features extracted from an image pair, it is obvious that, if they are very close to each other on some specific dimensions, they are probably similar to each other. Herein we say two values are ‘close’, is not the classical mean that the difference is small. The mean of ‘close’ is that two data may fall into a same numerical interval, that is to say, we measure local distance attributes by judging whether the feature vectors are distributed in a same data area. The proposed DBH algorithm is shown in Figure 2. In the procedure, we partition each dimension into two numerical intervals, and the partitioning value is dynamically selected according to the data distribution on each dimension. For a data set with the size N, we first compute data distribution on each dimension, and given an α ∈ (0,1) , we choose the ⎢⎡ N * α ⎥⎤ − th number as the partitioning value for each dimension. For two high dimensional vectors, having only one similar dimension (which falls into a same numerical interval) can not represent the similarity of the two vectors, so that we choose several dimensions, and repeat the process several times. In Figure 2, we say a bucket is a conflicted one when it has the query point. 3.2 The Procedure of DBH Algorithm Given two data sets A ∈ R d and B ∈ R d with the size NA and NB, we look for nearest neighbor from A for each vector in B, which means we establish hash index with A. At first, we obtain data distribution of each dimension in A. Given an α ∈ (0,1) , we select ⎡⎢ N A * α ⎤⎥ − th number (mean_value) as the partitioning value for each dimension. Secondly, randomly choose L non-repeated dimensions as the key dimensions on which we hash high dimensional features. In fact, for one vector p ∈ A , if the value on one key dimension is lower than the corresponding mean_value, it is assigned a value 0, if bigger or equal, assigned 1. All those binary value based on the L randomly chosen dimensions are made of a binary vector. If two high dimensional features have the same binary vectors, they should be put into a same bucket. After assigning all features in A, the hash index was completely established. Thirdly, in the search step, for a query feature q ∈ B , we execute the same process on those randomly selected L dimensions, and the query feature q will certainly be hashed to one bucket (we called
332
Z. He and Q. Wang
it as conflicted bucket). We just calculate the distances to all points in the conflicted bucket to get the nearest neighbor with a higher probability. Of cause, if we repeat the second and third steps for several times, we may get more accurate results. In fact, executing hash query process only one time does cost a little time and get a lower accuracy, there are two ways to improve the search algorithm. One way is just to establish only one hash function without any index, but needs to repeat the hash and query process several times, which will result in little memory cost but a little higher time cost. The other way is to establish K hash functions and hash features in advance, when executing query process, the query feature point is hashed by using the same hash functions and only search those buckets which may include the query feature. The later one is intuitional and more efficient because features are hashed beforehand and needn’t hash again in the query phase, just like those tree-based KNN search algorithms. After the hash index table is built, no modification is in need. However, the memory cost of the later method is much higher, which is nearly K times than the former one. As a result, in the following experiments, we employ the first method and we could repeat more than K times if more accurate nearest neighbor is in need, the procedure was shown in Figure 2. Feature set
Query features
12 randomly selected dimensions r1
r2
r3
…
…
…
r11
r12
Repeat the process for several times Buckets
…
…
…
…
0..00 0..01 0..10 12bits
… 1..10 1..11
Conflicted buckets
Fig. 2. The demonstration of DBH based search algorithm. We hash the feature set into different buckets based on the randomly chose 12 dimensions, and we hash the query points in the same way. The buckets which include the query point are called conflicted buckets. So we only need to calculate all the points in the conflicted buckets. Just repeat the process for several times, we could obtain accurate results.
There are two important parameters L and α ∈ (0,1) in DBH algorithm. The values of L and α decide the performance of DBH algorithm. Besides L and α , the other factor is the repeating times K of hashing and searching, which is decided by the specific application. If we want to get much higher accuracy, repeat hashing and searching more times, and vice versa.
A Fast and Effective Dichotomy Based Hash Algorithm for Image Matching
333
4 Experimental Results and Analysis In this section, we will conduct extensive experiments and show the performance of DBH. As we know, it is need to find out the nearest neighbor and the second one for a query feature in image coarse matching, so it is necessary to present the absolute accuracy of the 2NN search algorithms. For finding out nearest neighbor would be time-consuming and some applications do not need to find out the nearest neighbor, approximate nearest neighbors are enough to cope the problems and the time cost is low comparing to the KNN search so that the conception of ANN (approximate nearest neighbor) is quite popular. Here we also need to evaluate the capabilities by using ANN search. BBF[11] performs well in image matching, and is popular used in other NN search problems, and LSH as a new solution for NN search problems, has drawn many concerns for its strong capability in NN search problems, so we compare DBH to these two algorithms. As ground-truth results are in need to evaluate the performance, exhaustive search is also used in the experiments. 4.1 Experiment Configurations The whole experiments are divided into three aspects. The first one is to compare the performance of DBH, BBF and LSH algorithms for 2NN search. And the second one is to evaluate the capability of these algorithms in ANN search. The last one is to show the performance of the proposed DBH for image matching. As we know, BBF algorithm has an important parameter BBF_ SEARCH_TIME, which presents the times of backtrack. In our experiment, we keep it as 200 as Lowe recommended in [11]. In fact we could enlarge it for higher accuracy or reduce it for higher efficiency. We refer to the concept of “absolute accuracy” frequently, which is defined as follow: Given two high dimensional data sets A and B with the size NA and NB, we execute 2NN search from A for every feature in B, we get the correct answers by exhaustive search, then compare the 2NN results to the exhaustive answer, if we get NC correct answer, the absolute accuracy is NC/NB. As BBF is a state-of-art algorithm in image matching, we design our DBH algorithm by comparing to BBF to validate the efficiency and accuracy. As the search time is controllable, all the experiments are designed to keep the time cost of DBH a little smaller than BBF. The images used for comparison are from [16], and all feature vectors are 128-D SIFT descriptors extracted from those images. The experimental environment is PC with Pentium(R) 4 3.00 GHz CPU and 512M memory. The details of image and features are listed in Table1. The index size is the number of the feature vectors in the first image, and Ave size, Max size, Min size are the average number, max number and min number of the features in other images within the group, respectively. Table 1. Details of experimental images and number of SIFT features Image num Index size Ave size Max size Min size
Boat 10 8310 5005 7851 2478
Bricks 6 9686 10459 10741 10401
Car 6 3549 2759 3666 1204
East 10 4718 3965 5680 2133
Ensimag Graffiti Inria Laptop Resid 10 9 10 10 10 5123 2846 4676 1198 3330 4216 3889 2758 1426 2533 5768 4576 3816 2544 3051 3062 3201 1246 930 1724
Toy 10 1072 1190 1307 1066
Trees 6 11981 9486 15297 3105
334
Z. He and Q. Wang
4.2 Absolute Accuracy of Nearest Neighbor and Time Cost As 2NN search for image matching requires the accurate 1-NN and 2-NN as much as possible, we compare the absolute accuracy and time cost of 2NN results. We process 11 groups of images (97 images, about 380,000 SIFT features) with rotation, scale, view point changes and noises, and the size of those feature sets extracted from images are varied from 900s to 15,000s. For each group, we take the first image feature set as the reference data, and other image feature sets are used to search. We record the average time cost and average absolute accuracy of each algorithm for each group, as shown in the Table 2. Accu 1 and Accu 2 mean the accuracy rate of 1-NN and 2-NN respectively. Remember that the cost time of BBF is regarded as the reference. From Table 2, we could find that the accuracy of DBH is much higher than those of BBF and LSH, while the time cost of DBH and BBF are nearly equal. Table 2. The absolute accuracy and time cost of searching 2NN Ave size Boat Bricks Car East Ensimag Graffiti Inria Laptop Resid Toy Trees
5005 10459 2759 3965 4216 3889 2758 1426 2533 1190 9486
LSH 44.34 37.97 43.97 54.89 43.42 45.86 64.01 57.49 65.02 40.72 41.06
Accu 1 (%) BBF 41.53 36.53 62.56 54.06 49.52 56.68 63.07 81.67 68.05 78.57 30.53
DBH 60.16 63.10 81.75 67.08 55.49 70.36 64.61 87.80 73.48 86.47 55.08
LSH 22.78 15.24 18.92 35.75 25.06 27.98 46.44 25.34 47.69 12.87 19.59
Accu 2 (%) BBF 16.68 10.95 31.72 27.97 24.44 32.24 38.40 55.00 44.64 52.58 9.60
DBH 34.76 32.75 54.77 43.19 33.59 48.84 40.77 65.71 52.02 61.38 28.32
LSH 11166 21078 2531 10576 6793 3839 5604 1404 5444 1204 34115
Time (ms) BBF 4661 10231 2487 3687 3859 3525 2654 1354 2277 942 9246
DBH 4574 10509 2349 3583 3753 3384 2614 1173 2128 758 9412
4.3 Approximate Accuracy Generally speaking, if an algorithm can perform well for KNN search, it should perform well in ANN under the same parameters. Here, we took Lowe’s measurement to evaluate the performance of ANN results [11]. Given two high dimensional data sets A and B with the size NA and NB, we execute ANN search from A for every feature in B, we get the correct answers by exhaustive search, and the approximate accuracy ac is defined as ac =
1 NB di ∑ N B i =1 Di
(1)
where Di is the standard nearest distance, and di is the experimental nearest distance. Obviously, ac≥1, and the value of ac is smaller, the result is better. We use the same data used in the previous experiment, and record the average approximate accuracy for each group. As shown in Table 3 and 4, we can see that DBH performs better than BBF and LSH in finding 1NN and 5NN. There are two reasons: firstly, DBH can find more real nearest neighbor. On the other hand, even though DBH loses the nearest neighbor, it may find a feature quite close to the nearest one.
A Fast and Effective Dichotomy Based Hash Algorithm for Image Matching
335
Table 3. The approximate accuracy of 1NN LSH BBF DBH
Boat 1.29 1.29 1.07
Bricks 1.42 1.38 1.06
Car 2.32 1.35 1.03
East 1.26 1.18 1.06
Ensimag Graffiti 1.25 1.31 1.13 1.13 1.08 1.05
Inria 1.33 1.18 1.12
Laptop Resid 1.97 1.27 1.08 1.12 1.02 1.06
Toy 1.84 1.07 1.02
Trees 1.19 1.24 1.07
Toy 1.494 1.062 1.041
Trees 1.191 1.201 1.097
Table 4. The approximate accuracy of 5NN LSH BBF DBH
Boat 1.241 1.216 1.101
Bricks 1.229 1.230 1.087
Car East Ensimag Graffiti 1.285 1.267 1.262 1.242 1.193 1.168 1.145 1.150 1.065 1.089 1.115 1.062
Inria Laptop Resid 1.206 1.426 1.238 1.172 1.079 1.130 1.173 1.044 1.101
4.4 Comparison of (1-Precision)-Recall Ratio In order to evaluate the performance of image matching by different algorithms, Yan Ke[9] proposed the (1-precision)-recall ratio and it was quickly adopted to judge the capability of image matching algorithms. In this experiment, we choose 4 image pairs with rotation, scale, affine changes and noise to show the results of BBF, LSH and DBH algorithms, as shown in Figure 3. Figure 4 shows the encouraging results that (1-precision)-recall curves by using DBH algorithm is fairly better than BBF and LSH, especially it is nearly close to exhaustive method. Since exhaustive search is the most accurate method for the distance-based image matching, these results show that DBH is well adapted for matching SIFT descriptors.
Fig. 3. The images used in evaluating the performance of (1-precision)-recall ratio
4.5 Parameter Analysis The parameters α and L decide the capability of the proposed algorithm. Even using the same parameters, different results may be obtained due to the different distributions of high dimensional data. Now, we choose feature pairs with different size (from 1000 to 11000) to check how α and L affect the accuracy while the search time is nearly same as BBF. The parameter α decides how to partition the feature space on each dimension. We commonly select the mean value to divide feature space, however, different data distributions require different parameter in order to get good result. For discovering
336
Z. He and Q. Wang
Boat (scale and rotation)
Bricks (affine transformation)
Car (light variation)
Tree (noise)
Fig. 4. (1-precision)-recall curves of different image pairs
(a) α
(b) L
Fig. 5. The parameter settings of α and L
the relationship between accuracy and α , we keep L=12. Figure 5(a) shows that α has big affections on accuracy especially when the feature size is large. The “toy” pair keeps linear because its size is only about 1000, and other pairs are over 3000, especially the size of “tree” surpasses 10000. The better choice of α is 0.6. In Figure 5(b), we keep α =0.6 and find that L has fewer effects than α , and 12 is a proper value for it.
5 Conclusions In this paper, we propose a novel dichotomy based hash algorithm for dealing with the image coarse matching problem. Experimental results have shown that our algorithm can obtain high accuracy in KNN and ANN search, and achieve high (1-precision)-recall ratio in image matching. The proposed algorithm can sufficiently dig out the nearest neighbor for SIFT features and obtain greater improvement than BBF and LSH during the coarse matching phase.
A Fast and Effective Dichotomy Based Hash Algorithm for Image Matching
337
Acknowledgments This work is supported by National Hi-Tech Development Programs under grant No. 2007AA01Z314 and National Natural Science Fund (60873085), P. R. China.
References 1. Brown, M., Szeliski, R., Samarin, A.: Multi-Image Matching using Multi-Scale Oriented Patches. In: CVPR, pp. 510–517 (2005) 2. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Key points. International Journal of Computer Vision 60(2), 91–110 (2004) 3. Ke, Y., Sukthankar, R., Huston, L.: Efficient Near duplicate Detection and Sub-image Retrieval. In: ACM. International Conference on Multimedia (2004) 4. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3d. ACM Trans. Graph. 25(3), 835–846 (2006) 5. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. In: IJCV, pp. 63–86 (2004) 6. Omercevic, D., Drbohlav, O., Leonardis, A.: High-Dimensional Feature Matching: Employing the Concept of Meaningful Nearest Neighbors. In: ICCV (2007) 7. Mikolajczyk, K., Schmid, C.: A Performance Evaluation of Local Descriptors. IEEE Transactions on PAMI, 1615–1630 (2005) 8. Mikolajczyk, K., Matas, J.: Improving Descriptors for Fast Tree Matching by Optimal Linear Projection. In: Proc. ICCV (2007) 9. Ke, Y., Sukthankar, R.: PCA-SIFT: A More Distinctive Representation for Local Image Descriptors. In: CVPR, pp. 511–517 (2004) 10. Jagadish, H.V., Ooi, B.C., et al.: iDistance: An adaptive B + 2tree based indexing method for nearest neighbor search. ACM Trans. on Data Base Systems, 364–397 (2005) 11. Beis, J.S., Lowe, D.G.: Shape Indexing Using Approximate Nearest-Neighbor Search in High-Dimensional Spaces. In: CVPR, pp. 1000–1006 (1997) 12. Indyk, P., Motwaniy, R.: Approximate Nearest Neighbors Towards Removing the Curse of Dimensionality. In: 30th ACM Symp. on Theory of Computing, pp. 604–612 (1998) 13. Gionis, A., Indyky, P., Motwaniz, R.: Similarity Search in High Dimensions via Hashing. The VLDB Journal, 518–529 (1999) 14. Liu, T., Moore, A.W., Gray, A., Yang, K.: An Investigation of Practical Approximate Nearest Neighbor Algorithm. In: NIPS, pp. 825–832 (2004) 15. Nene, S.A., Nayar, S.K.: A Simple Algorithm for Nearest Neighbor Search in High Dimensions. IEEE Transactions on PAMI 19(9), 989–1003 (1997) 16. http://lear.inrialpes.fr/people/Mikolajczyk
Evaluation of Gradient Vector Flow for Interest Point Detection Julian St¨ottinger1 , Ren´e Donner1,2,3 , Lech Szumilas4 , and Allan Hanbury1 1
4
PRIP, Vienna University of Technology, Austria 2 ICG, Technical University Graz, Austria 3 CIR, Vienna Medical University, Austria ACIN, Vienna University of Technology, Austria
Abstract. We present and evaluate an approach for finding local interest points in images based on the non-minima suppression of Gradient Vector Flow (GVF) magnitude. Based on the GVF’s properties it provides the approximate centers of blob-like structures or homogeneous structures confined by gradients of similar magnitude. It results in a scale and orientation invariant interest point detector, which is highly stable against noise and blur. These interest points outperform the state of the art detectors in various respects. We show that our approach gives a dense and repeatable distribution of locations that are robust against affine transformations while they outperform state of the art techniques in robustness against lighting changes, noise, rotation and scale changes. Extensive evaluation is carried out using the Mikolajcyzk framework for interest point detector evaluation.
1
Introduction
Interest points are an important tool in many current solutions to computer vision challenges. Fergus et al. [1] point out that frameworks using interest points are heavily dependent on the detector to gather useful features. They detect locations of characteristic structures such as blobs, corners or local image symmetry. The interest point and interest region detectors allow for the reduction of computational complexity in scene matching and object recognition applications by selecting only a subset of image locations corresponding to specific and/or informative structures. The extraction of stable locations is a successful way to match visual input in images of the same scene acquired under different conditions. As evaluated in [2], successful approaches extracting stable locations rely on corner detection [3,4] or blobs like Maximally Stable Extremal Regions (MSER) [5] and Difference of Gaussians (DoG) [6]. The well known approach of detecting local symmetry [7] is also included in our evaluation. Donner et al. [8] take the minima of the Gradient Vector Flow (GVF) [9] with one manually adjusted set of parameters to locate centers of local symmetry at a certain scale. We extend their approach and propose a GVF based scale space pyramid and a scale decision criterion to provide general purpose interest points. This multi-scale orientation invariant interest point detector has the G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 338–348, 2008. c Springer-Verlag Berlin Heidelberg 2008
Evaluation of Gradient Vector Flow for Interest Point Detection
339
aim of providing stable and densely distributed locations. Due to the iterative gradient smoothing during the computation of the GVF, it takes more surrounding image information into account than other detectors. Its stability against noise, blur, JPEG artifacts, rotation and illumination change makes it a promising approach for many applications in computer vision. For example, low quality images and videos in on-line applications and used in mobile computing suffer from such effects. Medical imaging also often deals with low contrast images. In the next section, we give an overview of the interest point detectors used. We introduce our approach in detail in Section 3. Experiments and results are given in Section 4. Finally, the conclusion is given in Section 5.
2
Interest Points
We describe the most successful approaches for detecting interest points. For the most stable and broadly used corner detectors, we choose the Harris Laplacian approach for its excellent performance in [2] and describe it in Section 2.1. In Section 2.2, we go into more detail about broadly used blob detectors: DoG and MSER. Symmetry based interest points are covered in Section 2.3. 2.1
Harris Laplacian Detector
The Harris corner detector introduced in [3] provides a corner measure for image data. The pixels are analyzed to result in a one dimensional corner measure, also referred to as Harris energy. It is based on the trace and determinant of the second moment matrix M . An extension of the Harris corner detector, the scaleadapted Harris detector, was introduced to achieve scale invariance by Mikolajczyk and Schmid [10]. The resulting patch size is the size of the Gaussian kernel as the scale σ of the corner detector. The second moment matrix of a certain scale is then 2 Lx (σD ) Lx Ly (σD ) 2 M = σD GσI ⊗ (1) Lx Ly (σD ) L2y (σD ) where Lx and Ly are respectively derivatives in the x and y direction calculated after smoothing the image by a Gaussian of size σD (derivation scale). GσI is a Gaussian function of width σI (integration scale). As suggested in [10], we set σI = 3σD . To detect the characteristic scale, the maxima of the Laplacian of Gaussian function Λ are used [11,4] and is extended to take advantage of different color spaces in [12]. 2.2
Blob Detectors
Blob detectors, based on space-scale theory introduced to computer vision by Witkin [13] and extended by Lindeberg [14], rely on differential methods such as Laplacian of Gaussian (LoG), difference of Gaussian (DoG) and Determinant
340
J. St¨ ottinger et al.
(a) image 1
(b) image 6
Fig. 1. Harris Laplacian detector applied to the graffiti testset (see Fig. 7(a)) image 1 (a) and image 6 (b). The size of the circles indicates the size (scale) of the kernel with the highest peak.
(a) image 1
(b) image 6
Fig. 2. DoG applied on image 1 (a) and image 6 (b) of the graffiti testset
of Hessian (DoH) [11]. The result of blob detection using either LoG or DoG methods depends on the choice of the scale sampling rate which is analyzed in [6]. Another technique within the class of blob detectors but unrelated to scale-space theory is MSER [5] which is outlined further on. DoG. As demonstrated in [6], LoG results can be approximated with the DoG at reduced computational complexity. In this case the Laplacian operator Λσ is approximated by the difference between two Gaussian smoothed images. The scale space Sσ is defined by Sσ = Gσ ⊗ I. The image pyramid level Dσi is computed as the difference of the √ i input image I convolved with Gaussian kernels of size σi = 2 as Dσ = (G√2σ − Gσ ) ⊗ I. The implementation leads to an early diminished dataset, as the majority of the pixels are discarded at the first scale level. To discard edges and prioritize corners, the already calculated Hessian matrix is used to process an adapted Harris corner detection algorithm. An example of the resulting interest points is given in Fig. 2. MSER. [5] are obtained by a watershed-like algorithm. Connected regions of a certain thresholded range are selected if they remain stable over a set of thresholds. The algorithm is efficient both in run-time performance and detection rate. The region priority is measured in the number of thresholds where the region remains stable.
Evaluation of Gradient Vector Flow for Interest Point Detection
(a) image 1
341
(b) image 6
Fig. 3. Example of MSER locations applied on the graffiti testset image 1 (a) and image 6 (b). The ellipses mark the most stable blobs.
An image region Q is extremal if the intensity of all pixels q ∈ Q is higher than the intensity of boundary pixels p (adjacent to Q) I(q) > I(p) for maximum intensity regions or lower I(q) < I(p) for minimum intensity regions. Region Q is a contiguous image patch i.e. there is a path S connecting any two pixels q ∈ Q such that S ∈ Q. Extremal regions are then such that I(Qi + Δ) > I(Qi ) > I(Qi − Δ). The maximally stable extremal region is the one for which variation of the area q(i) has a local minimum at i: q(i) =
|Qi+Δ | − |Qi−Δ | |Qi |
(2)
Ellipses fitted to the MSER locations can be seen in Fig. 3. 2.3
Symmetry Based Interest Points
The Generalized Symmetry Transform (GST) [15] inspired the Fast Radial Symmetry Transform (FRST) by Loy and Zelinsky [16,7]. A pixel of the image contributes to a symmetry measure at two locations called negatively and positively affected pixels. The coordinates of negatively affected p−ve and positively affected p+ve pixels are defined by the gradient orientation at pixel p and a distance n (called range in [16]) as follows: g(p) g(p) p+ve = p + round n , p−ve = p − round n (3) g(p) g(p) The symmetry measure is a combination of orientation projection On and magnitude projection Mn maps, which are obtained through agglomeration of positively and negatively affected pixel contributions. Each positively affected pixel increments the corresponding element of the orientation projection map by 1 and magnitude projection map by g(p) while the negatively affected pixel decrements the map by these values: On p+ve (p) = On p+ve (p) + 1, On p−ve (p) = On p−ve (p) − 1 (4) Mn p+ve (p) = Mn p+ve (p) + g(p), (5) Mn p−ve (p) = Mn p−ve (p) − g(p)
342
J. St¨ ottinger et al.
(a) image 1
(b) image 6
Fig. 4. Example of Loy symmetry points with a simple scale selection applied on the graffiti testset image 1 (a) and image 6 (b). The size of the circles indicates the size of the range with a symmetry peak.
The radial symmetry measure at range n is a combination of normalized orientation and magnitude projection maps, additionally smoothed by a Gaussian kernel: α Mn |On | Sn = Gσn ⊗ (6) kn kn where kn is the scale normalization factor and α is the radial strictness parameter which allows to attenuate the symmetry response from ridges. The orientation projection map used for final calculations is thresholded using kn . The symmetry measure can be also averaged over a set of ranges N = {n1 , ...nK } to achieve partial scale invariance: 1 S= Sn (7) K n∈N
An exhaustive discussion of the parameter choice and results are presented in [7]. We refer to these interest points as Loy Points (see Fig. 4).
3
Gradient Vector Flow Based Interest Points
To detect points of high local symmetry we use the GVF based interest points (GVF points) as proposed in [8]. The GVF [9], which yields a rotation invariant vector field, was originally proposed to increase the capture range of active contours. It is defined as the vector field v(x, y) = (u(x, y), v(x, y)) which minimizes G= μ(u2x + u2y + vx2 + u2y ) + |∇f |2 |v − ∇f |2 dxdy (8) where f denotes the edge map of image I f 2 (x, y) = |(Gσ (x, y) ∗ I(x, y))|
(9)
and the parameter μ gives the relation between the first smoothing term (compare with the classic optical flow calculation [17]) and the second term. Its strengths include the ability to detect even weak structures while being robust to high amounts of noise in the image. When |∇f | is small, the energy yields a very smooth field.
Evaluation of Gradient Vector Flow for Interest Point Detection
(a) frontal
343
(b) 20 degree (c) 50 degree (d) 1st scale (e) 2nd scale (f) 3rd scale
Fig. 5. GVF points on image details of the graffiti testset. (a)-(c): GVF points under geometric transformation. (d)-(f): GVF points of the first three scales of one image detail.
(a) GVF points on image 1
(b) GVF points on image 6
Fig. 6. GVFpoints applied to image 1 (a) and image 6 (b) of the graffiti testset
The field magnitude |G| is largest in areas of high image gradient, and the start and end points of the field lines of G are located at symmetry maxima. E.g. in the case of a symmetrical structure formed by a homogeneous region surrounded by a different gray level value the field will point away from or towards the local symmetry center of the structure. The symmetry interest points are thus defined as the local minima of |G|. In contrast to techniques based on estimating the radial symmetry using a sliding window approach this will yield a sparse distribution of interest points even in large homogeneous regions. We increase μ iteratively thus smoothing G. As we lose information on local structure and v takes a gradually larger area into account, a rotation invariant scale space pyramid is built. For the experiments, the parameters μ = 0.1 and scale factor f = 1.33 are used. We apply the scale factor five times per image smoothing G for taking more area into account. Examples of the resulting images are shown in Fig. 5: (d)-(f) show the distributions of the resulting points for increasing scale; (a)-(c) show the interest points on geometric transformations. Further examples are shown in Fig. 6.
4
Results
In this section, a performance evaluation of the GVFpoints is given. We show that they outperform current approaches for invariant interest point locations in several important tasks.
344
J. St¨ ottinger et al.
(a) Test-set graffiti depicts a painted wall under heavy viewpoint changes.
(b) Test-set boat changes the viewpoint and the zoom level while rotating the scene.
(c) Test-set cars provides a natural scene at different daytimes.
(d) 6 out of 20 images of the test-set toy. It provides a natural scene under different lighting directions. Main challenge is the stability against shadowing effects.
(e) Test-set bikes with different bikes becoming more and more blurred.
(f) Test-set ubc adds JPEG compression to a natural scene. Fig. 7. One data-set per challenge used in the experiments
We give an overview of the experimental setup and then give a detailed review of the results of the experiments. Mikolajczyk and Schmid [2] suggest a test for the quality of local features. They measure the repeatability of local features under different image transformations. These tests consist of a set of images, where one acts as the reference image and the other images show the same scene under predefined changes like blur, rotation, zoom, viewpoint change, JPEG compression or lighting changes1. Having a correct homography H between two images L and L , we can project a point x in L into the transformed image x in L . Regarding the area of interest points, the overlap of the projected area is estimated. Areas are normalized so 1
http://lear.inrialpes.fr/people/mikolajczyk/Database
Evaluation of Gradient Vector Flow for Interest Point Detection
345
GVFpoints 60 40 20 0
1
2
3
4
5
4
5
4
5
MSER 60 40 20 0
1
2
3
DoG 60 40 20 0
1
2
3
Harris Lapl. 60 40 20 0
1
2
3
4
5
4
5
Loy Points 60 40 20 0
1
2
3
Fig. 8. Histogram of the ranks of the compared algorithms. For each of the 91 test images the algorithms were sorted according to their performance.
ø(rep) corresp ø(area) std(area) nr. pts
GVFp 74.2 7361.7 2496.2 4273.5 17349
MSER 66.9 215.8 1565.7 3930.7 533
DoG 49.5 1984 65.1 149.5 5479
HarLap 42.0 1016.8 4490.5 17926.1 3116
Loy 36.0 474.5 684.4 1013.8 1690
Fig. 9. Repeatability experiment test-set graffiti – viewpoint transformation of colourful patterns. GVFpoints perform best for each of the test images.
that each radius is set to 30 pixels. If the overlapping error is below 40% to the nearest neighbour, the interest point is repeated. The repeatability rate is defined as the ratio between the number of detected correspondences and the number of regions that occur in the area common to both images. Feature detectors tend to have higher repeatability rates when they produce a richer description. Twelve commonly used data-sets have been chosen for this evaluation. Examples of the images are shown in Fig.7. The histograms in Fig. 8 provide a summary view of the ranks of the individual algorithms. Each of the 91 reference image / test image pairs was treated as a separate experiment. For each of these, the algorithms were ranked according to their repeatability from 1 to 5. In 57.1% of the cases the GVFpoints exhibited the best performance (rank=1), while in 80.2% they performed either best or second best (rank≤2). Harris Laplacian and Loy’s symmetry points show far lower performance, with Loy having the lowest performance (rank=5) in 47% of cases. MSER and DoG display mixed results: while leading in same cases they fare badly in others, exhibiting an average performance overall. For one testset per challenge, the repeatability graphs and numerical results are given. These statistics give the means of the repeatability rate, number of correspondent regions, area of the circles in the images in pixel2 , standard deviation of the area and the average number of interest points in the image. GVFpoints show to be repeatable under geometric transformation (Fig.10). Elongated structures like the ones found in the graffiti testset (see Fig. 7(a) and Fig. 5) are centered precisely. This works also for MSER, having very well defined blobs on the wall. Therefore, DoG performs also better than the Harris Laplacian as the corners are heavily transformed during the challenge. For Loy points, no repeatability is found for the last two test images. Note the high number of GVFpoints compared to the other approaches because of the elongated structure of
346
ø(rep) corresp ø(area) std(area) nr. pts
J. St¨ ottinger et al.
GVFp 89.9 15828.8 2481.8 4252.9 28093
MSER 62.4 1026 507.8 1459.3 2363
DoG 69.6 8270.8 52.8 65.0 13604
HarLap 51.5 3036.2 1714.9 2789.4 7599
Loy 52.9 776.5 669.2 1007.6 1600
ø(rep) corresp ø(area) std(area) nr. pts
GVFp 75.5 6990.6 2422.3 4206.8 17625
MSER 61.3 497 1083.8 4320.7 1524
DoG 69.1 4969.7 51.6 33.9 13797
HarLap 55.8 1628.4 3857.0 11852.8 4961
Loy 42.2 502.3 816.6 1077.8 1803
Fig. 10. Repeatability experiment test-set Fig. 11. Repeatability experiment test-set bricks – viewpoint transformation on a boat for zoom and rotation highly structured plane
ø(rep) corresp ø(area) std(area) nr. pts
GVFp 83.5 10467.2 2369.7 4168.9 14524
MSER 70.1 90.2 1395.7 3824.4 172
DoG 71.3 8623 51.4 32.8 12477
HarLap 57.9 421.5 2345.5 4916.2 813
Loy 60.2 950.7 576.4 934.0 2106
mean(rep) corresp mean(area) std(area) nr. points
GVFp 64.1 3804.2 2493.1 4264.8 6995
MSER 54.0 56.3 350.8 625.5 117
DoG 65.4 3742.2 54.0 76.6 5732
HarLap 48.9 348.8 3671.8 11198.4 799
Loy 53.8 538.4 263.5 621.3 1238
Fig. 12. Repeatability experiment test-set Fig. 13. Repeatability experiment test-set cars with increasing illumination toy with changing lighting direction
mean(rep) corresp mean(area) std(area) nr. points
GVFp 96.1 13795.7 2476.8 4248.7 19648
MSER 63.9 193.2 580.2 1726.0 616
DoG 63.1 6893 51.4 32.0 14382
HarLap 57.9 1096.7 5894.8 25058.1 2111
Loy 59.8 898.3 669.2 1007.6 1600
mean(rep) corresp mean(area) std(area) nr. points
GVFp 84.9 11743.2 2403.7 4194.2 15388
MSER 58.1 415.7 451.8 1708.6 770
DoG 67.2 5160.2 51.4 36.8 8417
HarLap 75.2 1482.3 2839.4 8376.9 2053
Loy 64.2 983.8 669.2 1007.6 1600
Fig. 14. Repeatability experiment test-set Fig. 15. Repeatability experiment test-set ubc for JPEG compression bikes challenging blur
Evaluation of Gradient Vector Flow for Interest Point Detection
347
the blobs, which increases the repeatability rate. On small, often repeated structures like in test-set bricks, GVFpoints are able to estimate correspondences for over 75% of all locations, even after 60 degree of transformation (Fig. 9). The experiment on the testset boat (see Fig. 7(b)) shows that GVFpoints exhibit higher repeatability at small details, being more invariant to rotational change than other approches. As shown in Fig. 12 and Fig. 13, GVF based points are more stable against changing illumination than all other interest point detectors. Linear illumination change does not affect the GVF to the same degree as the other interest point detectors. For heavy change of lighting, MSER provide slightly more stable locations than GVF. Fig. 14 shows the GVFpoints are almost perfectly invariant to blur. Local noise like the JPEG compression artefacts in testset ubc is evaluated in Fig. 15. We show that the GVFpoints provide more stable locations up to the point where the extrema are significantly shifted by the newly introduced structures. Surprisingly, MSER turn out to be very unstable to this kind of noise, where Loy points provide better results. Harris Laplacian points are obviously more stable and perform almost comparably to the DoG.
5
Conclusion
We showed that for the majority of challenges, interest points based on GVF provide more stable locations than the well known and broadly used corner or blob detectors. They give a rich and well-distributed description for diverse visual data. Especially the invariance against linear and arbitrary lighting and illumination changes as well as viewpoint transformations makes the proposed interest points well suited for many problems dealing with rotation, noise, low contrast or heavy compression. These effects often occur in on-line and mobile applications. Medical imaging often deals with low contrast images, where the GVFpoints have advantages compared to corner or blob detectors. The GVF deals with those challenges in a very stable way, being almost invariant to blur and more repeatable towards JPEG compression than other detectors.
Acknowledgment This work was partly supported by the NB project 12537 (COBAQUO) and the Austrian Research Promotion Agency (FFG) project 815994.
References 1. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scale-invariant learning. In: CVPR, pp. 264–271 (2003) 2. Mikolajczyk, K., Schmid, C.: Scale and affine invariant interest point detectors. IJCV 60, 63–86 (2004) 3. Harris, C., Stephens, M.: A combined corner and edge detection. In: AVC, pp. 147–151 (1988)
348
J. St¨ ottinger et al.
4. Mikolajczyk, K., Schmid, C.: Indexing based on scale invariant interest points. In: ICCV, pp. 525–531 (2001) 5. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. In: BMVC, pp. 384–393 (2002) 6. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60, 91– 110 (2004) 7. Loy, G., Zelinsky, A.: Fast radial symmetry for detecting points of interest. PAMI, 959–973 (2003) 8. Donner, R., Miˇcuˇs´ık, B., Langs, G., Bischof, H.: Sparse MRF appearance models for fast anatomical structure localisation. In: Proc. BMVC (2007) 9. Xu, C., Prince, J.L.: Snakes, shapes, and gradient vector flow. In: PR (1998) 10. Mikolajczyk, K., Schmid, C.: An affine invariant interest point detector. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 128–142. Springer, Heidelberg (2002) 11. Lindeberg, T.: Feature detection with automatic scale selection. IJCV 30, 79–116 (1998) 12. St¨ ottinger, J., Hanbury, A., Sebe, N., Gevers, T.: Do colour interest points improve image retrieval? In: ICIP, pp. 169–172 (2007) 13. Witkin, A.P.: Scale-space filtering. In: IJCAI, pp. 1019–1022 (1983) 14. Lindeberg, T.: Effective scale: A natural unit for measuring scale-space lifetime. In: ISRN KTH (January 1994) 15. Reisfeld, D., Wolfson, H., Yeshurun, Y.: Context free attentional operators: the generalized symmetry transform. JCV (1994) 16. Loy, G., Zelinsky, A.: A fast radial symmetry transform for detecting points of interest. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 358–368. Springer, Heidelberg (2002) 17. Horn, B., Schunck, B.: Determining optical flow. Artiflcial Intelligence 17, 185–204 (1981)
Spatially Enhanced Bags of Words for 3D Shape Retrieval Xiaolan Li1,2, Afzal Godil1, and Asim Wagan1 1
National Institute of Standards and Technology, USA 2 Zhejiang Gongshang University, P.R. China {lixlan,godil,wagan}@nist.gov
Abstract. This paper presents a new method for 3D shape retrieval based on the bags-of-words model along with a weak spatial constraint. First, a two-pass sampling procedure is performed to extract the local shape descriptors, based on spin images, which are used to construct a shape dictionary. Second, the model is partitioned into different regions based on the positions of the words. Then each region is denoted as a histogram of words (also known as bag-of-words) as found in it along with its position. After that, the 3D model is represented as the collection of histograms, denoted as bags-of-words, along with their relative positions, which is an extension of an orderless bag-of-words 3D shape representation. We call it as Spatial Enhanced Bags-of-Words (SEBW). The spatial constraint shows improved performance on 3D shape retrieval tasks.
1 Introduction With recent advances in computer aided modeling and 3D scanning technology, the number of 3D models created and stored in 3D shape repositories is increasing very rapidly. 3D models are widely used in several fields, such as computer graphics, computer vision, computer aided manufacturing, molecular biology, and culture heritage, etc. Therefore it is crucial to design effective and efficient methods for retrieving shapes from these 3D shape repositories. Appearance and geometry are two main aspects of a 3D model. However, most existing search engines focus either on appearance [Chen03] [Osada02] or on geometry [Siddiqi08] alone. Even though for these methods [Papadakis07] [Vranic03], which explore both features, they ask for specifically geometric ordering for the appearance features. They use spherical harmonic transform to combine these two properties together, which requires the order of the sampling positions to be sorted according to two spherical angle coordinates. This requirement reduces the flexibility of the method. The bag-of-words methods, which represent an image or a 3D shape as an orderless collection of local features, have recently demonstrated impressive level of performance [Li05]. However, it totally disregards the information about the spatial layouts, which limits its descriptive ability. In this paper, we would like to incorporate a weak geometry constraint into the bag-of-words framework, which will compensate for the shortcoming of the framework and keep its flexibility as much as possible. G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 349–358, 2008. © Springer-Verlag Berlin Heidelberg 2008
Our approach is most similar to that of [Liu06] [Shan06]. The similarities include: 1) the spin image is chosen as the local descriptor; 2) the bag-of-words model supports the whole framework of 3D shape retrieval. The main differences between our work and theirs are two-fold. First, we incorporate a weak spatial constraint which improves the descriptive capability compared to the original bag-of-words model. Second, we introduce a new similarity metric which accounts for both appearance similarity and geometry similarity. We start by uniformly sampling basis points and support points on the surface of the model, which makes the method insensitive to the tessellation and resolution of the mesh. After extracting a set of spin images for each model, we construct a large shape dictionary by clustering all spin images acquired from the whole training dataset. Instead of representing one model with a single histogram of words from the dictionary, the model is partitioned into several regions by clustering the basis points according to their spatial positions, and then represented as a set of histograms (bags-of-words) together with the pairwise spatial relations of the regions. After an appearance-based correspondence is built between two models, the spatial difference between the layouts of the regions, referred to as geometry dissimilarity, is calculated and used as the second part of the dissimilarity metric. As the 2D example in Figure 1 shows, even if two images contain the same components, the spatially enhanced bags-of-words (SEBW) method will differentiate one from the other (1-b), whereas the original bags-of-words method will regard them as the same (1-a).
(a) Representing images with bags-of-words model. The left and the right image are both partitioned into 5 parts, and each part is represented with a histogram of the words (bag-of-words) with no spatial information of the parts taken into account. Under this kind of representation, the left and the right images are regarded as the same
(b) Representing images with spatially enhanced bags-of-words model. Besides the bags-of-words representation, the representation of the image also includes the geometric links of pairwise parts, recorded as weighted edges. Under this kind of representation, the left and the right images are different
Fig. 1. Comparing bags-of-words model and the spatially enhanced one
The organization of the paper is as follows. Related work is summarized in Section 2. The bag-of-words model and the spatially augmented bags-of-words model are presented and defined in Section 3. Then, the procedures of feature extraction and similarity computation are described in Sections 4 and 5, respectively. In Section 6, we provide the 3D shape retrieval results on the Princeton Shape Benchmark
(PSB) [Shilane04] and analyze the influence of the parameters. Finally, we conclude this paper in Section 7.
2 Previous Work Designing discriminating 3D shape signatures is an active research area [Iyer05]. Among these, statistics-based methods hold a very important position, as they can be adopted as part of a machine learning framework. Statistical properties, frequently represented as histograms [Osada02], have been used to describe the appearance of objects and scenes for a long time. In [Osada02], several appearance properties are recorded with histograms, including the distances between two randomly selected surface points, the areas of the triangles formed by three randomly selected surface points, and so on. Instead of calculating the Euclidean distance between every two surface points, Ruggeri and Saupe [Ruggeri08] compute the histograms based on geodesic distances, which are effective for retrieving articulated models. Besides histograms, probability density functions [Akgul07] are also used to describe the 3D model, which is more robust to an asymmetrical distribution of the triangles. Because of its simplicity and generality, the bag-of-words method, which is insensitive to deformation, articulation and partially missing data, has attracted a lot of interest in both the 2D [Li05] and 3D [Shan06] [Liu06] [Ohbuchi08] fields. In [Li05], the method is applied to images by using a visual analogue of a word, formed by vector quantizing two regional descriptors: normalized 11x11 pixel gray values and SIFT descriptors. In [Shan06] and [Liu06], a visual feature dictionary is constructed by clustering spin images of small regions. In order to support partial-to-whole retrieval, the Kullback-Leibler divergence is proposed as a similarity measurement in [Liu06], while a probabilistic framework is introduced in [Shan06]. For the sake of collecting visual words, Ohbuchi et al. [Ohbuchi08] apply the SIFT algorithm to depth buffer images of the model captured from uniformly sampled locations on a view sphere. After vector quantization, the Kullback-Leibler divergence measures the similarities of the models. Despite the many advantages of these methods, they suffer from their extreme simplicity: because all of the spatial layout information of the features is discarded, their descriptive capability is severely constrained. Lazebnik et al. [Lazebnik06] propose a spatially enriched bags-of-words approach. It works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. Implicit geometric correspondences of the sub-regions are built in the pyramid matching scheme [Grauman05]. In [Savarese07], the object is an ensemble of canonical parts linked together by an explicit homographic relationship. Through an optimization procedure, the model with the lowest residual error gives the class label to the query object, along with its localization and pose estimation. Inspired by the work described in [Lazebnik06], [Savarese07], and [Shan06], we propose a Spatially Enhanced Bags-of-Words (SEBW) approach. We elaborate on it in the following sections.
3 Spatially Enhanced Bag of Words Model We first describe the original formulation of bag-of-words [Li05], and then introduce our approach to create a Spatially Enhanced Bags-of-Words (SEBW) 3D model representation. 3.1 Bag of Words Model Let us use image categorization as an example to explain the bag-of-words model. Let N denote the total number of labels ("visual words") in the learned visual dictionary. The image can be represented as a vector of length N, in which the elements count the occurrences of the corresponding label. The procedure can be completed in three steps.
1. Local feature detectors are applied to the images to acquire low level features.
2. Visual words, denoted as the discrete set {V1, V2, ..., VN}, are formed by clustering the features into N clusters, so that each local feature is assigned to a discrete label.
Fig. 2. Spatially Enhanced Bags-of-Words (SEBW) method for 3D shape retrieval (best viewed with color)
3. The image contents are summarized with a global histogram ("bag-of-words"), denoted as a vector fv = (x1, x2, ..., xN), by counting the occurrences of each visual word.
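The three steps above can be sketched in a few lines of Python. The snippet below is an illustrative example only: the dictionary size, the descriptor dimension, and the use of scikit-learn's KMeans are our own choices and are not prescribed by the paper.

import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(features, n_words, seed=0):
    # Step 2: cluster low-level features into N visual words
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(features)

def bag_of_words(features, dictionary):
    # Step 3: summarize one image (or shape) as a histogram of word occurrences
    labels = dictionary.predict(features)          # nearest word for each descriptor
    return np.bincount(labels, minlength=dictionary.n_clusters).astype(float)

# toy usage: 1000 random 50-dimensional descriptors quantized against a 64-word dictionary
rng = np.random.default_rng(0)
dictionary = build_dictionary(rng.normal(size=(1000, 50)), n_words=64)
fv = bag_of_words(rng.normal(size=(120, 50)), dictionary)
print(fv.shape, fv.sum())                          # (64,) 120.0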
3.2 Spatially Enhanced Bags of Words Representation Rather than using only a global histogram, this paper advocates using more than one histogram along with related spatial information to reveal the 3D shape in more detail. A schematic description of the approach is given in Figure 2. Specifically, after extracting low level features with spin images at oriented basis points, a visual dictionary is formed by clustering them in feature space. Then each 3D shape is partitioned into a predetermined number of regions by clustering the oriented basis points in 3D geometry space. A matrix is used to record the spatial relationship of the regions, while each region is represented with a histogram of visual words. Therefore, the 3D shape is recorded with the Spatially Enhanced Bags-of-Words (SEBW) model.
4 Shape Representation This section elaborates on the ideas introduced in the previous section. Blocks 1, 2, 3 and 4 are discussed in this section, and block 5 is covered in the next section. 4.1 Low Level Feature Extraction As shown in Figure 3, the spin image, which is invariant to rotation and translation, characterizes the local appearance properties around its basis point p within the support range r. It is a two-dimensional histogram accumulating the number of points located at the coordinates (α, β), where α and β are the lengths of the two orthogonal edges of the triangle formed by the oriented basis point p, whose orientation is defined by the normal n, and the support point q. We choose it as the low level feature descriptor in this paper.
Fig. 3. Extracting low level features with spin images
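For reference, the following Python sketch computes one spin image for a single oriented basis point. The 16x16 bin count and the support radius follow the parameters quoted in the next paragraphs, but the function itself is our own illustrative reconstruction of the standard spin-image construction [Johnson99], not the authors' code.

import numpy as np

def spin_image(p, n, support_points, r, bins=16):
    # p: (3,) basis point; n: (3,) surface normal at p; support_points: (Ns, 3)
    # r: support range (the paper uses r = 0.4R, with R the radius of the mesh)
    n = n / np.linalg.norm(n)
    d = support_points - p                                   # vectors from basis point to support points
    beta = d @ n                                             # signed elevation along the normal
    alpha = np.sqrt(np.maximum((d * d).sum(axis=1) - beta**2, 0.0))  # radial distance from the normal axis
    keep = (alpha <= r) & (np.abs(beta) <= r)                # keep only points inside the support range
    hist, _, _ = np.histogram2d(alpha[keep], beta[keep],
                                bins=bins, range=[[0, r], [-r, r]])
    return hist.ravel()                                      # one descriptor per oriented basis point

# toy usage: 50000 random support points around one basis point at the origin
rng = np.random.default_rng(1)
desc = spin_image(np.zeros(3), np.array([0.0, 0.0, 1.0]), rng.normal(size=(50000, 3)), r=2.0)
print(desc.shape)                                            # (256,) for a 16x16 spin image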
Because the 3D meshes may have large and small triangles, instead of calculating spin images based on the mesh vertices [Johnson99], a two-pass sampling procedure is performed here. Using a Monte-Carlo strategy [Osada02], for each 3D mesh, Nb oriented basis points p with normal n and Ns support points q are sampled uniformly on
the surface in two passes respectively, where Nb = 500 and Ns = 50000 [Liu06]. The other parameters of the spin image are defined as: 1) r = 0.4R, where R is the radius of the mesh; 2) the width and height of the spin images, set as w = h = 16. A large number of spin images are collected from the 3D shape database, and each mesh is represented with Nb spin images. 4.2 Visual Words Dictionary Construction With Nb x Nm spin images, where Nb is as defined previously and Nm is the number of 3D meshes used for building the visual words dictionary, the k-means algorithm is applied to agglomerate N clusters. This is similar to the procedure described in Section 3.1. Each spin image is therefore assigned the index of its nearest cluster. Other clustering algorithms [Moosmann08] could also be adopted for this task; further research is needed to analyze the effects of the various clustering methods. 4.3 Spatial Object Segmentation Although many sophisticated segmentation approaches [Podolak06] [Berretti06] could be exploited, for simplicity the k-means algorithm is used here to segment the 3D meshes. Evaluating the effects of different segmentation schemes in our 3D retrieval framework is a subject for future research. After step 1, the shape of the 3D mesh is sketched by a set of spin images located at the positions of the basis points in 3D geometry space. For each 3D mesh, spatial object segmentation is achieved by clustering the basis points into M clusters based on their spatial positions, where M is a predetermined number. 4.4 Spatially Enhanced Object Representation Once the object has been partitioned into M components, the associated spin images of the shape are also split into M groups. With reference to the visual words dictionary, whose vocabulary size is N, each spin image corresponds to a visual word. Therefore, each component is represented by a vector fv = (x1, x2, ..., xN) counting visual word frequencies. An M x N Feature Matrix FM depicts the appearance of the object, where

FM = [fv_1, fv_2, \ldots, fv_M]^T ,    (1)
We outline the geometric properties of the model with a matrix GM:

GM = \begin{bmatrix} g_{11} & g_{12} & \cdots & g_{1M} \\ g_{21} & g_{22} & \cdots & g_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ g_{M1} & g_{M2} & \cdots & g_{MM} \end{bmatrix}_{M \times M} ,    (2)
where gij is the Euclidean distance between the centers of components i and j, for i, j = 1, 2, ..., M. Therefore, each object is recorded with M visual word histograms associated with one Geometry Matrix, as shown in Figure 2.
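A compact Python sketch of building the pair (FM, GM) from the sampled basis points and their dictionary labels is given below. The variable names and the use of scikit-learn's KMeans for the spatial segmentation are illustrative choices consistent with Sections 4.2-4.4, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans

def sebw_representation(basis_points, word_ids, n_words, n_regions=10, seed=0):
    # basis_points: (Nb, 3) positions of the oriented basis points
    # word_ids: (Nb,) dictionary index of each spin image; n_words: dictionary size N
    km = KMeans(n_clusters=n_regions, n_init=10, random_state=seed).fit(basis_points)
    regions = km.labels_                                     # spatial segmentation (Section 4.3)

    FM = np.zeros((n_regions, n_words))                      # Eq. (1): one histogram per region
    for m in range(n_regions):
        FM[m] = np.bincount(word_ids[regions == m], minlength=n_words)

    centers = km.cluster_centers_                            # Eq. (2): pairwise center distances
    diff = centers[:, None, :] - centers[None, :, :]
    GM = np.sqrt((diff ** 2).sum(axis=-1))
    return FM, GM

# toy usage with 500 basis points and a 64-word dictionary
rng = np.random.default_rng(2)
FM, GM = sebw_representation(rng.normal(size=(500, 3)), rng.integers(0, 64, size=500), n_words=64)
print(FM.shape, GM.shape)                                    # (10, 64) (10, 10)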
5 Dissimilarity Measure When performing 3D shape retrieval, the spatially enhanced representation of the query shape is constructed online and compared with those in the database. A retrieval list is ordered according to the dissimilarity metric. In our paper, the metric is made up of two parts, formulated as:

Dis(O_A, O_B) = \alpha \cdot Dis_a(FM_A, FM_B) + (1 - \alpha) \cdot Dis_g(GM_A, GM_B) ,    (3)

Dis_a(FM_A, FM_B) = \sum_{i=1}^{M} \min_{\pi(i)} \big( dist(fv_i^A, fv_{\pi(i)}^B) \big) ,    (4)

Dis_g(GM_A, GM_B) = \sum_{i=1}^{M} \sum_{j=1}^{M} \big| g_{i,j}^A - g_{\pi(i),\pi(j)}^B \big| ,    (5)
where O_A and O_B are the two objects A and B respectively; α is the weight balancing the effects of the appearance features and the geometry features, satisfying 0 ≤ α ≤ 1; and dist(fv_i^A, fv_{\pi(i)}^B) is the distance between two feature vectors. The distance can be measured with the Kullback-Leibler divergence [Liu06], the cosine distance, the L1 or L2 distance, etc. The pseudo-code for measuring the dissimilarity between a 3D query object A and another 3D shape B from the database is listed as follows (a runnable sketch is given after this listing):

for each component i of A:
    find the component index pi(i) of object B giving the minimum feature distance dist_f
end
sum dist_f over all components i (eq. (4))
calculate the geometry distance (eq. (5))
measure the dissimilarity with eq. (3)

Then, for the query object, every object in the database is assigned a metric value, and a retrieval rank list is obtained accordingly.
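A minimal Python version of the dissimilarity computation is shown below. It follows the pseudo-code above, using the L2 distance for dist(.,.); the greedy correspondence pi, the weight alpha = 0.7, and all variable names are illustrative choices rather than the authors' exact implementation.

import numpy as np

def dissimilarity(FM_A, GM_A, FM_B, GM_B, alpha=0.7):
    M = FM_A.shape[0]
    # pairwise L2 distances between the component histograms of A and B
    d = np.linalg.norm(FM_A[:, None, :] - FM_B[None, :, :], axis=-1)
    pi = d.argmin(axis=1)                                    # correspondence pi(i): closest component of B
    dis_a = d[np.arange(M), pi].sum()                        # Eq. (4): appearance dissimilarity
    dis_g = np.abs(GM_A - GM_B[np.ix_(pi, pi)]).sum()        # Eq. (5): geometry dissimilarity
    return alpha * dis_a + (1.0 - alpha) * dis_g             # Eq. (3): weighted combination

# ranking a toy database against a query (random FM/GM pairs)
rng = np.random.default_rng(3)
query = (rng.random((10, 64)), rng.random((10, 10)))
database = [(rng.random((10, 64)), rng.random((10, 10))) for _ in range(5)]
scores = [dissimilarity(*query, FM, GM) for FM, GM in database]
print(np.argsort(scores))                                    # retrieval rank list (most similar first)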
6 Experiments The Princeton Shape Benchmark (PSB) [Shilane04] is chosen as the 3D shape database. It is divided into two sets: a training set and a testing set. We conduct the experiments on the test set, which contains 907 3D models belonging to 92 classes at the finest classification granularity. We compare our approach (SEBW) with the bag-of-words method (BW). For SEBW, the number of components is set to 10, and the dissimilarity weight is α = 0.7. Figure 4 shows the retrieval lists using model m92 of the PSB as the query. Comparing (a) with (b), it is obvious that with our SEBW method the wrong shape, a helicopter, is eliminated from the retrieval list. Table 1 lists the five statistics: Nearest Neighbor (NN), First Tier (FT), Second Tier (ST), E-Measure (E-M), and Discounted Cumulative Gain (DCG), as described in
(a) Retrieval results using BW method
(b) Retrieval results using SEBW method Fig. 4. Compare BW and SEBW with the first 9 retrieval results using the same query shape placed at the top left position
[Shilane04]. It shows that the Nearest Neighbor score improved with the spatial information. The Precision-Recall curves in Figure 5 also affirm that. However, the improvement is not large. The reasons could be: 1) the number of basis points is not sufficient to adequately cover the shape; 2) the segmentation method, k-means, is sensitive to initialization and not particularly robust across variations in the shape. A more sophisticated segmentation method might improve the result. Table 1. Retrieval statistics of the two retrieval methods
        NN      FT      ST      E-M     DCG
BW      0.335   0.173   0.247   0.155   0.446
SEBW    0.338   0.160   0.226   0.141   0.433
Fig. 5. The precision-recall curve of two methods: Bag-of-Words (BW), Spatially Enhanced Bags-of-Words (SEBW)
Fig. 6. How the parameters affect the retrieval precision: (a) effects of M; (b) effects of α
Two important parameters, which affect the retrieval precision, are investigated. The first parameter is the number of components M. To balance efficiency and effectiveness, M is limited to less than 11. Beyond a certain threshold, the precision remains relatively stable even if the number of components is increased; as shown in Figure 6-a, this threshold is 4. Note that when M = 1, SEBW is the same as BW, so BW can be regarded as a special case of SEBW. Fixing the number of regions M at 10, we then discuss the second parameter, the dissimilarity weight α defined in eq. (3). From the equation, we see that when α is too large the appearance features dominate the dissimilarity measure, while when α is too small the geometry features dominate; both decrease the retrieval precision. The experimental results verify this conclusion. As shown in Figure 6-b, when α is set larger than 0.9 or smaller than 0.2, the retrieval precision decreases rapidly. The highest retrieval precision is obtained with α set at 0.7.
7 Discussion In this paper, we propose a means of incorporating spatial information into the bag-of-words model for 3D shape retrieval. The enhanced method is compared with the original bag-of-words method, and two key parameters of the approach are discussed in detail. However, this is only an initial investigation into combining spatial information with the BW method. Besides using the simple distance matrix to infer geometric properties, more sophisticated geometric properties, or even the topological structure, can be explored in the future.
References [Akgul07] Akgul, C.B., Sankur, B., Yemez, Y., Schmitt, F.: Density-Based 3D shape descriptors. EURASIP Journal on Advances in Signal Processing ID 32503, 16 pages (2007) [Berretti06] Berretti, S., Bimbo, A.D., Pala, P.: Partitioning of 3D meshes using Reeb graphs. In: ICPR 2006 (2006) [Chen03] Chen, D.Y., Ouhyoung, M., Tian, X.P., Shen, Y.T.: On visual similarity based 3D model retrieval. In: Computer Graphics Forum (Eurographics 2003), vol. 03, pp. 223–232 (2003)
[Grauman05] Grauman, K., Darrell, T.: Pyramid match kernels: Discriminative classification with sets of image features. In: ICCV 2005 (2005) [Iyer05] Iyer, N., Jayanti, S., Lou, K., Kalyanaraman, Y., Ramani, K.: Three-dimensional shape searching: state-of-the-art review and future trends. Computer-Aided Design 37(2005), 509–530 (2005) [Johnson99] Johnson, A.E., Hebert, M.: Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. PAMI 21(5), 433–449 (1999) [Lazebnik06] Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: CVPR 2006, pp. 2169–2178 (2006) [Li05] Fei-Fei, L., Perona, P.: A Bayesian Hierarchical model for learning natural scene categories. In: CVPR 2005, pp. 524–531 (2005) [Liu06] Liu, Y., Zha, H., Qin, H.: Shape Topics: A Compact Representation and New Algorithms for 3D Partial Shape Retrieval. In: CVPR 2006, pp. 2025–2032 (2006) [Moosmann08] Moosmann, F., Nowak, E., Jurie, F.: Randomized clustering forests for image classification. PAMI 30(9), 1632–1646 (2008) [Ohbuchi08] Ohbuchi, R., Osada, K., Furuya, T., Banno, T.: Salient local visual features for shape-based 3D model retrieval. In: Proc. IEEE Shape Modeling International, pp. 93–102 (2008) [Osada02] Osada, R., Funkhouser, T., Chazelle, B., Dobkin, D.: Shape distributions. ACM Transactions on Graphics 21(4), 807–832 (2002) [Papadakis07] Papadakis, P., Pratikakis, I., Perantonis, S., Theoharis, T.: Efficient 3D shape matching and retrieval using a concrete radialized spherical projection representation. Pattern Recognition 40(2007), 2437–2452 (2007) [Ruggeri08] Ruggeri, M.R., Saupe, D.: Isometry-invariant matching of point set surfaces. In: Eurographics workshop on 3D object retrieval, 8 pages (2008) [Savarese07] Savarese, S., Fei-Fei, L.: 3D Generic Object Categorization, Localization and Pose Estimation. In: ICCV 2007, pp. 1–8 (2007) [Shan06] Shan, Y., Sawhney, H.S., Matei, B., Kumar, R.: Shapeme Histogram Projection and Matching for Partial Object Recognition. PAMI 28(4), 568–577 (2006) [Shilane04] Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The Princeton shape benchmark. In: Proc. of the Shape Modeling International 2004 (SMI 2004), vol. 04(00), pp. 167–178 (2004) [Siddiqi08] Siddiqi, K., Zhang, J., Macrini, D., Shokoufandeh, A., Bouix, S., Dickinson, S.: Retrieving articulated 3D models using medial surfaces. In: Machine Vision and Applications (MVA), vol. 19(4), pp. 261–275 (July 2008) [Sivic05] Sivic, J., Russell, B., Efros, A., Zisserman, A., Freeman, W.: Discovering object categories in image collections. In: ICCV 2005, pp. 370–377 (2005) [Vranic03] Vranic, D.V.: 3D model retrieval. Ph.D. Dissertation (2003)
Image Matching Using High Dynamic Range Images and Radial Feature Descriptors
Krishnaprasad Jagadish and Eric Sinzinger
Department of Computer Science, Texas Tech University, Box 43104, Lubbock, TX 79409-3104
krishna j [email protected], [email protected]
Abstract. Obtaining a top match for a given query image from a set of images forms an important part of the scene identification process. The query image typically is not identical to the images in the data set, with possible variations of changes in scale, viewing angle and lighting conditions. Therefore, features which are used to describe each image should be invariant to these changes. Standard image capturing devices lose much of the color and lighting information due to encoding during image capture. This paper uses high dynamic range images to utilize all the details obtained at the time of capture for image matching. Once the high dynamic range images are obtained through the fusion of low dynamic range images, feature detection is performed on the query images as well as on the images in the database. A junction detector algorithm is used for detecting the features in the image. The features are described using the wedge descriptor which is modified to adapt to high dynamic range images. Once the features are described, a voting algorithm is used to identify a set of top matches for the query image.
1 Introduction Image matching has been an active area of research in computer vision for several years. One important subdomain of image matching is scene identification, the selection of images of the same general scene but from different viewpoints. Two of the major difficulties in scene identification are handling changes in color and in view. Much work has been performed on affine covariant representations for handling large changes in view, but there is less research on handling large changes in color representation. With standard camera images, the color range is limited, so using color for accurate matching will not yield ideal results. Highlights or shadows in the scene will mask some of the objects that might form the invariant part of the scene. Also, standard images suffer from the fact that they need to be shot only when the lighting conditions are suitable. All these factors make image matching less accurate. The images obtained from standard cameras are called low dynamic range (LDR) images since a range of illumination is lost during encoding. Using LDR images for matching decreases the accuracy. High dynamic range (HDR) images constructed from LDR images can be used to improve the accuracy of image matching. When the LDR images are fused together, a higher dynamic range of illumination is obtained. Using the HDR image for image matching has a number of advantages. First,
Fig. 1. The first two images show LDR images which are not capable of showing both the indoor and outdoor objects. The third image shows a HDR image capable of displaying both indoor and outdoor information.
since there is more information available in the form of luminance and chrominance, the image matching algorithm can be modified to use this information to enhance the accuracy. Second, since the high dynamic range image captures all the contrast in the scene, the chance of missing objects in the scene due to the presence of a shadow or a highlight is minimized. HDR images provide more information than a LDR image and hence increase image matching accuracy. Figure 1 shows two LDR images which cannot represent the scene, while the HDR image shown in the third image shows all the objects in the scene. In order to improve image matching, both HDR and junction detection are applied in this work to construct a robust descriptor that encompasses both color and shape information. All LDRs are transformed into HDR representations. From the HDR representations, junctions are detected. The junctions are divided into wedges. Descriptors of both color, from the HDR representation, and shape, from locally oriented image gradients, are entered into a description database. Query images search for similar representations in the database and use a voting scheme to rank the potential matches based upon the number of matched descriptors.
2 Related Work 2.1 High Dynamic Range Imaging Standard cameras encode the dynamic range of illumination during image encoding to only 256 values, whereas the actual dynamic range of illumination is typically around 100,000. The LDR images captured by standard cameras lose a significant amount of color information. HDR images can be used to overcome the loss of dynamic range of illumination by fusing together a set of LDR images taken with varying exposure parameters. In our work, the LDR images are fused together using the Debevec and Malik [1] technique. The main idea of their technique is to use the correctly exposed pixel from one of the LDR images to form the HDR image. Though we are able to get all the information in the scene by using HDR images, we cannot display them on the current display devices such as CRT or the LCD monitors. These display devices are made to display only a limited range of values and thus a HDR image must be mapped to a displayable form without losing too much information. This is called the tone mapping problem. There exists several operators that perform tone mapping on the HDR images. Most of the tone mapping operators are modeled
based on the human visual system. They can be divided into four groups, namely local operators, global operators, frequency domain operators and gradient domain operators [2]. While local operators like [3] map the pixels based on their individual values, global operators like [4] use a global curve to tone map the HDR image. Frequency domain operators like [5] split the image into components and compress the high frequency components. Gradient domain operators like [6] scale the image according to its gradients at different scales. Tone mapped HDR images are displayable on current display devices and they provide much more information than a single LDR image. The gradient domain operator [6] has been used in this work. This operator scales the HDR image by applying a compressive function to the gradient field. The technique initially computes the gradient of an image, manipulates these gradients and then integrates them by solving a Poisson equation. The gradients are manipulated at different scales, and a scale factor for each pixel at each scale is given by the following equation:

\varphi_s(x, y) = \frac{\alpha}{\nabla G_s(x, y)} \left( \frac{\nabla G_s(x, y)}{\alpha} \right)^{\beta}
where α and β are user-defined parameters which control the trade-off between the amount of detail visible in the scene and the amount of compression that is applied, and ∇Gs(x, y) is the gradient of the image at scale s. 2.2 Local Feature Detectors For matching images with large changes, the detected features should be invariant under rotation, illumination and scale changes. There are detectors that are invariant under some of the above mentioned conditions but fail for others [7,8,9]. Schmid et al. [10] compare 5 local feature detectors. One of the detectors used in the evaluation, by Horaud et al. [11], extracts line segments from the image contours and groups the segments to find intersections which are claimed as interest points. An alternative approach is to use a signal-based detector such as those by Harris [9], Forstner [12] and Heitger [13]. Harris corner detectors estimate the difference in the image sample patches. They take the eigenvalues of the gradient matrix to determine if there is a large difference in the patch being sampled. If there exist two large deviations in the eigenvalues, a corner is identified. Schmid et al. [10] state that a good interest point detector satisfies two conditions, namely repeatability and information content. By repeatability, the technique should return the same interest points when applied to the same scene but with different scales, illuminations and/or viewing angles. Repeatability is defined as the percentage of the total observed points in both images. Through information content, the distinctiveness of the local grey level pattern in the image is taken into account. The detectors are evaluated for changes in viewing angle, scale and illumination. The work found the Harris corner detector to be the most invariant to affine changes among the evaluated detectors.
2.3 Local Feature Descriptors The current benchmark for feature detection and description is the Scale Invariant Feature Transform (SIFT) [14]. SIFT detects the invariant points and describes them with features that are invariant to rotation, illumination and scale changes. These invariances make SIFT an extremely efficient descriptor for image matching. It is a four-stage algorithm: in the first step, the key points are found by taking the difference of Gaussians of the image samples at varying scales. The second step minimizes the number of key points found by eliminating the unstable ones. In the third step, invariance to rotation is obtained by assigning each key point an orientation based on the local image gradient. The fourth step describes the key points using a histogram. Once all the key points are described, one can match the features of one image with the features of others to get a top score based on the Euclidean distances. Mikolajczyk and Schmid [15] evaluated 10 different descriptors. A SIFT descriptor is a 3D histogram of gradient location and orientation, where location is quantized into a 4 x 4 location grid and the gradient angle is quantized into eight orientations, resulting in a descriptor of dimension 128. The Gradient Location Orientation Histogram (GLOH) [15] is described as an extension to SIFT and is said to increase the robustness and distinctiveness of descriptors. A 272-bin histogram is obtained by quantizing the gradient angles into 16 bins, and the size of the descriptor is reduced by using PCA. Another approach, Shape Context, is similar to SIFT but is based more on edges, which are extracted using the Canny edge detector [16]. The descriptor is a 36-dimensional 3D histogram of the edge point locations and orientations. The PCA-SIFT descriptor [17] is a vector of image gradients in the x and y directions. The gradient regions are sampled at 39 x 39 locations, and hence the vector is of dimension 3042, which is reduced to 36 with PCA. Schmid et al. [15] tested the descriptors for viewpoint changes of approximately 50 degrees. They also tested with scale changes and with image rotations between 30 and 45 degrees. They concluded that GLOH performed better than SIFT in most cases, with the shape context descriptor coming third. Different descriptors exist with varying techniques and efficiency [18,19,20,21]. CSIFT [22] is an extension of SIFT but performs the matching on color images as well.
3 High Dynamic Range Representation This work performed image matching using HDR images with features determined by a junction detector [23] and descriptors built over the individual junction regions, or wedges. The HDR images are color corrected using [24] to increase the accuracy of matching. 3.1 HDR Construction An HDR image is constructed from the LDR images using the Debevec and Malik method [1]. The construction is a two step process, with the first step determining the nonlinear camera response curve, and the second step to remap the multiple samples of
the same scene to a corrected exposure. Determining the camera response curve can be done by solving the objective function for N values of Ei and 256 values of g:

O = \sum_{i=1}^{N} \sum_{j=1}^{P} w(Z_{ij}) \left[ g(Z_{ij}) - \ln(E_i) - \ln(\Delta t_j) \right]^2 + \lambda \sum_{z=Z_{min}+1}^{Z_{max}-1} \left[ w(z)\, g''(z) \right]^2    (1)
where P is the number of photographs, Ei gives the actual scene radiance, w(·) is a weighting function, g(·) gives the response curve of the camera, and Zij are the individual pixels. In order to obtain the HDR image, at least 3 LDR images of the same scene with varying exposures are necessary. Since the images are merged to form a single image, all of the images need to be aligned perfectly, with no movement in the scene. In order to keep the camera perfectly still when the photographs were taken, camera control software was used to automate the image acquisition at the different exposure times. In this work, 5 LDR images were acquired to construct each HDR image. The exposure times used for constructing the HDR images were 1/4000, 1/1000, 1/350, 1/40, and 1/2 seconds. 3.2 HDR Tone Mapping The dynamic range of illumination in the real world is very high. There will be shadows as well as highlights in a single scene. The range will be greater when the scene contains objects illuminated by the sun as well as objects that are illuminated indoors through artificial lighting. HDR capture techniques enable the acquisition of the entire dynamic range of the scene, but display devices are typically constrained to only 8 bits per color channel per pixel. There is thus a discrepancy between the wide range of illumination that can be captured and the small range of illumination that can be displayed. Downsampling the HDR image into an appropriate LDR representation for display is called the HDR tone mapping problem. This work uses Fattal's gradient domain tone mapping operator [6] to compress the dynamic range. The main reason for selecting Fattal's tone mapping operator is the importance in image matching of both preserving the details and enhancing the color in the scene. The α and β values were set at 0.1 and 0.786 respectively. These values were selected based upon trial and error. Figure 2 shows the LDR images with
Fig. 2. (a) A sample set of LDR images, and (b) the final HDR image
Fig. 3. Sample HDR images in the database
Fig. 4. The query image, a sample database image, and the color corrected database image
the varying exposures and the tone mapped image using the Fattal gradient domain compression [6]. This process was used for building the database. Different buildings on a university campus were shot at different times of day, and all the images were tone mapped. Some of the tone mapped HDR images are shown in Figure 3. 3.3 HDR Color Correction Once the HDR images were obtained and tone mapped, we performed color correction on the images based on the color characteristics of the query image. All the images then have the same characteristics as the query image, which helped in increasing the accuracy of the wedge descriptor. The color correction was based on the technique given in [24]. The query image and the images in the database are first converted to the lαβ color space. The l axis is considered to represent an achromatic channel, while α and β are considered to represent the yellow-blue and the red-green opponent channels. Color characteristics can be transferred from one image onto another when some distributions of the data points are moved from the source to the target image. According to Reinhard et al. [24], it is sufficient to move the standard deviation and the mean along each axis. This is done by the equations given in [24]: the data points are shifted and scaled, and the mean of the target image is added to obtain the new image. Figure 4 shows the images before and after color transfer.
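The mean and standard-deviation matching step of [24] reduces to a few lines once the images are in the lαβ space. The Python sketch below assumes the RGB-to-lαβ conversion is done elsewhere, and all array names are hypothetical.

import numpy as np

def transfer_color_stats(source_lab, target_lab):
    # Match the per-channel mean and standard deviation of source_lab (a database image)
    # to those of target_lab (the query image), as in Reinhard et al. [24].
    src = source_lab.reshape(-1, 3)
    tgt = target_lab.reshape(-1, 3)
    src_mean, src_std = src.mean(axis=0), src.std(axis=0)
    tgt_mean, tgt_std = tgt.mean(axis=0), tgt.std(axis=0)
    # subtract the source mean, rescale by the std ratio, then add the target mean
    return (source_lab - src_mean) * (tgt_std / (src_std + 1e-12)) + tgt_mean

# usage sketch: give every database image the query's color characteristics
# corrected_db = [transfer_color_stats(img_lab, query_lab) for img_lab in database_lab]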
4 Feature Representation 4.1 Junction Detector After the images are color corrected the next step is to perform the invariant key point selection. A junction detector by Sinzinger [23] has been used to identify key image features. The junction detector essentially detects the T, X, Y, and L junctions in an image and fits a model to the image region. A three phase algorithm is used to identify
Fig. 5. Same junction detected irrespective of the viewing angle change
Fig. 6. Invariant features detected in the image and being segmented into wedges
the actual junction points. In the first phase, the strongest edges around each point are selected. The energy minimization algorithm gives the optimal subset of these edges in the second phase. In the third phase, the strongest junctions in each neighborhood are identified. The advantage of using a junction detector over other key point detectors is that the area around the junctions gives more information than arbitrary corners. Junctions define the shape of buildings in a more precise manner than corners, and junction detection has high repeatability when it comes to architectural buildings. This can be clearly seen in Figure 5: the image shows the same junction being detected even when there is a change in the viewing angle. The junction detector splits the region around it into wedges formed by radial edges. As can be seen in Figure 6, the algorithm has detected the invariant junctions and has segmented the area around each point into wedges. 4.2 Wedge Descriptor The junction detector algorithm identifies key points and segments the local neighborhood into wedges, the information inside which is used for describing the wedges. A histogram-based approach is used to describe the local region. Each wedge is given a wedge descriptor, a 46-dimensional feature vector. The first 32 bins represent the shape information, while 14 bins represent the color and brightness information. The shape information within the wedge is given by polar image gradients [25]. Polar image gradients calculate the change in the intensity gradients along the radial and orientational directions, and they provide rotational invariance for the representation. The hue for the entire wedge is calculated according to the IHLS model [26]. The hue is an angular value and it is weighted by the saturation as defined in [27]. The reason for
Table 1. Division of the hue histogram

Color     Angular Value
red       330° to 30°
yellow    30° to 90°
green     90° to 150°
cyan      150° to 210°
blue      210° to 270°
magenta   270° to 360°
Fig. 7. The wedge descriptor for the wedge representing the sky. Notice the strong value at position 37, indicating the strength of the blue hue in the wedge.
saturation weighting is that hue values can become unstable at low saturation. The hue is also weighted with the average of the polar image gradients of that wedge, and hence it also includes shape information. Since the color information is the same when the same feature is detected and does not change with the viewing angle, it forms an important part of the description. This work divides the hue information into 6 bins according to the angular values given in Table 1. The luminance histogram is an 8-bin representation of the luminance values in the scene. It is also weighted with the saturation and with the average of the polar image gradients of that wedge. The luminance values are divided into 8 bins by dividing by the difference between the maximum and minimum luminance values in the scene. The wedge descriptor thus has 32 shape bins, 6 hue bins and 8 luminance bins, combining to form a 46-dimensional feature vector. A junction in an image and the corresponding histogram are shown in Figure 7.
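A sketch of how the 46 bins could be assembled is given below. The 32-bin polar-gradient shape histogram is taken as an input (its exact construction follows [25] and is not reproduced here), the hue bins follow Table 1, and since the table's magenta range overlaps red's wrap-around span, the code assigns 330°-360° to red, which is an assumption on our part.

import numpy as np

def hue_bin(hue_deg):
    # Map a hue angle (degrees) to one of the 6 bins of Table 1; red wraps around 0
    h = hue_deg % 360.0
    if h >= 330.0 or h < 30.0:
        return 0                                             # red (wrap-around, our assumption)
    return 1 + int((h - 30.0) // 60.0)                       # yellow, green, cyan, blue, magenta

def wedge_descriptor(shape_hist, hue_deg, saturation, luminance, grad_mag):
    # shape_hist: (32,) polar-gradient shape histogram of the wedge (computed elsewhere)
    # hue_deg, saturation, luminance, grad_mag: per-pixel values inside the wedge
    w = saturation * grad_mag.mean()                         # saturation and average-gradient weighting
    hue_hist = np.zeros(6)
    for h, wi in zip(hue_deg, w):
        hue_hist[hue_bin(h)] += wi
    # 8 luminance bins spanning the observed luminance range (a per-wedge simplification)
    lum_range = luminance.max() - luminance.min() + 1e-12
    lum_idx = np.minimum(((luminance - luminance.min()) / lum_range * 8).astype(int), 7)
    lum_hist = np.bincount(lum_idx, weights=w, minlength=8)
    return np.concatenate([shape_hist, hue_hist, lum_hist])  # 32 + 6 + 8 = 46 dimensions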
5 Voting Method Given a description of the features, the images are matched using a voting approach. Wedge descriptors are generated for all detected features in the query image. Then, each query wedge is compared to the wedge database. The difference between the query wedge and a database wedge is computed as the Euclidean distance between the respective histograms. Every match that is under a given threshold contributes a single vote for the image linked to the database wedge. The top matches are then returned.
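The voting step can be written directly from this description. The sketch below uses plain NumPy; the threshold and the number of returned matches are placeholder parameters, not values given in the paper.

import numpy as np

def vote_matches(query_wedges, db_wedges, db_image_ids, threshold, top_k=9):
    # query_wedges: (Q, 46) descriptors from the query image
    # db_wedges: (D, 46) descriptors pooled over all database images
    # db_image_ids: (D,) index of the image each database wedge came from
    votes = {}
    for q in query_wedges:
        d = np.linalg.norm(db_wedges - q, axis=1)            # Euclidean distance to every database wedge
        for img in db_image_ids[d < threshold]:              # every match under the threshold casts one vote
            votes[img] = votes.get(img, 0) + 1
    ranked = sorted(votes, key=votes.get, reverse=True)      # images ordered by vote count
    return ranked[:top_k]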
6 Results The data set for this work is composed of 63 HDR images taken at different times of the day. Many of the HDR images are of the same buildings, but under different lighting conditions or from different views. A Nikon D80 DSLR camera was mounted on a tripod and operated remotely from a laptop to minimize vibration and noise. For each of the HDR images, five LDR images were taken. The LDR images were then fused to form the HDR images. The same procedure was used to generate a set of 18 HDR query images. The proposed method was compared to SIFT, a well known benchmark in feature matching. SIFT identifies matches by finding the two nearest neighbors of each key point from the query image among those in the images in the data set, and a key point is matched only if the distance to the closest neighbor is less than 0.6 of the distance to the second-closest neighbor [14]. SIFT is provided with the query HDR images to match with the images in the data set. The query images could match multiple images in the database, for a total of 40 possible matches. For each query image, the number of top correct matches was determined. The results are given in Table 2. As can be seen in the table, there has been an increase in the number of matches by 15.0% over the benchmark SIFT approach. The increase can be attributed to (1) the effectiveness of the division into separate planar regions by the junction detector, and (2) the high definition of the color information obtained by using HDR images. Table 2. Comparison of image matching between SIFT matches on HDR images and the proposed method
                 SIFT    Proposed Method
Matches          25      31
Recall Rate      62.5%   77.5%
7 Conclusion and Future Work The proposed method increased the accuracy of image matching by incorporating high definition color from the HDR images into a wedge descriptor. Additionally, color correction eliminated the lighting differences in the images and contributed to the increased accuracy in the image matching. In addition, the junction detector performance was improved by using the HDR images, since the edges became better defined. The proposed scheme outperformed SIFT on a database of 63 HDR images by 15%. Though HDR imaging eliminated much of the lighting differences between different views of the same scene, there is a need to work on regions that are masked by shadows. The shadows can change throughout the day, but without detection of the shadows themselves it is difficult to suppress the effects of the shadows. Additionally, each HDR image took approximately 10 seconds to compute. Parallelizing the code could result in faster construction of the database as well as preprocessing of the query images.
References 1. Debevec, P., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: SIGGRAPH 1997, pp. 369–378 (1997) 2. Reinhard, E., Ward, G., Pattanaik, S., Debevec, P.: High Dynamic Range Imaging. Morgan Kaufmann Series, San Francisco (2006) 3. Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic tone reproduction for digital images. ACM Transactions on Graphics 21, 267–276 (2002) 4. Drago, F., Myszkowski, K., Annen, T., Chiba, N.: Adaptive logarithmic mapping for displaying high contrast scenes. EUROGRAPHICS 22, 419–426 (2003) 5. Durand, F., Dorsey, J.: Fast bilateral filtering for the display of high-dynamic-range images. ACM Transactions on Graphics 21, 256–257 (2002) 6. Fattal, R., Lischinski, D., Werman, M.: Gradient domain high dynamic range compression. ACM Transactions on Graphics 21, 249–256 (2002) 7. Smith, S.M., Brady, J.M.: SUSAN: A new approach to low level image processing. International Journal of Computer Vision 23, 45–78 (1997) 8. Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 790–799 (1995) 9. Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision Conference, pp. 147–151 (1988) 10. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. International Journal of Computer Vision 37, 151–172 (2000) 11. Horaud, R., Veillon, F., Skordas, T.: Finding geometric and relational structures in an image. In: Faugeras, O. (ed.) ECCV 1990. LNCS, vol. 427, pp. 374–384. Springer, Heidelberg (1990) 12. Forstner, W.: A framework for low level feature extraction. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 801, pp. 383–394. Springer, Heidelberg (1994) 13. Heitger, F., Rosenthaler, L., Von Der Heydt, R., Peterhans, E., Kuebler, O.: Simulation of neural contour mechanisms: from simple to end-stopped cells. Vision Research 32, 963–981 (1992) 14. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 15. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1615–1630 (2005) 16. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 509–522 (2002) 17. Ke, Y., Sukthankar, R.: PCA-SIFT: A more distinctive representation for local image descriptors. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 506–513 (2004) 18. Park, D., Jeon, Y., Won, C.: Efficient use of local edge histogram descriptor. In: ACM Workshops on Multimedia, pp. 51–54 (2000) 19. Gros, P.: Color illumination models for image matching and indexing. In: International Conference on Pattern Recognition, vol. 3, pp. 576–579 (2000) 20. Wong, K.M., Po, L.M., Cheung, K.W.: Dominant color structure descriptor for image retrieval. In: IEEE International Conference on Image Processing, vol. 6, pp. 365–368 (2007) 21. Setia, L., Teynor, A., Halawani, A., Burkhardt, H.: Image classification using cluster-cooccurrence matrices of local relational features. In: 8th ACM International Workshop on Multimedia Information Retrieval, pp. 173–182 (2006) 22. Abdel-Hakim, A., Farag, A.: CSIFT: A SIFT descriptor with color invariant characteristics. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1978–1983 (2006)
23. Sinzinger, E.: A model-based approach to junction detection using radial energy. Pattern Recognition 41, 494–505 (2008) 24. Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color transfer between images. IEEE Computer Graphics and Applications 21, 34–41 (2001) 25. Worthy, L., Sinzinger, E.: Scene identification using invariant radial feature descriptors. In: Workshop on Image Analysis and Multimedia Interactive Services, pp. 39–42 (2007) 26. Mitsunaga, T., Nayar, S.: Radiometric self calibration. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 1374–1381 (1999) 27. Hanbury, A.: Circular statistics applied to color images. In: Computer Vision Winter Workshop, pp. 124–131 (2003)
Random Subwindows for Robust Peak Recognition in Intracranial Pressure Signals
Fabien Scalzo, Peng Xu, Marvin Bergsneider, and Xiao Hu
Division of Neurosurgery, Geffen School of Medicine, University of California, Los Angeles
[email protected]
Abstract. Following recent studies, the automatic analysis of intracranial pressure (ICP) pulses seems to be a promising tool for forecasting critical intracranial and cerebrovascular pathophysiological variations during the management of many neurological disorders. The MOCAIP algorithm has recently been developed to automatically extract ICP morphological features. The algorithm is able to enhance the quality of ICP signals, to segment ICP pulses, and to recognize the three peaks occurring in an ICP pulse. This paper extends MOCAIP by introducing a generic framework to perform robust peak recognition. The method is local in the sense that it exploits subwindows that are randomly extracted from ICP pulses. The recognition process combines recently developed machine learning algorithms. The experimental evaluations are performed on a database built from several hundred hours of ICP recordings. They indicate that the proposed extension increases the robustness of the peak recognition.
1 Introduction
The management of many neurological disorders such as traumatic brain injuries relies on the continuous measurement of intracranial pressure (ICP). Following recent studies, the dynamics of the ICP signal seem to be related to intracranial compartment compliance and vascular physiology. Recent works (reviewed in [1]) have shown that variations of the ICP signal are linked to the development of intracranial hypertension and cerebral vasospasm, acute changes in the cerebral blood carbon dioxide (CO2) levels, and changes in the craniospinal compliance. Therefore, the automatic extraction of ICP morphological properties constitutes an essential step towards better monitoring, understanding and forecasting of intracranial and cerebrovascular pathophysiological changes. The reliable processing of continuous ICP signals is a very challenging problem, still beyond most state-of-the-art ICP analysis methods [2,3]. The MOCAIP algorithm [1] (Morphological Clustering and Analysis of ICP Pulse) has recently been developed for this purpose. The algorithm is capable of extracting morphological properties (i.e. features) of ICP pulses in real time. It relies on the fact that the ICP waveform is typically triphasic (i.e. three peaks in each ICP pulse). Its core is made of several modules that are able to enhance
ICP signal quality, to detect legitimate ICP pulses, and to recognize the three sub-components in each ICP pulse. During the peak recognition, MOCAIP relies on a Gaussian model to represent the prior knowledge about the position of each peak in the pulse. The assignation is chosen such that it maximizes the probability of observing the peaks given the prior distributions. This can be problematic because the position of the peaks across the pulses presents a large variation that is typically translated into large variance priors. This weakens the effectiveness of the peak recognition. Moreover the ICP pulse itself, which might contain useful values, is not exploited directly. MOCAIP was recently extended [4] using regression models instead of Gaussian priors during the peak recognition process. Those extensions demonstrated significant improvements in accuracy but require a large number of training samples. Moreover, since the entire pulse is used as input (referred to as a holistic, or global approach in pattern recognition), those regression models are sensitive to translations of the pulse, and are affected by local signal perturbations that typically occur in clinical data. The current work addresses these problems by introducing a novel extension of the MOCAIP algorithm. The key idea of the proposed framework is to exploit local subwindows extracted at random locations in ICP signals. This paper first presents, in Section 2, an overview of MOCAIP as well as its constituent parts. Then Section 3 describes the peak recognition model that is learned over a set of annotated training examples. Section 4 exposes the three different techniques that are used to classify ICP pulse subwindows. Finally, the effectiveness of the proposed extension is evaluated in Section 5 on a set of real clinical data.
2 MOCAIP Algorithm
This section provides an overview of MOCAIP [1], a recently developed algorithm to process the continuous ICP signal and to identify the three morphological properties (i.e. peaks) that typically occur. Peak recognition is achieved through three major tasks that are summarized below and described in the next paragraphs. The first task consists of robustly segmenting the continuous ICP signal into a sequence of legitimate individual ICP pulses. This is done using a pulse detection algorithm, a clustering algorithm and a filtering process that identifies legitimate pulses. The second task is to detect all the candidate peaks in each ICP pulse. Finally, the third task relates to the recognition of the three peaks among the detected candidates. Pulse Detection. This step aims at segmenting the continuous ICP signal into a sequence of individual ICP pulses. To this end, MOCAIP combines a pulse extraction technique [5] with ECG QRS detection [6], which first finds each ECG beat. ICP pulse detection achieves a 100% sensitivity and a 99% positive predictivity using lead II ECG sampled at 400 Hz [1]. Pulse Clustering. ICP recordings can be perturbed by several types of noise and artifacts. These include high frequency noise originating from measurement or amplifier devices, artifacts from patient coughing, and low quantization
level. Instead of applying ICP morphology analysis on individual pulses separately, a representative cleaner pulse is extracted from a sequence of consecutive ICP pulses. A hierarchical clustering approach [7] is used to find the largest ICP pulse cluster in the sequence. Morphological properties (i.e. peaks) will be extracted from the dominant pulse cluster of each sequence. Legitimate Pulse Recognition. Pulse clustering provides a way to eliminate temporary perturbations occurring in ICP signals. However, dominant pulse clusters might not correspond to a legitimate pulse. This is the case when the entire ICP recording segment is noise. To identify legitimate ICP pulses automatically, MOCAIP performs a filtering based on two tests that both use a second hierarchical clustering applied on the dominant pulses previously found. The first test exploits a reference library containing validated ICP pulses that have been manually extracted from multiple patients. A pulse is said to be legitimate if it belongs to a cluster whose average pulse correlates with any of the reference ICP pulses. The second test measures the coherence of a cluster using the average of the correlation coefficients between each member and the average pulse of the cluster. The dominant pulses of the clusters that fail both tests are considered illegitimate and excluded from further analysis. Detection of Candidate Peaks. Once a legitimate ICP pulse has been extracted, MOCAIP detects a set of peak candidates that intuitively correspond to curve inflections of the ICP signal. Each of them could be one of the three peaks. The extraction of these candidates relies on the segmentation of the pulse into concave and convex regions. This is done using the second derivative of the pulse. Typically, a peak corresponds to the intersection between a convex and a concave region. The detection process produces a pool of N peak candidates {a1, a2, . . . , aN}. Peak Recognition. The last task of the MOCAIP algorithm is to identify the three ICP peaks (p1, p2, p3) from the set of candidate peaks. Let Pi(aj), i ∈ {1, 2, 3}, denote the probability density function (PDF) of assigning aj to the i-th peak; each PDF is a Gaussian distribution estimated from peak locations previously detected on a set of reference ICP pulses. The peak assignation amounts to searching for the maximum of the following scoring function J(i, j, k) = P1(ai) + P2(aj) + P3(ak). To ensure the coherence of the assignment, it is subject to additional ordering constraints. In order to deal with missing peaks, an empty designation a0 is added to the pool of candidates. In addition, to avoid false designation, MOCAIP uses a threshold ρ such that Pi(ak) = 0, i ∈ {1, 2, 3}, k ∈ {1, 2, . . . , N}, if the probability of assigning ak to pi is less than ρ.
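The peak designation can be implemented as an exhaustive search over the candidate pool, including the empty designation a0. The Python sketch below is a simplified reconstruction: it assumes the ordering constraint means that the latencies of the assigned peaks must be strictly increasing, and the SciPy Gaussians in the toy example are placeholders for MOCAIP's learned priors.

import itertools
import numpy as np

def assign_peaks(cand_pos, priors, rho=1e-3):
    # cand_pos: (N,) latencies of the detected candidate peaks
    # priors: three callables P1, P2, P3 giving the prior probability of a candidate position
    # index 0 stands for the empty designation a0 (missing peak)
    N = len(cand_pos)
    P = np.zeros((3, N + 1))
    for i in range(3):
        p = np.array([priors[i](x) for x in cand_pos])
        p[p < rho] = 0.0                                     # probabilities below rho are zeroed
        P[i, 1:] = p

    best, best_score = (0, 0, 0), -np.inf
    for i, j, k in itertools.product(range(N + 1), repeat=3):
        pos = [cand_pos[c - 1] for c in (i, j, k) if c > 0]
        if any(b <= a for a, b in zip(pos, pos[1:])):        # ordering constraint on assigned peaks
            continue
        score = P[0, i] + P[1, j] + P[2, k]                  # scoring function J(i, j, k)
        if score > best_score:
            best_score, best = score, (i, j, k)
    return best, best_score

# toy usage: three Gaussian priors centered at 0.1, 0.2 and 0.3 s
from scipy.stats import norm
priors = [norm(0.1, 0.03).pdf, norm(0.2, 0.03).pdf, norm(0.3, 0.03).pdf]
print(assign_peaks(np.array([0.09, 0.21, 0.33, 0.5]), priors))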
3 Random Subwindows for Robust Peak Recognition
Holistic regression approaches to peak recognition [4], as well as the original MOCAIP [1], have the main drawback of being sensitive to translations and local
perturbations of the pulse. Therefore, those strategies limit the ability of the framework to cope with the complexity of the data and may lead to incorrect predictions of the peaks. This section focuses on these problems and introduces a robust peak recognition framework that is intended to extend MOCAIP. In contrast with holistic methods [4], the proposed framework is local in the sense that it exploits subwindows (i.e. subvectors) extracted from the ICP pulses (Section 3.1). During training (Section 3.2), a large number of these subwindows are extracted at random locations and scales from the ICP pulses. A classification model is constructed from these subwindows after they have been labelled with the closest peak in their neighborhood if any, or with a special additional label. To predict the most likely position of the three peaks within a previously unseen pulse, the framework extracts random subwindows in the pulse, and their labels are predicted by the classifier. Then the spatial distribution of each peak p1, p2 and p3 is estimated independently using the positions of their respective subwindows (labelled with their number, as explained in Section 3.3). This is done using Kernel Density Estimation (KDE). 3.1 Random Subwindows
The use of random subwindows to perform classification is similar in spirit to the work of Marée et al. [8], which has demonstrated excellent results on various image classification problems. This method extracts a large set of possibly overlapping subwindows of random length and at random positions from training ICP pulses. The size of each subwindow is randomly chosen between 50 and 150 ms. The position is then randomly chosen so that each subwindow is fully contained in the pulse. The same stochastic process is applied on the test ICP pulses. Each subwindow is resized to a fixed vector wi ∈ R^Sw of Sw real values using a bicubic interpolation (Sw = 50 in our experiments).
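As a rough illustration of this extraction step, the sketch below draws random subwindows and resizes them to Sw = 50 values. It is not the authors' code: the sampling rate fs (needed to convert the 50–150 ms lengths into sample counts) is an assumption, and simple linear interpolation stands in for the bicubic resizing mentioned above.

```python
import numpy as np

def extract_subwindows(pulse, n_sub, fs=400, sw=50, rng=None):
    """Extract n_sub random subwindows (50-150 ms long) from a 1-D pulse and
    resize each one to a fixed-length vector of sw samples."""
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = int(0.050 * fs), int(0.150 * fs)      # window length range in samples
    windows, positions = [], []
    for _ in range(n_sub):
        length = int(rng.integers(lo, hi + 1))
        start = int(rng.integers(0, len(pulse) - length + 1))   # fully inside the pulse
        seg = pulse[start:start + length]
        resized = np.interp(np.linspace(0, length - 1, sw), np.arange(length), seg)
        windows.append(resized)
        positions.append(start + length / 2.0)                  # keep the subwindow center
    return np.array(windows), np.array(positions)
```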
3.2 Learning
The learning phase amounts to constructing a classifier from labelled subwindows previously extracted from training ICP pulses. To this end, each subwindow wi ∈ W is assigned a label ci ∈ {1, 2, 3, 4} corresponding either to the closest peak (p1, p2 or p3), or to a special label 4 indicating that the position μp of the closest peak is beyond a certain distance θ. This is expressed formally as

∀ wi ∈ W,   ci = argmin_{p ∈ {1,2,3}} |μi − μp|  if |μi − μp| ≤ θ,   and   ci = 4 otherwise,   (1)

where μi and ci are the location and the label of the subwindow wi respectively, and {μ1, μ2, μ3} denote the positions of the three peaks within the pulse. From this training set of annotated subwindows {wi, ci}, any supervised machine learning algorithm can be used to learn a subwindow classification model. The classifier should then be able to predict the label cn of previously unseen subwindows wn.
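Equation (1) translates almost directly into code; the following is a minimal sketch that labels subwindow centers given the three annotated peak locations (the array layout and units are assumptions).

```python
import numpy as np

def label_subwindows(positions, peak_positions, theta):
    """Assign each subwindow the label of its closest peak (1, 2 or 3), or the
    special label 4 when no peak lies within distance theta (Eq. 1)."""
    peak_positions = np.asarray(peak_positions, dtype=float)   # [mu_1, mu_2, mu_3]
    labels = []
    for mu_i in positions:
        d = np.abs(mu_i - peak_positions)
        p = int(np.argmin(d))
        labels.append(p + 1 if d[p] <= theta else 4)
    return np.array(labels)
```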
Fig. 1. Illustration of the peak recognition process. A large number of random subwindows are first extracted from the ICP pulse. Their label is predicted from the classifier. Then, for each of the three classes of peak, a spatial distribution is estimated using the position of the subwindows. The maximum response corresponds to the peak position.
In summary, the input of the classifier consists of NW = Np × Nr subwindows (where Np is the number of pulses and Nr is the number of subwindows per pulse), each of which is described by Sw real-valued input variables and a discrete output class ci ∈ {1, 2, 3, 4}.

3.3 Recognition
Peak recognition on previously unseen pulses can intuitively be thought of as a two-stage process: labels are first assigned to randomly extracted subwindows using the classification model; then the spatial distribution of each peak p1, p2 and p3 is inferred from the labels and the positions at which the subwindows were extracted. These two phases are detailed below and illustrated in Figure 1. During the first phase, subwindows are extracted randomly and resized following the procedure previously explained. The label ci ∈ {1, 2, 3, 4} of each subwindow wi ∈ R^Sw is predicted individually from the learned classification model. After this step, the pulse is summarized as a set Θ of labelled subwindows {wi, ci} together with the positions μi ∈ R within the pulse at which they were extracted, such that Θ = {wi, ci, μi}_{i=1...NW}. The second phase aims at estimating a spatial distribution for each peak (p1, p2 and p3) given the annotated data Θ. This is done using a Kernel Density Estimation (KDE) (detailed in the next paragraph) that fits a distribution to the observed data. The distribution for a given peak pk, k ∈ {1, 2, 3}, is estimated by considering the subset of positions {μ_i^k}_{i=1...N_k} of the subwindows labeled as
ci = k. Finally, the position of each peak pk is set by evaluating the estimated distribution f̂(x; μ^k_{i=1...N_k}) over a set of points (a regular grid covering the pulse) and extracting the position that obtains the highest response.

Kernel Density Estimation. An attractive method for modeling probability distributions from sample points is kernel-based density estimation (KDE), or Parzen window density estimation [9]. In kernel density estimation, a kernel function K : R^d → R is used to smooth a set of observed samples into a continuous density estimate. For n samples μ1, . . . , μn, the density estimate is written as

f̂(x; μ_{i=1...n}) = (1 / nh) Σ_{i=1}^{n} K((x − μi) / h),   x ∈ R^d,   (2)

where h denotes the bandwidth (i.e. covariance) and controls the smoothness of the resulting density estimate. In this formulation, the bandwidth parameter h is fixed, so that it is held constant across x and the μi's. The most common kernel function is the isotropic Gaussian kernel, K(x; μi, h) = G(x; μi, h), where d is the dimensionality of the data and μi is the mean of the kernel. The bandwidth h is set using the rule of thumb [9]: h² = 1.05 σ N^(−2/5), where σ = (1/N) Σ_i (xi − μ)² and μ = (1/N) Σ_i xi.
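For concreteness, here is a small sketch of the recognition step: a 1-D Gaussian KDE is fit to the positions of the subwindows labelled with a given peak, and the peak location is read off as the argmax over a regular grid. The bandwidth uses a Silverman-style rule of thumb (the exact constant in the text above is a reconstruction, so treat the value here as indicative only).

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Fixed-bandwidth Gaussian kernel density estimate evaluated on a grid."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    sigma = samples.std() + 1e-12
    h = 1.05 * sigma * n ** (-1.0 / 5.0)          # Silverman-style bandwidth
    diffs = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

def locate_peak(positions, labels, k, pulse_len, step=1.0):
    """Position of peak k = argmax of the KDE fit to subwindows labelled k."""
    positions, labels = np.asarray(positions), np.asarray(labels)
    pts = positions[labels == k]
    if len(pts) == 0:
        return None                                # peak treated as missing
    grid = np.arange(0.0, pulse_len, step)
    return grid[int(np.argmax(gaussian_kde_1d(pts, grid)))]
```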
4 Subwindow Classification
This section describes three different classifiers that are used in this framework to classify subwindows of ICP pulses. Besides a KD-Tree Nearest Neighbor classifier that was used as a baseline algorithm, Support Vector Machines (SVM) and recently developed Extremely Randomized Decision Trees are successively described.

4.1 Nearest Neighbor
A simple K-Nearest Neighbor (k-NN) classifier was implemented using K-dimensional trees (KD-trees) to perform the search more efficiently. The reason for using a Nearest Neighbor classifier is that it provides a very simple and fast way to obtain a classifier for nonlinear data, at a low training cost.
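A minimal baseline along these lines can be obtained with an off-the-shelf KD-tree k-NN; the snippet below uses scikit-learn and random placeholder data in place of the real subwindow matrix and labels, and the number of neighbors is an illustrative choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholders standing in for the (NW x Sw) subwindow matrix and labels {1..4}.
rng = np.random.default_rng(0)
W, c = rng.normal(size=(1000, 50)), rng.integers(1, 5, size=1000)

knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
knn.fit(W[:700], c[:700])
print('k-NN accuracy:', knn.score(W[700:], c[700:]))
```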
4.2 Extremely Randomized Decision Trees
Extremely Randomized Decision Trees (Extra-Trees) [10] is a recently developed machine learning method that extends classical decision trees by introducing randomness during the induction process. Similarly to other highly randomized approaches, a large ensemble of these Extremely Randomized Trees is constructed. During classification, the predicted classes of the trees are aggregated and the majority class is selected as the final prediction. Extra-Trees consist of an ensemble of binary decision trees. Their internal nodes are labeled by a cut-point (i.e. a threshold λj) defined on an input attribute j ∈ {1, 2, . . . , Sw} that is tested in that node.
In the classical decision tree induction algorithm, the tree is built by searching at each node for the test that maximizes a score measure (e.g. Shannon information). In Extra-Trees, the test attribute j is selected randomly and its threshold λj is also chosen randomly according to a Gaussian distribution N(xj, σj), where xj, σj are the mean and standard deviation of attribute j computed on the training samples available at this node. The induction algorithm of Extra-Trees can be summarized as follows. A top-down process builds the tree by successively splitting the leaves where the output variable varies. For a randomly selected attribute, the algorithm selects a random threshold depending on its distribution. The construction stops when the training subwindows at a given node all belong to the same class. To classify a subwindow through an Extra-Trees model, the subwindow is independently classified by each tree (in our experiments, 50 trees were used). The predicted classes originating from the trees are collected into a voting table and the class ci ∈ {1, 2, 3, 4} that obtains the majority of votes is assigned to the subwindow wi.
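An ensemble of 50 randomized trees is readily available off the shelf; the sketch below uses scikit-learn's ExtraTreesClassifier on placeholder data. Note that scikit-learn draws cut-points uniformly within each attribute's range rather than from the Gaussian N(xj, σj) described above, so this is only an approximation of the variant used here.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Placeholders standing in for the labelled subwindows of Section 3.2.
rng = np.random.default_rng(0)
W, c = rng.normal(size=(1000, 50)), rng.integers(1, 5, size=1000)

forest = ExtraTreesClassifier(n_estimators=50, random_state=0)  # 50 trees, as above
forest.fit(W[:700], c[:700])
print('Extra-Trees accuracy:', forest.score(W[700:], c[700:]))
```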
4.3 Support Vector Machines
A Support Vector Machine (SVM) is a supervised machine learning technique that has been used extensively in a wide range of pattern recognition applications. The idea behind SVM is to learn the optimal separating hyperplane in the feature space (i.e. subwindow space) such that it minimizes the error rate on the training data. Given a training set of labelled samples (xi, yi), where xi ∈ R^Sw stands for a subwindow, or more generally for an input vector, and yi ∈ {1, −1} is the corresponding class label (for a two-class problem), SVM aims at finding the optimal separating hyperplane that minimizes the misclassification rate on the training set while maximizing the sum of distances of the training samples from this hyperplane. Formally, this problem amounts to finding the parameter α such that

argmin_α  (1/2) α^T Q α − e^T α   (3)
subject to  y^T α = 0,  0 ≤ αi ≤ C,  i = 1, . . . , n,

where e is the vector of all ones and Q is the matrix with Qij = yi yj K(xi, xj), where K(xi, xj) = ψ(xi)^T ψ(xj) is called the kernel function. In our framework, ψ maps the input features into another space (of larger dimensionality) in which the samples are hopefully linearly separable, through the RBF kernel

K(xi, xj) = exp(−γ ||xi − xj||²),   γ > 0.   (4)

The parameter C in Equation 3 essentially controls the amount of penalty on the error term during the minimization (i.e. the sum of distances of all training vectors from the optimal separating hyperplane).
The discriminant function of this SVM classifier is written as a weighted sum of the distances between a new input vector x and each of the training vectors xi in the kernel space, f(x) = Σ_{i=1}^{n} yi αi K(x, xi) + b. Unfortunately, the notation in Equation 3 does not apply to our four-class classification task yi ∈ {1, 2, 3, 4}. To overcome this problem, a "one-against-one" approach [11] is exploited, such that Nc = k(k − 1)/2 classifiers are constructed. Each of the Nc classifiers trains a model on data from two different classes.
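In practice this corresponds to a standard multi-class RBF SVM; the sketch below relies on scikit-learn's SVC, which internally trains the same k(k − 1)/2 one-against-one classifiers. The data are placeholders and the C and γ values are illustrative, not the ones used here.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholders standing in for the labelled subwindows of Section 3.2.
rng = np.random.default_rng(0)
W, c = rng.normal(size=(1000, 50)), rng.integers(1, 5, size=1000)

svm = SVC(kernel='rbf', C=10.0, gamma=0.1)   # one-against-one multi-class by default
svm.fit(W[:700], c[:700])
print('SVM accuracy:', svm.score(W[700:], c[700:]))
```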
5 Experiments
In this section, the effectiveness of the proposed extension is evaluated. The experimental evaluation has three goals. First, it aims at measuring, in Section 5.2, the accuracy of the subwindow classification for the different methods (NN, SVM and Extra-Trees). The second goal is to evaluate the precision, in terms of peak position, of the predictions obtained using Kernel Density Estimation (KDE). Finally, it quantifies the improvement over the original MOCAIP peak recognition. To this end, datasets of ICP pulses of increasing difficulty are generated automatically by artificially altering the pulses with two different transformations, as detailed in the next section.

5.1 Dataset
The data used in our experiments originate from the dataset used for the evaluation of MOCAIP [1], in which 13611 dominant ICP pulses were extracted from the recorded ICP signals of 66 patients. From that dataset, 1000 dominant ICP pulses were selected randomly and artificially perturbed to produce 20 new datasets. The first 10 datasets were produced by adding random values v ∈ [−r, +r] to the pulse signal, where r varies uniformly from 0 to 0.3, yielding 10 datasets of increasing difficulty. The other 10 datasets were created by translating each pulse by t ms, where t varies from 1 to 100 ms, again producing increasingly challenging datasets.
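The perturbations can be reproduced along the following lines. This is a sketch, not the authors' generation script: the noise amplitudes and translation amounts are taken from the values visible on the axes of Figure 4, the sampling rate fs is an assumption, and a circular shift is used as a simple stand-in for the translation.

```python
import numpy as np

def perturb(pulses, r=0.0, shift_ms=0, fs=400, rng=None):
    """Return copies of the pulses with uniform noise in [-r, +r] added and/or
    a circular shift of shift_ms milliseconds applied."""
    rng = np.random.default_rng() if rng is None else rng
    out = []
    for p in pulses:
        q = p + rng.uniform(-r, r, size=len(p))
        q = np.roll(q, int(round(shift_ms * fs / 1000.0)))
        out.append(q)
    return out

pulses = [np.sin(np.linspace(0, np.pi, 400))] * 5          # placeholder pulses
noise_levels = [0.01, 0.04, 0.07, 0.11, 0.14, 0.17, 0.20, 0.24, 0.27, 0.30]
shifts_ms = [1, 2, 3, 5, 8, 13, 22, 36, 60, 100]
noise_sets = [perturb(pulses, r=r) for r in noise_levels]
shift_sets = [perturb(pulses, shift_ms=t) for t in shifts_ms]
```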
5.2 Subwindow Classification Accuracy
The first experiment evaluates the ability of the classifiers to predict the label (1, 2, 3, 4) of new subwindows. Their accuracy, in terms of percentage of correct labels, is reported for Nearest Neighbor (NN), Support Vector Machines (SVM) and Extremely Randomized Decision Trees (Extra-Trees). The experimental protocol consists of a three-fold cross-validation performed on the 1000 ICP pulses. In other words, for each fold, 666 randomly selected pulses are used for training and 334 are used as a test set. The partitioning is made such that each pulse appears at least once in the test set. For each pulse, 500 subwindows are extracted randomly (five of them are illustrated in Figure 2). The Extra-Trees algorithm performs best and obtains an average recognition rate of 93.6%, Support Vector Machines (SVM) obtain 91.5%, and Nearest Neighbor (NN) does not perform as well, with a recognition rate of 88.2%.
Fig. 2. Illustration of five random subwindows extracted from an ICP pulse
Fig. 3. The peak recognition process is illustrated on nine different ICP pulses. Color dots on the ICP pulse indicate the locations of the subwindows labelled as a peak class (1-green, 2-blue or 3-red). The distributions estimated from these sample points are shown on the bottom and the location of their maximum is used to set the position of each peak.
5.3 Predicting the Peak Positions
Using the protocol presented in the previous section, the accuracy of the predicted peak positions is evaluated. The three classifiers mentioned above (NN, SVM and Extra-Trees) are compared to two recently developed global regression approaches (SR, KSR) [4]. As shown in Figure 4, using subwindows greatly improves the robustness of the prediction when the pulses are perturbed with translations. Under random noise perturbations, the results are close for
Fig. 4. Average prediction error (in ms) of the peaks for data perturbed by random noise (left) and translation (right). In contrast with global regression methods (SR, KSR), local subwindows are invariant to the translation of the pulse.
KSR, SVM and Extra-Trees. However, the random noise has a strong effect on the global linear regression model (SR). It is also important to note that global regression approaches require at least 500 ICP pulses for training, whereas the proposed approach has no such requirement.
6 Conclusion
We described an extension to a recently developed algorithm that extracts morphological features from ICP signals. The proposed framework exploits local subwindows, randomly extracted from ICP pulses, and is able to predict the position of each peak in a probabilistic way. The experimental results are on par with or above those of existing methods and highlight the contribution of the proposed approach in the presence of perturbed data. Interestingly, this method is generic enough to be applied to different peak detection problems and is not limited to ICP signals.
References 1. Hu, X., Xu, P., Scalzo, F., Miller, C., Vespa, P., Bergsneider, M.: Morphological Clustering and Analysis of Continuous Intracranial Pressure. IEEE Transactions on Biomedical Engineering (submitted, 2008) 2. Takizawa, H., Gabra-Sanders, T., Miller, J.D.: Changes in the cerebrospinal fluid pulse wave spectrum associated with raised intracranial pressure. Neurosurgery 20(3), 355–361 (1987) 3. Cardoso, E.R., Rowan, J.O., Galbraith, S.: Analysis of the cerebrospinal fluid pulse wave in intracranial pressure. J. Neurosurg. 59(5), 817–821 (1983) 4. Scalzo, F., Xu, P., Bergsneider, M., Hu, X.: Nonlinear regression for sub-peak detection of intracranial pressure signals. In: IEEE Int. Conf. Engineering and Biology Society (EMBC) (2008)
5. Hu, X., Xu, P., Lee, D., Vespa, P., Bergsneider, M.: An algorithm of extracting intracranial pressure latency relative to electrocardiogram R wave. Physiological Measurement (2008) 6. Afonso, V.X., Tompkins, W.J., Nguyen, T.Q., Luo, S.: ECG beat detection using filter banks. IEEE Trans. Biomed. Eng. 46(2), 192–202 (1999) 7. Kaufman, L., Rousseeuw, P.J.: Finding groups in data: an introduction to cluster analysis. Wiley series in probability and mathematical statistics. Wiley, Hoboken (2005) 8. Marée, R., Geurts, P., Piater, J., Wehenkel, L.: Random subwindows for robust image classification. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 34–40 (2005) 9. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, Boca Raton (1986) 10. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006) 11. Knerr, S., Personnaz, L., Dreyfus, G.: A stepwise procedure for building and training a neural network. Springer, Heidelberg (1990)
A New Shape Benchmark for 3D Object Retrieval Rui Fang1 , Afzal Godil1 , Xiaolan Li1,2 , and Asim Wagan1 1
National Institute of Standards and Technology, Maryland, U.S.A 2 Zhejiang Gongshang University, P. R. China {rfang,godil,lixlan,wagan}@nist.gov
Abstract. Recently, content based 3D shape retrieval has been an active area of research. Benchmarking allows researchers to evaluate the quality of results of different 3D shape retrieval approaches. Here, we propose a new publicly available 3D shape benchmark to advance the state of the art in 3D shape retrieval. We provide a review of previous and recent benchmarking efforts and then discuss some of the issues and problems involved in developing a benchmark. A detailed description of the new shape benchmark is provided, including some of its salient features. In this benchmark, the 3D models are classified mainly according to visual shape similarity but, in contrast to other benchmarks, the geometric structure of each model is modified and normalized, and each class in the benchmark contains an equal number of models to reduce possible bias in evaluation results. In the end we evaluate several representative algorithms for 3D shape searching on the new benchmark, and a comparison experiment between different shape benchmarks is also conducted to show the reliability of the new benchmark.
1 Introduction
With the increasing number of 3D models created and available on the Internet, many domains have their own 3D repositories, such as the national design repository for CAD models [1], the Protein Data Bank for biological macromolecules [2], CAESAR for Anthropometry [3], the AIM@SHAPE shape repository [4] and the INRIA-GAMMA 3D Database [5] for research purposes, etc. Effectively searching a 3D repository for 3D shapes which are similar to a given 3D query model has become an important area of research. Traditional text based search engines are widely used in many 3D repositories to retrieve 3D shapes. Those text based search strategies only allow users to search 3D models by inputting keywords, file type or file size, or by just browsing thumbnails of the 3D models, which may not meet all demands. Because a text based method requires manually annotating the shapes, which may introduce imprecision, it might not be able to properly represent the shape information. Furthermore, annotating a huge number of 3D models by hand is inefficient, inaccurate and impractical. In addition, the annotation of shapes is incomplete or not available in many cases. Thus, a number of 3D shape based search engines have been investigated to address this problem [6], [7], [8], [9], [10], etc.
Shape based search methods do not require any manual annotation. The only thing they require is the shape descriptor which can automatically extract shape features and properly represent the shape information. It has been reported that the performances of shape based searching methods outperform text based methods [11]. For more about shape based searching algorithms, we refer readers to several surveys [12], [13], [14]. With a number of shape based retrieval methods appearing in the current literature, the question now is how to evaluate shape retrieval algorithms rationally with high confidence. Benchmarking is one answer to this question. Just like the methods used for evaluating text based [15], image based [16], [17], video based [18] and music based retrieval methods [19], [20], under a shape benchmark, different shape based retrieval algorithms are compared and evaluated in the same experimental environment by the same measurement tools from different aspects. Comparable results are then obtained and conclusions about their performance are drawn. By doing this, one then gets a good understanding of each algorithm and can judge which algorithm should be applied in a specific circumstance. In section 2, the related work of previous benchmarks is briefly reviewed; in section 3, we discuss the construction of the benchmark; in section 4, we discuss the evaluation measures used; in section 5, the experiment results of different algorithms on the new benchmark are reported and analyzed; in section 6, the reliability of the new shape benchmark is discussed, and finally conclusions are drawn in section 7.
2 Related Work
The SHape REtrieval Contest (SHREC) [21] is organized every year since 2006 by Network of Excellence AIM@SHAPE to evaluate the effectiveness of 3D shape retrieval algorithms. Many tools are also provided to compare and evaluate 3D retrieval methods. In 2006, one track was conducted to retrieve 3D mesh models on the PSB [22]. From 2007, several tracks were conducted which focused on specialized problems: the watertight models track, the partial matching track, the CAD models track, the protein models track, the 3D face models track, the track on stability of watertight models, the track on the classification of watertight models and the generic models track [23], [21]. The Princeton Shape Benchmark (PSB) is a publicly available database of 3D polygonal models with a set of software tools that are widely used by researchers to report their shape matching results and compare them to the results of other algorithms [24].The Purdue engineering shape benchmark (ESB) [25] is a public 3D shape database for evaluating shape retrieval algorithms mainly in the mechanical engineering domain. The McGill 3D shape benchmark [26] provides a 3D shape repository which includes a number of models with articulating parts. Other current shape benchmarks were introduced and analyzed in [24]. Although previous shape benchmarks have provided valuable contributions to the evaluation of shape matching algorithms, there are some limitations to evaluate general purpose shape matching algorithms.
– Domain dependent benchmarks like the ESB and the McGill benchmark can only be used to evaluate retrieval algorithms in their respective domains.
– A number of classes in the PSB (basic classification) contain too few models. A benchmark database must have a reasonable number of models in each of its classes; five or ten is too few to get a statistically sound evaluation [27]. Taking the base classification in the PSB as an example, in the training set there are 90 classes, 55 of which have no more than 10 models and 27 of which have no more than 5 models; in the testing set there are 92 classes, 59 of which have no more than 10 models and 28 of which have no more than 5 models.
– An unequal number of 3D models in each class could cause bias when using the benchmark to evaluate different shape retrieval algorithms. Some authors [28] reported their results on the PSB and concluded that the quality of the retrieval results depends on the size of the query model's class: the higher the cardinality of a class, the more significant the results will be (Table 4 in [28]). Similar results were also reported in Table 5 of [24] and Table 3 of [29]. Sometimes it is hard to decide which factor affects the results: the discriminative power of the algorithm on different model classes, or the number of models in these classes. For example, suppose there are two algorithms A and B, and there are two kinds of 3D models: 5 birds and 30 cars; we want to evaluate the discrimination ability of the two algorithms. Further assume that algorithm A is good at discriminating birds and algorithm B is good at discriminating cars. Then an evaluation based on the given 3D models would favor B due to the unequal number of models in the two classes. For this reason, in the SHREC 2007 watertight models track, all classes in the database were made up of the same number of objects to keep the generality of each query [23].
– In addition, models in some benchmarks have mesh errors like inconsistent normals, degenerated surfaces, duplicate surfaces, intersecting surfaces that do not enclose a volume, etc.
To overcome some limitations of the current benchmarks, we propose a new shape benchmark with a set of tools for direct comparison and analysis of general purpose shape matching algorithms.
3 Benchmark Design Principle
There are two main steps to benchmark a shape database, the first of which is to get enough 3D shape models. All the 3D models in the new shape benchmark were acquired by the web crawler Methabot [30] from major 3D repositories on the Internet. We have obtained permission to freely redistribute all these models only for research purposes. The other step is to classify the 3D shape models into a ground truth database; we discuss it below in detail. As in the shape retrieval problem, we retrieve 3D objects solely according to their shape information other than their color, texture, etc., so in the benchmark, we use ASCII Object File Format (*.OFF) to represent the shape information of
Fig. 1. Normalization procedure (left), a model with inconsistent normal orientation in some benchmarks (middle), and the same model with consistent normal orientation in the new benchmark (right)
each model, which consists of polygons, that is, the coordinates of points and the edges that connect these points. This format has the benefit of simplicity and only contains shape information, which allows us to concentrate on the shape itself. 3D models downloaded from websites are in arbitrary position, scale and orientation, and some of them have many types of mesh errors [31]. Shapes should be invariant to rotation, translation and scaling, which requires pose normalization before many shape descriptors can be applied to extract shape features. Unfortunately, few previous shape benchmarks have performed this simple but important step. For this purpose, in the benchmark database, every model is normalized: all the polygons are triangulated, scaled to the same size, translated to the center of mass, and rotated to the principal axes. Figure 1 (left side) gives an example of the normalization procedure. We also partially resolve some mesh errors, such as the inconsistent normal orientation problem. In Figure 1, the model in the middle was taken from some existing benchmarks; it has inconsistent orientation of polygons on the same surface (dark and white areas appear on the same surface). The model on the right side of Figure 1 shows the same model in our new shape benchmark, with consistent orientation of polygons on the same surface.
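A PCA-based normalization along these lines might look as follows. This is only a sketch under simplifying assumptions: the vertex centroid stands in for the true center of mass, and scaling to the unit sphere is one possible convention; the benchmark's exact choices may differ.

```python
import numpy as np

def normalize_mesh(vertices):
    """Center, scale and rotate an (n x 3) vertex array onto its principal axes."""
    v = np.asarray(vertices, dtype=float)
    v = v - v.mean(axis=0)                        # vertex centroid as center of mass
    v = v / np.max(np.linalg.norm(v, axis=1))     # scale so the model fits the unit sphere
    eigval, eigvec = np.linalg.eigh(np.cov(v.T))  # principal axes of the point cloud
    return v @ eigvec[:, ::-1]                    # largest-variance axis first
```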
3.1 Building a Ground Truth for the Benchmark
The purpose of benchmarking is to establish a known and validated ground truth to compare different shape matching algorithms and evaluate new methods by standard tools in a standard way. Building a ground truth database is an important step of establishing a benchmark. A good ground truth database should meet several criteria [32], like having a reasonable number of models, being stable in order to evaluate different methods with relatively high confidence, and having certain generalization ability to evaluate new methods. To get a ground truth dataset, in text retrieval research, TREC [15] uses pooling assessment [32] . In image retrieval research, as there is no automatic way to determine the relevance of an image in the database for a given query image [33], the IAPR benchmark [16] was established by manually classifying images into categories. In image processing research, the Berkeley segmentation dataset and benchmark [34] assumes that the human segmented images provide valid ground truth boundaries, and all images are segmented and evaluated by a group of people. In shape retrieval research, the PSB manually partitioned the
Table 1. The 40 classes used in the new shape benchmark: 1 Bird, 2 Fish, 3 NonFlyingInsect, 4 FlyingInsect, 5 Biped, 6 Quadruped, 7 ApartmentHouse, 8 Skyscraper, 9 SingleHouse, 10 Bottle, 11 Cup, 12 Glasses, 13 HandGun, 14 SubmachineGun, 15 MusicalInstrument, 16 Mug, 17 FloorLamp, 18 DeskLamp, 19 Sword, 20 Cellphone, 21 DeskPhone, 22 Monitor, 23 Bed, 24 NonWheelChair, 25 WheelChair, 26 Sofa, 27 RectangleTable, 28 RoundTable, 29 Bookshelf, 30 HomePlant, 31 Tree, 32 Biplane, 33 Helicopter, 34 Monoplane, 35 Rocket, 36 Ship, 37 Motorcycle, 38 Car, 39 MilitaryVehicle, 40 Bicycle
models of the benchmark database primarily based on semantic and functional concepts and secondarily based on shape attributes. As there is no standard measure of difference or similarity between two shapes, in the new shape benchmark two researchers were assigned as assessors to manually classify objects into ground truth categories. When there were disagreements about which category some objects should belong to, another researcher was assigned as a third assessor to make the final decision. This classification work was done purely according to shape similarity, that is, geometric similarity and topological similarity. Each model was loaded into a 3D viewer and the assessor rendered it from several viewpoints to make a final judgment on shape similarity. The assessors were told to assume that, within the same class, the shapes of the objects should be highly similar to each other, while, between classes, the objects should have distinctive shape differences. In this benchmark, we equalize the classes so that they contain the same number of 3D models (20 models). There are two main reasons for this: to avoid a possible evaluation bias, as discussed in Section 2, and because some evaluation measures are unstable when a class contains few models [35]. Table 1 shows the 40 classes used in the new shape benchmark; each class contains 20 models of objects that are common in daily life.
4 Evaluation Measures
The procedure of information retrieval evaluation is straightforward. In response to a given set of users’ queries, an algorithm searches the benchmark database and returns an ordered list of responses called the ranked list(s). The evaluation of the algorithm then is transformed to the evaluation of the quality of the ranked list(s). As different evaluation metrics measure different aspects of shape retrieval behavior, in order to make a thorough evaluation of a 3D shape retrieval algorithm with high confidence, we employ a number of common evaluation measures used in the information retrieval community: Precision-Recall curve [36]; Average Precision(AP) and Mean Average Precision(MAP) [37]; E-Measures [36]; Cumulated gain based measurements [38]; Nearest Neighbor (NN), First-Tier (Tier1) and Second-Tier (Tier2) [24].
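As a concrete example of how such a ranked list is scored, the sketch below computes Average Precision and MAP. It assumes that the query model itself is excluded from the returned list, so that each query has class_size − 1 (here 19) relevant models; other conventions would change the numbers slightly.

```python
import numpy as np

def average_precision(ranked_labels, query_label, class_size=20):
    """Mean of the precision values at each rank where a relevant model
    (same class as the query) is retrieved, divided by the number of
    relevant models (class_size - 1, the query itself being excluded)."""
    relevant, hits, precisions = class_size - 1, 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return float(np.sum(precisions)) / relevant if relevant > 0 else 0.0

def mean_average_precision(rankings, query_labels):
    return float(np.mean([average_precision(r, q) for r, q in zip(rankings, query_labels)]))
```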
5 Comparison Experiments and Discussion
In this section, in order to examine how the benchmark reveals features of different shape matching algorithms, we include several kinds of algorithms to compare on the new benchmark. Moreover, comparison experiments are conducted on both the entire benchmark and individual classes of the benchmark to get a comprehensive understanding of shape matching algorithms. First we classify the algorithms into several categories according to the kind of features they need and the way they extract shape features.
– View-based algorithms: Light Field Descriptor (LFD) [39]; Depth Buffer-Based (DepthBuffer) [40]; Silhouette-Based (SIL) [40].
– Statistic-based algorithms: D2 Shape Distributions (D2) [41]; AAD Shape Distributions (AAD) [42]; Linearly Parameterized Statistics (PS) [43].
– Ray-based algorithms: Ray-based with Spherical Harmonic representation (RSH) [40].
– Multiple descriptors: Hybrid Descriptors (Hybrid) [44]; Multiple Resolution Surflet-Pair-Relation Histograms (RSPRH) [45]; Exponentially decaying Euclidean distance transform (EDT) [40].
Now we perform comparison experiments on the whole benchmark using the evaluation measures described in Section 4. Figure 2 shows the precision-recall curves, the mean DCG curves and the other measure scores of the 10 algorithms. Our main findings from the experiments are as follows:
– View-based methods obtain considerable results on the whole benchmark. We conjecture that the reasons are that they extract high level features of a 3D shape and that the ground truth database was established mainly according to visual shape similarity. The idea behind view-based shape descriptors is that two similar shapes should look similar from different viewpoints, which corresponds directly with the way human beings judge shape similarity.
– Multiple descriptors also get very good results. In Figure 2 (left), for increasing recall values (above 0.2), the hybrid descriptor outperforms the light field descriptor. The reason might be that this descriptor takes the strengths of each single shape descriptor, extracting and integrating the most important features at both high and low level.
– The performance of statistic-based methods is not as good as that of the other kinds of algorithms. One possible reason is that statistical features alone are not discriminative enough to distinguish shapes with high accuracy, and that they are relatively low level features. Therefore, new and more powerful statistical features should be explored to obtain better retrieval performance.
Comparison results on individual classes of the benchmark. In this subsection, we perform comparison experiments on every individual class given
Fig. 2. The precision-recall curve on whole benchmark (top left), the mean DCG curve on whole benchmark (top right), other comparison results on the whole benchmark (bottom left) and the MAP scores of the evaluated algorithms on each individual class of the proposed benchmark (bottom right)
in Table 1, and explore how different algorithms perform on a specific kind of 3D objects. The Mean Average Precision (MAP) is used to evaluate the performance of each algorithm on the 40 individual classes of the benchmark. Figure 2 (bottom right) shows the MAP scores of the ten algorithms on each individual shape class of the new benchmark. Our main findings from the experiments on individual classes are as follows:
– Most algorithms do especially well on geometrically simple models, for example Fish, Glasses and Sword.
– The performance of different shape matching algorithms on a specific class can vary a lot. Likewise, a shape matching algorithm can obtain quite different evaluation scores on different classes. The reason might be that, due to the characteristics of a shape matching algorithm, extracting discriminative features is easy for certain classes of objects but not for others. So, besides comparing results on the whole benchmark, it is helpful to further examine the behavior of a shape matching algorithm on each single class.
6 The Reliability of the New Shape Benchmark
In this section, we explore the reliability of the newly proposed benchmark by testing the effect of class set size on retrieval error. Voorhees and Buckley [46] proposed a method to estimate the reliability of retrieval experiments by computing the probability of making wrong decisions between two retrieval systems over two retrieval experiments. They also showed how the topic set size affects the reliability of retrieval experiments. A theoretical justification of this method has been discussed by Lin and Hauptmann [47]. We use this method to conduct an experiment and test the reliability of the new shape benchmark. The procedure for computing error rates was described in [46]; we summarize it as follows:
– Randomly split the benchmark database into two equal sets, each containing 20 classes.
– Use the MAP scores of the 10 algorithms described in the last section, so that the number of algorithm pairs run on the benchmark is C(10,2) = 45. Randomly take several classes (from 2 up to 20) each time from one set and compute the MAP of every algorithm on these chosen classes. The same procedure is applied to the other set.
– The difference between two algorithms' MAP scores is categorized into one of the bins ranging from 0.00 to 0.20 with 0.01 increments (i.e. (0.00, 0.01], (0.01, 0.02], ..., (0.19, 0.20]).
– Based on step 2, count the number of swaps for all pairs of algorithms and the number of MAP differences in each bin. A swap occurs when in one set the MAP of algorithm A is bigger than that of B, while in the other set the MAP of A is smaller than that of B. A MAP difference is counted for a bin when the size of the difference between the two algorithms' MAP scores falls into that bin.
– Compute the error rate of each bin by dividing the number of swaps by the total number of MAP differences in that bin.
– For each bin of MAP difference, plot the error rates for class set sizes from 1 to 20; go over every bin and repeat the procedure.
– Estimate the error rates of the whole benchmark by extrapolating the error rates obtained with class set sizes up to half of the benchmark.
Figure 3 (top left) shows the estimated error rates with the number of classes up to 40. From this figure we can see that, for each curve, the error rate decreases with increasing class set size, and for each class set size, the error rate decreases with an increasing difference of MAP scores. By looking at the curves whose error rates do not exceed a certain value (e.g. 5%), we can estimate how much MAP difference is needed to conclude that retrieval algorithm A is better than retrieval algorithm B with 95% confidence. When the class set size increases from 15 to 40, the error rates converge stably towards zero. In particular, when the number of classes is 40, less than 0.03 difference in MAP scores is needed to guarantee that the error rate is below 5%.
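The core of this procedure, counting ranking swaps per bin of MAP difference for one random split, can be sketched as follows. This is a simplified single-split illustration; the full procedure above repeats it over many random draws of class subsets of each size and then extrapolates the error rates.

```python
import numpy as np

def swap_error_rates(map_a, map_b, n_bins=20, bin_width=0.01):
    """Error rate per bin of MAP difference for one pair of class subsets.

    map_a[s] and map_b[s] are system s's MAP scores on subsets A and B; a swap
    is counted when the sign of the pairwise difference flips between subsets."""
    swaps, totals = np.zeros(n_bins), np.zeros(n_bins)
    n_sys = len(map_a)
    for i in range(n_sys):
        for j in range(i + 1, n_sys):
            d_a, d_b = map_a[i] - map_a[j], map_b[i] - map_b[j]
            b = min(int(abs(d_a) / bin_width), n_bins - 1)   # bin by |MAP difference|
            totals[b] += 1
            if d_a * d_b < 0:                                # ranking swapped between sets
                swaps[b] += 1
    return np.where(totals > 0, swaps / np.maximum(totals, 1), 0.0)
```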
Fig. 3. Extrapolated error rates vs. class set sizes up to the whole benchmarks
We also conduct the same experiments on other different shape benchmarks: the base training database in PSB(90 classes) (Figure 3 (bottom right)) [6], the CCCC shape benchmark(55 classes) (Figure 3 (bottom left)) [7], and the National Taiwan University(NTU)’s shape benchmark (47 classified classes) (Figure 3 (top right)) [8]. We do not include the Utrecht University’s shape benchmark because it contains too few (five) classified classes in the databases. From the experiment results on other shape benchmarks, we can see that: to guarantee that the error rate is below 5% on the whole database, the NTU shape benchmark needs at least 0.07 difference in MAP scores, the CCCC shape benchmark needs at least 0.06 difference in MAP scores, and the Princeton shape benchmark needs at least 0.04 difference in MAP scores. At class set size of 40, which is also class set size of the new proposed shape benchmark, to guarantee that the error rate is below 5%, the NTU shape benchmark needs at least 0.07 difference in MAP scores, the CCCC shape benchmark needs at least 0.08 difference in MAP scores, and the PSB needs at least 0.07 difference in MAP scores. This indicates that the number of classes of the new proposed shape benchmark is large enough to get a sufficiently low error rate compared to other shape benchmarks. From the above discussion, the experiment can be considered as a strong evidence to support the reliability of the new benchmark to evaluate shape retrieval algorithms.
7 Conclusion
We have established a new publicly available 3D shape benchmark with a suite of standard tools for evaluating generic purpose shape retrieval algorithms (the benchmark website is http://www.itl.nist.gov/iad/vug/sharp/benchmark). Several retrieval algorithms are evaluated from several aspects on this new benchmark by various measurements, and the reliability of the new shape benchmark is discussed. This reliable shape benchmark provides a new perspective to evaluate shape retrieval algorithms. It has several merits: high reliability (in terms of error rate) to evaluate 3D shape retrieval algorithms, sufficient number of good quality models as the basis of the shape benchmark, equal size of classes to minimize the bias of evaluation.
References
1. http://www.designrepository.org/
2. http://www.rcsb.org/pdb/home
3. http://store.sae.org/caesar/
4. http://shapes.aim-at-shape.net/
5. http://www-c.inria.fr/gamma/gamma.php
6. http://shape.cs.princeton.edu/search.html
7. http://merkur01.inf.uni-konstanz.de/CCCC/
8. http://3d.csie.ntu.edu.tw/
9. http://www.cs.uu.nl/centers/give/multimedia/3Drecog/3Dmatching.html
10. http://3d-search.iti.gr/3DSearch
11. Min, P., Kazhdan, M., Funkhouser, T.: A comparison of text and shape matching for retrieval of online 3d models. In: Proc. European conference on digital libraries, pp. 209–220 (2004)
12. Tangelder, J.W.H., Veltkamp, R.C.: A survey of content based 3d shape retrieval methods. In: Proceedings of the Shape Modeling International, pp. 145–156 (2004)
13. Bustos, B., Keim, A.D., Saupe, D., Schreck, T., Vranic, V.D.: Feature-based similarity search in 3d object databases. ACM Computing Surveys 37(4), 345–387 (2005)
14. Iyer, N., Jayanti, S., Lou, K., Kalyanaraman, Y., Ramani, K.: Three-dimensional shape searching: state-of-the-art review and future trends. Computer-Aided Design 37(5), 509–530 (2005)
15. http://trec.nist.gov/
16. http://eureka.vu.edu.au/∼grubinger/IAPR/TC12 Benchmark.html
17. http://www.imageclef.org/
18. Smeaton, A.F., Over, P., Kraaij, W.: Evaluation campaigns and trecvid. In: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pp. 321–330 (2006)
19. Orio, N.: Music retrieval: A tutorial and review. Foundations and Trends in Information Retrieval 1, 1–90 (2006)
20. Typke, R., Wiering, F., Veltkamp, R.C.: A survey of music information retrieval systems. In: Proceedings of the 6th International Conference on Music Information Retrieval, pp. 153–160 (2005)
21. http://www.aimatshape.net/event/SHREC
22. Veltkamp, R.C., Ruijsenaars, R., Spagnuolo, M., van Zwol, R., ter Haar, F.: Shrec2006 3d shape retrieval contest, technical report uu-cs-2006-030. Technical report, Department of Information and Computing Science, Utrecht University (2006) 23. Veltkamp, R.C., ter Haar, F.B.: Shrec2007 3d shape retrieval contest, technical report uu-cs-2007-015. Technical report, Department of Information and Computing Science, Utrecht University (2007) 24. Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The princeton shape benchmark. In: Proceedings of the Shape Modeling International, pp. 167–178 (2004) 25. Iyer, N., Jayanti, S., Lou, K., Kalyanaraman, Y., Ramani, K.: Developing an engineering shape benchmark for cad models. Computer-Aided Design 38(9), 939–953 (2006) 26. Zhang, J., Siddiqi, K., Macrini, D., Shokouf, A., Dickinson, S.: Retrieving articulated 3-d models using medial surfaces and their graph spectra. In: International Workshop On Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 285–300 (2005) 27. Voorhees, E.M.: Variations in relevance judgments and the measurement of retrieval effectiveness. In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 315–323 (1998) 28. Papadakis, P., Pratikakis, I., Perantonis, S., Theoharis, T.: Efficient 3d shape matching and retrieval using a concrete radialized spherical projection representation. Pattern Recognition 40, 2437–2452 (2007) 29. Laga, H., Takahashi, H., Nakajima, M.: Spherical wavelet descriptors for contentbased 3d model retrieval. In: Proceedings of the IEEE International Conference on Shape Modeling and Applications, pp. 15 (2006) 30. http://bithack.se/methabot/start 31. Veleba, D., Felkel, P.: Detection and correction of errors in surface representation. In: The 15th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (2007) 32. Jones, S., Van Rijsbergen, C.J.: Information retrieval test collections. Journal of Documentation 32, 59–75 (1976) 33. Leung, C.: Benchmarking for content-based visual information search. In: Laurini, R. (ed.) VISUAL 2000. LNCS, vol. 1929, pp. 442–456. Springer, Heidelberg (2000) 34. http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/ segbench/ 35. Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 33–40 (2000) 36. Rijsbergen, C.J.V.: Information Retrieval, 2nd edn. Butterworths (1979) 37. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw Hill Book Co., New York (1983) 38. Jarvelin, K., Kekalainen, J.: Ir evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 41–48 (2000) 39. Chen, D., Tian, X., Shen, Y., Ouhyoung, M.: On Visual Similarity Based 3 D Model Retrieval. Computer Graphics Forum 22, 223–232 (2003) 40. Vranic, D.V.: 3d model retrieval. PH.D thesis, University of Leipzig, German (2004) 41. Osada, R., Funkhouser, T., Chazelle, B., Dobkin, D.: Shape distributions. ACM Transactions on Graphics (TOG) 21, 807–832 (2002)
42. Ohbuchi, R., Minamitani, T., Takei, T.: Shape-similarity search of 3d models by using enhanced shape functions. International Journal of Computer Applications in Technology 23, 70–85 (2005) 43. Ohbuchi, R., Otagiri, T., Ibato, M., Takei, T.: Shape-similarity search of threedimensional models using parameterized statistics. In: Proceedings of the 10th Pacific Conference on Computer Graphics and Applications, pp. 265–274 (2002) 44. Vranic, D.V.: Desire: a composite 3d-shape descriptor. In: Proceedings of the IEEE International Conference on Multimedia and Expo. (2005) 45. Ohbuchi, R., Hata, Y.: Combining multiresolution shape descriptors for 3d model retrieval. In: Proc. WSCG 2006 (2006) 46. Voorhees, E.M., Buckley, C.: The effect of topic set size on retrieval experiment error. In: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 316–323 (2002) 47. Lin, W.H., Hauptmann, A.: Revisiting the effect of topic set size on retrieval error. In: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 637–638 (2005)
Shape Extraction through Region-Contour Stitching Elena Bernardis and Jianbo Shi University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104 {elber,jshi}@cis.upenn.edu
Abstract. We present a graph-based contour extraction algorithm for images with low contrast regions and faint contours. Our innovation consists of a new graph setup that exploits complementary information given by region segmentation and contour grouping. The information of the most salient region segments is combined together with the edge map obtained from the responses of an oriented filter bank. This enables us to define a new contour flow on the graph nodes, which captures region membership and enhances the flow in the low contrast or cluttered regions. The graph setup and our proposed region based normalization give rise to a random walk that allows bifurcations at junctions arising between region boundaries and favors long closed contours. Junctions become key routing points and the resulting contours enclose globally significant regions.
1 Introduction

We consider the problem of extracting salient shapes from an image, which is an important part of any image analysis. This problem has been studied in the context of region segmentation [11,7], contour grouping [12,13,6,2,10], as well as a combination of them [14,4]. For many applications, the key remaining challenge is detecting faint contours and boundaries along low contrast regions. Traditional approaches that rely only on local edge detection in order to detect all faint contour boundaries have the side effect of generating many additional spurious edges. On the one hand, contour grouping techniques, such as edge linking [3] and more recent methods such as untangling cycles [15], are often used as a way to prune out spurious edges and complete missing ones. The downside is that these methods depend on fragile local cues, and fail to take into account the global information provided by the regions along the boundary. On the other hand, region segmentation such as Normalized Cuts (Ncut) [11] has the ability to clean up spurious edges and bridge gaps among missing or faint contours. However, in order to find all the salient segments, oversegmentation is needed, which leads to fragmentation along the salient shape boundary. The intuition behind our method, as shown in Figure 1, is to combine the Ncut segmentation of an image (shown in Fig. 2) with a flow on its edge map. There are two key insights: (1) region information provides edge boundaries with more global information about how they should connect with each other; from this more robust flow information, one can extract salient contours more easily even under low contrast and clutter; (2) we only use soft segmentation eigenvectors, which capture the likelihood of region segmentation. This allows us to retain edges which do not line up with the segmentation
Fig. 1. The intuition behind our method: each segmentation region induces directional flows on the region boundaries. Right: each image pixel could have multiple induced flow directions, one for each nearby segmentation region. Contour grouping using the induced edge flow results in the stitching of fragmented segmentation regions.
Fig. 2. Region Segmentation and Lorentz force on edges. We compute the Ncut image segmentation eigenvectors to obtain a set of soft segmentation maps, one for each region. Using the soft segmentation map as negative particle charges, we compute the force on the initial image edges. Its direction encodes the local shape of the segmentation region, while the magnitude reflects the saliency of the image region, containing more global information than local image contrast.
boundaries and could be washed out in the discretized segmentation. In these cases, the contours extract the low contrast region information from the soft eigenvectors. This paper is organized as follows: the graph formulation is given in section 2; in section 3, we develop a random walk graph representation for contour extraction; an analogy to cyclic random walk is given in section 4; computations for extracting salient contours are in section 5 and experiments in section 6; Section 7 concludes the paper.
2 Graph Setup

2.1 From Filter Responses to Edge Mask

Motivated by human vision, we define the image gradient using oriented edge energy:

OE_{Φ,α} = (I ∗ f^e_{Φ,α})² + (I ∗ f^o_{Φ,α})²,   (1)

where f^e_{Φ,α}, f^o_{Φ,α} are a quadrature pair of even and odd symmetric filters at orientation Φ and scale α. Most algorithms detect edges by applying non-maximum suppression on OE_{Φ,α} across different orientations Φ to obtain the edge orientation Φ_max, and then localize the edge by checking zero-crossings in f^e_{Φ_max,α}. Such an edge detection procedure often destroys edges around junctions. At junctions, the boundary pixels can have multiple orientations; forcing them to make a hard choice on a single edge orientation leads to erroneous orientation estimates. We compute the edge zero crossings for each oriented f^e_{Φ,α}. To remove spurious edge detections (those that extend from the boundary into the surrounding flat image region), we apply non-maximal suppression across the orientations, allowing only pixels NOT on a zero crossing to suppress their neighbors. We initialize the edge mask E to include the surviving zero crossings.
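A possible realization of the oriented energy of Eq. (1) is sketched below using an even/odd (cosine/sine) Gabor pair as the quadrature filters; the paper does not prescribe this particular filter bank, and the size and frequency parameters are illustrative.

```python
import numpy as np
from scipy.ndimage import convolve

def gabor_pair(phi, sigma, size=15, freq=0.25):
    """Even and odd Gabor filters at orientation phi, one possible quadrature pair."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(phi) + y * np.sin(phi)                 # coordinate along phi
    g = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))   # Gaussian envelope
    return g * np.cos(2 * np.pi * freq * u), g * np.sin(2 * np.pi * freq * u)

def oriented_energy(image, phi, sigma):
    """OE_{phi,alpha} = (I * f^e)^2 + (I * f^o)^2, Eq. (1)."""
    f_e, f_o = gabor_pair(phi, sigma)
    image = np.asarray(image, dtype=float)
    return convolve(image, f_e) ** 2 + convolve(image, f_o) ** 2
```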
2.2 Region Segmentation Using Normalized Cuts

We use the NCut [11] graph partitioning setup to extract image region information. The set of points in the image space is represented as a weighted undirected graph G^region = (V^region, W^region), where the nodes V^region of the graph are the image pixels and W^region is a similarity function between pairs of nodes computed by Intervening Contours [8] directly using f^e_{Φ,α}, f^o_{Φ,α}. The weight matrix W^region is used to compute the NCut soft eigenvectors, which will be denoted by {v^(1), . . . , v^(n)}. Each eigenvector can be seen as an image itself, with the brightest pixels corresponding to the hard segmentation region labeling {S^(1), . . . , S^(n)}.

2.3 From Region Segmentation to Lorentz Edge Flow on the Edge Mask

Our method contrasts with previous contour grouping approaches in the sense that the magnitudes and orientations associated to each edge point are not given by the local geometrical properties of the contour itself; rather, they are induced from more global information computed using both the edge mask E and the region segmentation soft eigenvectors {v^(1), . . . , v^(n)} with corresponding labels {S^(1), . . . , S^(n)}. Jalba et al. [5] introduced the charged particle model to recover shapes by modeling the image as a charged electric field. In their model, the free charges in the image regions are attracted to the contours by an electric field in which image pixels are assigned charges based on the gradient magnitude of the image. In that work, the exact contours are not known a priori and finding the contours, viewed as static entities, relies on the repulsive forces within similar regions. Inspired by this, we invert the setup and exploit the initial region segmentation information to fix the electrical particles within the regions instead, and let the electric field defined on them create a flow on the previously found
Fig. 3. Computation of the Lorentz force Fl from each of the soft region segmentation eigenvector. Left: Initial image edge detections superimposed on the segmentation eigenvector (resulting in the free particles pi shown with darker color). Dotted lines indicate the region boundary, white circles the fixed region particles ei and d denotes the distance vector ri − Rk . Computation of Fl tolerates imprecision in the region segmentation boundary. Center: an example of resulting Lorentz force Fl . Right: The corresponding directed Lorentz flow after a 90 degree rotation. The flow also encodes contour-region membership (indicated by the one sided edge flow).
edge map. For each eigenvector v^(i) we view the N image pixels as charged particles with corresponding negative electric charges q_i given by the pixels' negative eigenvector magnitude value. To each eigenvector, we superimpose the initial edge map E and we keep as fixed particles e_k only pixels that do not lie on the edges; the edge points instead will be considered as free particles p_i. It is important to note that the contours found from the region segmentation and the edge map contours do not need to coincide (as in the example illustrated in Fig. 3) and in fact the purpose of combining them is to allow one to correct the mistakes of the other. For each free particle p_i (which will later correspond to nodes of our graph), with position vector r_i, we are only interested in the Lorentz force F due to the electric field generated by the fixed charges e_k, as this force captures the external attractive interactions. Following the simplifications in [5] we omit the magnetic field. The Lorentz force F_l at each particle p_i can then be written as:

$$F_l^{p_i} = q_i \sum_{k:\, R_k \neq r_i}^{M} \frac{e_k}{4\pi\epsilon_0} \frac{r_i - R_k}{\|r_i - R_k\|^3} , \qquad (2)$$
where the summation is over the M fixed charges e_k with grid vector positions R_k that fall within a fixed radius of the free charge p_i, and ε_0 is the electrical permittivity of free space. Note that the summation only accounts for interaction from the fixed charges and does not consider forces generated by all other free particles taken in isolation. We use the Lorentz force F_l induced from the soft region segmentation to compute a flow on the initial image edges. Each contour point in the original edge mask can have up to n copies (n being the number of NCut segments) in the line graph nodes V^contour explained in the next section. We define the flow vector to be a 90 degree rotation of F_l. This flow's orientation, denoted by F_l^θ or simply θ, captures global shape information of the regions, and its magnitude F_l^mag, or simply m, reflects the region saliency instead of image contrast.
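As an illustration of Eq. (2), the Python sketch below computes such a Coulomb-style force at each edge pixel from the fixed charges in its neighborhood and rotates it by 90 degrees to obtain a flow vector. This is a minimal sketch under our own assumptions (dense NumPy arrays, the function name compute_lorentz_flow, and the radius and eps0 parameters are ours), not the authors' implementation.

```python
import numpy as np

def compute_lorentz_flow(v, edge_mask, radius=10, eps0=1.0):
    """Sketch of Eq. (2): Coulomb-like force on each free (edge) particle from
    the fixed (non-edge) charges within a given radius, then a 90-degree
    rotation to obtain the edge flow. edge_mask is a boolean array."""
    h, w = v.shape
    flow = np.zeros((h, w, 2))                     # one 2D flow vector per pixel
    free_pts = np.argwhere(edge_mask)              # free particles p_i (edge pixels)
    fixed = ~edge_mask                             # fixed particles e_k (non-edge pixels)
    charges = -v                                   # negative eigenvector value as charge
    for (y, x) in free_pts:
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        ys, xs = np.mgrid[y0:y1, x0:x1]
        d = np.stack([y - ys, x - xs], axis=-1).astype(float)        # r_i - R_k
        dist = np.linalg.norm(d, axis=-1)
        valid = fixed[y0:y1, x0:x1] & (dist > 0) & (dist <= radius)
        ek = charges[y0:y1, x0:x1]
        # e_k / (4*pi*eps0) * (r_i - R_k) / |r_i - R_k|^3
        contrib = (ek[..., None] / (4 * np.pi * eps0)) * d / (dist[..., None] ** 3 + 1e-12)
        force = charges[y, x] * contrib[valid].sum(axis=0)
        flow[y, x] = [-force[1], force[0]]         # 90-degree rotation gives the flow vector
    return flow
```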
Fig. 4. Schematic representation of our method. The phase and magnitude of the edge filter responses are used to obtain the initial line and region graphs (first row). In the line graph, the nodes are image edge pixels E computed from the filter responses (section 2.1). In the region graph, the nodes are all image pixels, the graph is undirected and the weights are computed by Intervening Contours (illustrated on the right). The region segmentation eigenvectors computed by using NCuts are then used to compute the Lorentz Edge Flow on the edge mask E. Our line graph, here denoted by a new subscript, has nodes that are duplicated copies of the edge mask E with the flow vectors given by the Lorentz Edge Flow. Bottom Row: in contrast with previous contour grouping algorithms, denoted by an old subscript, which use image gradient orientation to compute the weights (lower left figure), the new vector flow (lower right figure) includes the global properties given by the segmentation eigenvectors {v^(1), ..., v^(n)} and inherits an implicit region belonging. The graph weights (section 2.4) are computed by using the new flow's magnitude (proportional to the thickness of the arrows) and orientation as well as information of the region labeling (lower right boxes). The three different cases are best viewed in color.
To prune the number of nodes further, we threshold the magnitude of F_l and apply non-maximal suppression across the flow's orientation.

2.4 Contour Graph Weight Setup

How do we utilize the Lorentz force flow to detect salient contours? We will develop a graph formulation for this task as shown here. The first step is to define a graph G^contour = ⟨V^contour, W^contour⟩ which consists of a set V^contour of n nodes, and a directed weight matrix W^contour with dimensions |V^contour| × |V^contour|. For the rest of the paper, W^contour will be simply referred to as the directed graph W. For each pair of nodes (v_i, v_j) where v_j ∈ Nbhd(v_i) = {(i, j) : ||(x_i, y_i) − (x_j, y_j)|| ≤ δ} for a fixed distance δ, the weight w_{i→j} of the edge connecting them is defined by:

$$w_{i \to j} = w^{mag}_{i \to j}\, w^{dir}_{i \to j}\, w^{reg}_{i \to j} , \qquad (3)$$
where w^mag, w^dir and w^reg are the contour magnitude, direction and region cutting components. Flow directionality will further impose restricting conditions on which nodes i and j will be actually connected in the final weight matrix. Since there can be multiple nodes (each with its own (θ, m) pair) per image pixel, the weights of edges connecting two nodes corresponding to the same pixel are set to zero. For graph nodes (v_i, v_j) with corresponding flow magnitudes m_i and m_j and orientations θ_i and θ_j respectively, the three components are as follows:

contour magnitude w^mag_{i→j}: The higher magnitudes correspond to contours bounding a region in the current eigenvector. To enhance also smaller magnitude contours, we choose a magnitude component that prefers similar magnitudes:

$$w^{mag}_{i \to j} = \exp\left(-|m_j - m_i| / \sigma_m\right) . \qquad (4)$$
contour bending w^dir_{i→j}: the directionality of the edgels is measured by how much bending is needed in order to complete a curve between two nodes v_i and v_j. The weight is given by the co-circularity conditions in terms of the angles α_i and α_j between the node orientations and the distance vector between them:

$$w^{dir}_{i \to j} = \exp\left(-\frac{1 - \cos(2\alpha_i - \alpha_j)}{1 - \cos(2\sigma_d)}\right) . \qquad (5)$$
We restrict the bending amount by allowing only connections between nodes whose vectors satisfy cos α_i ≥ 0 and cos α_j ≥ 0. The orientations of both nodes cannot be simultaneously perpendicular with respect to the direction vector between them, i.e. cos α_i cos α_j ≠ 0. We fix σ_d = π/4.

region cutting w^reg_{i→j}: to avoid jumping between disconnected contours around the same region (hence cutting through the region rather than going around it) we look at the distance transform Δ of the regions to their boundaries (highest values will occur at the medial axis of the regions). For each region we store the maximum Δ values as {Δ^(1)_max, ..., Δ^(n)_max}. Let l_{ij} denote the set of points on the line between graph nodes
Fig. 5. Graph normalization for the directed Contour Random Walk setup. Left: Graph nodes are grouped according to the region segmentation labels S^k they arise from. We normalize the graph weights by the sum of connections reaching each region label S^k separately. This gives a boost to contours flowing between regions. Right: Effects of this normalization. When the flows belonging to each S^k are normalized to one, fainter flows that would otherwise not be able to compete with the stronger neighbors are enhanced (case b, indicated in green) instead of dissipated (case a, indicated in magenta). The 'feather' length on each arrow indicates the flow magnitude.
(v_i, v_j). We then compare the maximum point on this line, max(Δ(l_{ij})), with the highest value Δ_max of the region S^(k) in which it occurs:

$$w^{reg}_{i \to j} = \exp\left(-\max(\Delta(l_{ij})) / (\sigma_r \Delta^{(k)}_{max})\right) , \qquad (6)$$

hence penalizing cutting through the interior of the labeled regions {S^(1), ..., S^(n)}.
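The three components can be combined as in the following Python sketch, which scores one directed pair of flow nodes. It is illustrative only: the function name pairwise_weight, the default values of σ_m and σ_r, and the way the distance transform is sampled along the connecting line are our own assumptions; only σ_d = π/4 is taken from the text.

```python
import numpy as np

def pairwise_weight(pi, pj, mi, mj, thi, thj, dist_tf, labels, dmax,
                    sigma_m=0.5, sigma_d=np.pi / 4, sigma_r=0.5):
    """Sketch of Eqs. (3)-(6): directed weight between two flow nodes i -> j.
    pi, pj: (y, x) positions; mi, mj: flow magnitudes; thi, thj: flow orientations;
    dist_tf: distance transform of the regions to their boundaries;
    labels: region label image; dmax: dict of max distance-transform value per label."""
    d = np.array(pj, float) - np.array(pi, float)
    phi = np.arctan2(d[0], d[1])                       # angle of the direction vector
    ai, aj = thi - phi, thj - phi                      # angles between orientations and d
    if np.cos(ai) < 0 or np.cos(aj) < 0 or np.isclose(np.cos(ai) * np.cos(aj), 0):
        return 0.0                                     # bending restrictions
    w_mag = np.exp(-abs(mj - mi) / sigma_m)            # Eq. (4): prefer similar magnitudes
    w_dir = np.exp(-(1 - np.cos(2 * ai - aj)) / (1 - np.cos(2 * sigma_d)))  # Eq. (5)
    # Eq. (6): sample the distance transform along the line segment between i and j
    n = int(np.hypot(*d)) + 1
    line = [tuple(np.round(np.array(pi) + t * d).astype(int)) for t in np.linspace(0, 1, n)]
    samples = [dist_tf[y, x] for (y, x) in line]
    idx = int(np.argmax(samples))
    k = labels[line[idx]]                              # region in which the maximum occurs
    w_reg = np.exp(-samples[idx] / (sigma_r * dmax[k]))
    return w_mag * w_dir * w_reg                       # Eq. (3)
```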
3 Salient Contour as Persistent Cyclic Random Walk

We generate a directed random walk matrix P = D^{-1} W by normalizing the connections from each node. Normalizing the random walk matrix is known to provide a segmentation criterion robust to leakages. In contrast to the previous approaches, which normalize W by its total weighted connections [11], we choose to normalize the connections within each eigenvector separately, hence allowing the random walk to bifurcate. This is an important feature of our algorithm; it allows contours to flow on in different region segments without penalty. Since the connections of the flow in each region add up to one, this normalization effectively enhances faint contour flows that arise from low contrast regions, as well as minor flows through regions which would otherwise be suppressed by the stronger neighboring options. Recall that we can view W as column blocks W = [W_{S^(1)}, W_{S^(2)}, ..., W_{S^(n)}], ordered in terms of the original region labels {S^(1), ..., S^(n)}. The normalization for each column block [W_{S^(β)}] for β = {1, ..., n} (illustrated in Fig. 5) will then have the form D_β^{-1} [W_{S^(β)}], where D_β is a diagonal matrix with entries D_β(i, i) = Σ_j [W_{S^(β)}(i, j)].
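A minimal sketch of this per-region normalization, assuming a dense weight matrix and an array giving the region label of each graph node (function and variable names are ours):

```python
import numpy as np

def normalize_per_region(W, node_labels):
    """Sketch of the block-wise normalization: for each region label beta, the
    columns of W belonging to S^(beta) are normalized so that each row's
    connections into that region sum to one (D_beta^{-1} W_{S^(beta)})."""
    P = np.zeros_like(W, dtype=float)
    for beta in np.unique(node_labels):
        cols = (node_labels == beta)              # columns of the block W_{S^(beta)}
        block = W[:, cols]
        d = block.sum(axis=1)                     # D_beta(i, i) = sum_j W_{S^(beta)}(i, j)
        d[d == 0] = 1.0                           # avoid division by zero for isolated rows
        P[:, cols] = block / d[:, None]
    return P
```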
According to our graph setup, finding salient image contours amounts to searching for cycles in this directed graph. How would salient cycles appear in this random walk and how would they be distinguishable from generic clutter? We first notice an obvious necessary condition. If the random walk starting at a node comes back to itself with high probability, then there likely is a cycle passing through it. We denote the returning probability by Pr(i, t) = Pr(i, t | |γ| = t), where γ is a random walk cycle with length t passing through i. However, this condition alone is not enough to identify meaningful structures. Consider the case where there are many distracting branches in the random walk. In this case, paths through the branches will still return to the same node but with different path lengths. Therefore, it is not sufficient to require the paths to return only; they have to return in the same period t. Salient contours can be thought of as 1D cycles: structures that have a 2D geometry but are topologically 1D, i.e., a set of edgels with a well defined ordering, where the connections between them strictly follow that ordering. 1D cycles have a special returning probability pattern Pr(i, t). A random walk step on a 1D cycle tends to stay within the cycle, while moving a fixed amount forward in the cyclic ordering. Our task is to separate these persistent cycles from all other random walk ones. To quantify this observation, [15] introduces a peakness measure of the random walk probability pattern:

$$R(i, T) = \frac{\sum_{k=1}^{\infty} Pr(i, kT)}{\sum_{k=0}^{\infty} Pr(i, k)} , \qquad (7)$$

which computes the probability that the random walk returns at steps of multiples of T. R(i, T) being high indicates the existence of 1D cycles passing through node i. The key observation is that R(i, T) closely relates to complex eigenvalues of the transition matrix P, instead of real eigenvalues [15]:

Theorem (Peakness of Random Walk Cycles). R(i, T) depends on P's eigenvalues:
$$R(i, T) = \frac{\sum_j \left(\frac{\lambda_j^T}{1 - \lambda_j^T} \cdot U_{ij} V_{ij}\right)}{\sum_j \left(\frac{1}{1 - \lambda_j} \cdot U_{ij} V_{ij}\right)} , \qquad (8)$$

where U_{ij} and V_{ij} are respectively the left and right eigenvectors of P. This theorem shows that R(i, T) is the average of $f(\lambda_j, T) = \left(\frac{\lambda_j^T}{1 - \lambda_j^T} \cdot U_{ij} V_{ij}\right) / \sum_j \left(\frac{1}{1 - \lambda_j} \cdot U_{ij} V_{ij}\right)$. For real λ_j, f(λ_j, T) ≤ 1/T. For complex λ_j, f(λ_j, T) can be large. For example, when λ_j = s · e^{i2π/T}, s → 1, U_{ij} = V_{ij} = a ∈ R, f(λ_j, T) → ∞. It is the complex eigenvalue with proper phase angle and magnitude that leads to repeated peaks.
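The following Python sketch evaluates this peakness score numerically from an eigendecomposition of the transition matrix. It is our own illustrative code (the shrink factor that keeps |λ| strictly below 1 is an assumption made so the geometric sums converge), not the computation used in [15] or in this paper.

```python
import numpy as np

def peakness(P, T, shrink=0.99):
    """Sketch of Eq. (8): peakness R(i, T) of the return-probability pattern,
    computed from the (generally complex) eigenvalues and eigenvectors of P.
    Assumes P is diagonalizable."""
    lam, V = np.linalg.eig(P)                 # eigenvalues and right eigenvectors
    U = np.linalg.inv(V).T                    # columns of inv(V).T are the left eigenvectors
    lam = shrink * lam                        # keep |lambda| < 1 for convergence
    UV = U * V                                # elementwise products U_{ij} V_{ij}
    num = (UV * (lam**T / (1 - lam**T))).sum(axis=1)
    den = (UV * (1.0 / (1 - lam))).sum(axis=1)
    return np.real(num / den)                 # one peakness score per node i
```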
4 Circular Embedding for Contour Grouping

The above analysis shows that salient contours correspond to persistent cycles in random walk, and their persistency can be computed from the eigenvalues of the random walk. It has been shown in [15] that the eigenvectors are an approximate solution to the following ideal cost for circular embedding of salient contour grouping. A circular embedding is a mapping between the vertex set V of the original graph to a circle plus
the origin: O_circ : V → (r, θ) : O_circ(i) = x_i = (r_i, θ_i), where r_i is the circle radius, which can only take a positive fixed value r_0 or 0, and θ_i is the angle associated with each node. The ideal embedding encodes both the cut and the ordering of graph nodes, and maximizes the following score:

$$C_e(r, \theta, \Delta\theta) = \sum_{\substack{\theta_i < \theta_j \le \theta_i + 2\Delta\theta \\ r_i > 0,\; r_j > 0}} \frac{P_{ij}}{|S|} \cdot \frac{1}{\Delta\theta} , \qquad (9)$$
where S is a subset of graph nodes and Δθ = θ_j − θ_i is the average jumping angle. Optimizing this score is not an easy task. Moreover, we are not only interested in the best solution of eqn (9), but in all the locally optimal solutions, which give all the 1D structures in the graph. We find a relaxation by setting u = x, v = u · e^{−iΔθ}. We set c = t_0 e^{−iΔθ} to be a constant. Eqn. (9) can be rewritten as maximizing (u^H P v · c)/(u^H v) with u, v ∈ C^n and is equivalent to the following optimization problem:

$$\max_{u, v \in \mathbb{C}^n} (u^H P v) \quad \text{s.t.} \quad u^H v = c . \qquad (10)$$
This problem leads exactly to P's complex eigenvectors as shown in [15].
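In practice, the complex eigenvectors can be obtained with a standard eigensolver and read off as a circular embedding: the phase angle orders the nodes along a putative cycle and the magnitude indicates membership. The sketch below is our own illustration of that idea; the eigenvector index and the magnitude threshold are arbitrary assumptions.

```python
import numpy as np

def circular_embedding(P, which=1, mag_thresh=0.1):
    """Sketch: take a leading complex eigenvector of the directed transition
    matrix P and read off a circular embedding, where phase gives the cyclic
    ordering and magnitude indicates cycle membership."""
    lam, V = np.linalg.eig(P)
    order = np.argsort(-lam.real)                 # sort eigenvalues by real part
    v = V[:, order[which]]                        # one complex eigenvector
    r, theta = np.abs(v), np.angle(v)
    members = np.flatnonzero(r > mag_thresh * r.max())
    return members[np.argsort(theta[members])]    # node indices ordered by phase
```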
5 Computational Solution

One way to extract contours is to find the maximal covering cycle in the complex eigenvector space using a modified shortest path algorithm. We compute line fragments by local edge linking to reduce complexity and construct a directed graph where each node represents a fragment. Two nodes are connected by a directed edge according to the embedding phase angle (which specifies their ordering) and spatial connectivity. For any two nodes in the graph, we seek to compute the shortest path between them, which represents a contour hypothesis. We use dynamic programming to compute at each node v_j, for all its parents v_i, the recursive functions $B_j(l_i) = \max_{l_j}\big(A(l_j) + L(l_j) + d(l_i, l_j) + \sum_{k \in C_j} B_k(l_j)\big)$ and the best fragment l_j^* at which the max occurs. Here, A is the area spanned by l_i in the embedding space, d the phase overlap of l_i and l_j, L is a measure of the fragment's length, and leaf nodes only consider the terms A, d and L. The optimal path L^* is obtained by picking the fragment that maximizes $A(l_r) + L(l_r) + \sum_{k \in C_r} B_k(l_r)$ at the root node v_r and then backtracking the values l_j^* at each node v_j until a leaf node is reached. To discover more contours, we sample a set of paths around the optimal one. We compute this path by sampling over the marginal distributions, as in [1], given recursively by the functions $S_j(l_i) \propto \prod_{v_c \in C_j} p(l_i, l_j) S_c(l_j)$, where p(l_i, l_j) is the joint probability $p(l_i, l_j) \propto \exp(-(A(l_i) + d(l_j, l_i) + L(l_i)))$. For leaf nodes, S_j(l_i) only considers the term p(l_i, l_j). At the root node v_r, we sample from $\propto \prod_{v_c \in C_r} S_c(l_r)$. All contour fragments for nodes v_j thereafter are sampled from the marginals S_j(l_i) until a leaf node is reached.
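The shape of this dynamic-programming pass and its backtracking can be sketched as follows. The scoring terms A, L and d are abstracted as callables, and the tree and fragment data structures as dictionaries; this is a schematic of the recursion for B_j(l_i) under our own naming, not the authors' implementation, and it omits the sampling variant.

```python
def stitch_contour(children, fragments, A, L, d, root):
    """Schematic sketch of the DP recursion B_j(l_i) and its backtracking.
    children[v] lists the child nodes of v; fragments[v] its candidate fragments;
    A, L, d are the area, length and phase-overlap scoring terms."""
    B, arg = {}, {}

    def solve(vj, parent_frags):
        # B[vj][li] = max_{lj} ( A(lj) + L(lj) + d(li, lj) + sum_k B[vk][lj] )
        for vc in children.get(vj, []):
            solve(vc, fragments[vj])
        B[vj], arg[vj] = {}, {}
        for li in parent_frags:
            scored = []
            for lj in fragments[vj]:
                s = A(lj) + L(lj) + (d(li, lj) if li is not None else 0.0)
                s += sum(B[vc][lj] for vc in children.get(vj, []))
                scored.append((s, lj))
            B[vj][li], arg[vj][li] = max(scored, key=lambda t: t[0])

    solve(root, [None])                        # the root has no parent fragment
    # Backtrack: pick the best root fragment, then follow arg[] down the tree.
    path, stack = {root: arg[root][None]}, [root]
    while stack:
        vj = stack.pop()
        for vc in children.get(vj, []):
            path[vc] = arg[vc][path[vj]]
            stack.append(vc)
    return path                                 # one chosen fragment per node
```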
Algorithm 1. Global Contour Flow Stitching
1: From the initial filter responses, use the magnitude and phase to obtain (a) an initial Edge Mask E (section 2.1) and (b) NCut Region Segmentation (section 2.2) eigenvectors.
2: Use the NCut eigenvectors to compute the Lorentz Edge Flow F_l on E (section 2.3).
3: Define a new directed contour graph G^contour = ⟨V^contour, W^contour⟩ in which the nodes are duplicated copies of the edge mask E with the flow vectors given by F_l.
4: Compute graph weights using the new flow's magnitude and orientation as well as information of the region labeling: w_{i→j} = w^mag_{i→j} w^dir_{i→j} w^reg_{i→j} (section 2.4).
5: Solve for the eigenvectors of this new directed graph: V(D − W) = Vλ (section 3).
6: Extract maximum covering cycles using a modified shortest path algorithm in the complex embedding space (section 5), hence extracting the corresponding salient contours in the image.
6 Experiments

Examples of different contours extracted from different eigenvectors for the tooth image are shown in Fig. 6. We compare our extracted contours with NCut on several objects of the Berkeley Segmentation Database [9] in Fig. 7. We select from the extracted samples the contours enclosing the regions that best match the object by shape context on the boundaries. Piecing segments into an object from the oversegmentation would require searching through exponentially many combinations of the fragmented regions. Our contour grouping can draw samples from limited salient contours (the number of samples is quadratic in the size of contour fragments). Our contours improve the efficiency of segmentation and fix small leakage problems.
Fig. 6. Examples of paths found by sampling. Center: The top eigenvalues sorted by their real components. For the eigenvector associated to each eigenvalue, we sample several contours. Displayed are two eigenvectors with two sampled contours each.
7 Discussion

The algorithm described in this paper, summarized in Algorithm 1, extracts salient closed contours by exploiting region segmentation. The results produce segments that withstand varying contrast on the object boundaries and are therefore less fragmented. We contrast our method with two related approaches:

1. Region Stitching using Hard Segmentation: The naive approach to region stitching would be to oversegment an image and use contour grouping over the region boundaries
Fig. 7. Extracted contours for several object images taken from the Berkeley segmentation dataset. From left to right: original image, ground truth object silhouette, NCut region segmentation, segmentation given by our extracted contour, untangling cycles contour grouping result (each contour with a different color), and finally, our extracted contour (yellow) on the original edge map (black). Note that the contour can be disconnected since we do allow jumping in the shortest path algorithm. The jumping is guided by the phase angle of the eigenvector in the embedding space, which allows us to follow the contour throughout the boundary even if image continuity is broken. For all the images we used 30 eigenvectors for the NCut segmentation and 20 for the contour grouping.
to stitch the segments back together. The main disadvantage is that the amount of segmentation needed is unknown a priori, and unnecessary overfragmentation increases computation and reduces region saliency. Moreover, this contour grouping would find cycles through the various region boundaries independent of the regions' size. Our approach only makes use of a few salient region segments and enables the region boundary map to be juggled using the soft eigenvector information and the initial edge mask.

2. Contour Grouping using Soft Segmentation: The method in [7] indeed uses global information to extract contours and localize junctions. However, as this method does not involve any grouping, it does not resolve the ambiguity of contour grouping in the places where three regions merge (i.e., junctions). Our approach introduces a normalization that enables the contour to flow through the junctions given by the region segment boundaries, resulting in long contours. Since the contour has an inherent region belonging associated with it, the extracted contour is also guaranteed to enclose a region.

In this paper we highlight how salient contours enclosing objects can be detected by combining the complementary power of region segmentation and contour grouping. Regions bridge the gaps between contours due to faint boundaries. Contour flows stitch oversegmented regions into large and salient ones, which can greatly simplify tasks such as object extraction and detection. Results on real images have shown the great potential of our approach for extracting salient object boundaries.

Acknowledgment. The authors would like to thank Philippos Mordohai for many useful comments and Liming Wang for helping to generate the final contour figures.
References
1. Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial Structures for Object Recognition. International Journal of Computer Vision 61(1), 55–79 (2005)
2. Fisher, B., Buhmann, J.M.: Path-Based Clustering for Grouping of Smooth Curves and Texture Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(4), 513–518 (2003)
3. Guo, C.E., Zhu, S.C., Wu, Y.N.: Primal Sketch: Integrating Texture and Structure. Computer Vision and Image Understanding 106(1), 5–19 (2007)
4. Jacobs, D.W.: Robust and Efficient Detection of Salient Convex Groups. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(1), 23–37 (1996)
5. Jalba, A.C., Wilkinson, M.H.F., Roerdink, J.B.T.M.: CPM: A Deformable Model for Shape Recovery and Segmentation Based on Charged Particles. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(10) (2004)
6. Mahamud, S., Williams, L., Thornber, K., Xu, K.: Segmentation of Multiple Salient Closed Contours from Real Images. IEEE Transactions on Pattern Analysis and Machine Intelligence (2003)
7. Maire, M., Arbelaez, P., Fowlkes, C., Malik, J.: Using Contours to Detect and Localize Junctions in Natural Images. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, AK (2008)
8. Malik, J., Belongie, S., Leung, T., Shi, J.: Contour and Texture Analysis for Image Segmentation. International Journal of Computer Vision (2001)
9. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In: Proc. 8th Int'l Conf. Computer Vision, vol. 2, pp. 416–423 (July 2001)
10. Medioni, G.G., Guy, G.: Inferring Global Perceptual Contours from Local Features. In: IUW (1993)
11. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
12. Tu, Z.W., Zhu, S.C.: Parsing Images into Regions, Curves, and Curve Groups. International Journal of Computer Vision 26(2), 223–249 (2006)
13. Ullman, S., Shashua, A.: Structural Saliency: The Detection of Globally Salient Structures Using a Locally Connected Network. In: MIT AI Memo (1988)
14. Wang, S., Kubota, T., Siskind, J., Wang, J.: Salient Closed Boundary Extraction with Ratio Contour. IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)
15. Zhu, Q., Song, G., Shi, J.: Untangling Cycles for Contour Grouping. In: 11th IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil (2007)
Difference of Gaussian Edge-Texture Based Background Modeling for Dynamic Traffic Conditions

Amit Satpathy^{1,2}, How-Lung Eng^2, and Xudong Jiang^1

^1 School Of EEE, Nanyang Technological University, Nanyang Avenue, Singapore 639798
[email protected], [email protected]
^2 Institute for Infocomm Research, A*STAR (Agency for Science, Technology and Research), 1 Fusionopolis Way, Connexis, Singapore 138632
{asatpathy,hleng}@i2r.a-star.edu.sg
Abstract. A 24/7 traffic surveillance system needs to perform robustly in dynamic traffic conditions. Despite the amount of work that has been done in creating suitable background models, we observe limitations with the state-of-the-art methods when there is minimal color information and the background processes have a high variance due to lighting changes or adverse weather conditions. To improve the performance, we propose in this paper a Difference of Gaussian (DoG) edge-texture based modeling for learning the background and detecting vehicles in such conditions. Background DoG images are obtained at different scales and summed to obtain an Added DoG image. The edge-texture information contained in the Added DoG image is modeled using the Local Binary Pattern (LBP) texture measure. For each pixel in the Added DoG image, a group of weighted adaptive LBP histograms are obtained. Foreground vehicles are detected by matching an existing histogram obtained from the current Added DoG image to the background histograms. The novelty of this technique is that it provides a higher level of learning by establishing a relationship an edge pixel has with its neighboring edge and non-edge pixels which in turn provides us with better performance in foreground detection and classification.
1 Introduction
Background modeling addresses the challenge of modeling processes of each background pixel such that presence of foreground in a current frame could be detected by means of subtraction. Existing background modeling methods can be broadly classified into two general approaches - pixel-based approach and block-based approach. In pixel-based approach, each pixel is assumed to be independent of each other and has an individual modeling that characterizes it. In block-based approach, an image is divided into blocks and features associated with a particular block are computed. Foreground objects are usually detected by comparing blocks.
An early pixel-based method could be referred to the work by Wren et al. [1] where the color properties of each pixel was modeled as a single Gaussian distribution in YUV color space. A recursive updating scheme was adopted to allow adaptation to illumination changes in the background. However, the model was generally more confined for static background and indoor environment. In an outdoor environment, Wren et al.'s algorithm performed poorly due to the dynamic nature of the background processes which involved moving objects and illumination changes of greater variance compared to indoors. In an effort to combat this problem, Grimson et al., in [2] and [3], proposed to model each pixel value as a mixture of weighted Gaussian distributions (MoG). The background model for each pixel was recursively updated to handle lighting changes and repetitive motions of objects in the background. In an approach to avoid assuming any probability distributions of a pixel, several works came up with non-parametric methods for background modeling. In the W4 framework [4], each background pixel process was determined by its minimum intensity, maximum intensity and maximum difference in intensity between 2 consecutive frames. In [5] and [6], the authors proposed using a Gaussian (Normal) kernel estimator to estimate the probability density function of a pixel value. In [7], Jabri et al. proposed fusing gradient information with color properties to cope with shadows and unreliable color cues. Kiratiratanapruk et al. [8] used Canny edge detectors to extract edges of objects from traffic surveillance videos. The outputs from the Canny edge detectors were then dilated or eroded to connect the disconnected objects. In [9], Bernier et al. decided to include contour based features using Gaussian kernels to determine the histogram of oriented gradients at each pixel and to extend this histogram along with a spatial Gaussian kernel over a local area to obtain a polar histogram. Mason et al. [10] proposed two different histogram methods - edge and color - for computations of the histograms. They found that edge histograms gave better performance compared to color histograms. In [11], the authors proposed the exploitation of texture in video frames to model the background and detect foreground objects. A popular texture measure method, Local Binary Pattern (LBP), was proposed as a modeling method. The LBP operator codes the pixels of an image block by thresholding the neighboring pixels with the center pixel and considering the result as a binary number which is the LBP code.

$$LBP_{B,R}(x_c, y_c) = \sum_{b=0}^{B-1} s(g_b - g_c)\, 2^b , \quad s(x) = \begin{cases} 1 & x \ge 0 \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$
where g_c corresponds to the grey value of the center pixel (x_c, y_c), g_b to the grey values of the B neighborhood pixels, and B corresponds to the number of user-defined bits (sampling points) used for representation of the LBP binary code at a user-defined distance of R from the center pixel. The frame was divided into overlapping square blocks. For each block, a group of M weighted LBP histograms was calculated to model the dynamic nature of the scene. The background model updating procedure was similar to that proposed by [3].
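For concreteness, the following Python sketch computes LBP codes and a normalized block histogram in the spirit of Eq. (1). It is a minimal illustration with our own simplifications (nearest-neighbour circular sampling, wrap-around borders via np.roll, generic defaults for B and R), not the implementation used in [11] or [12].

```python
import numpy as np

def lbp_codes(img, B=8, R=1):
    """Minimal sketch of Eq. (1): per-pixel LBP code obtained by thresholding B
    circularly sampled neighbours at radius R against the centre pixel."""
    img = np.asarray(img, dtype=float)
    codes = np.zeros(img.shape, dtype=np.int64)
    for b in range(B):
        a = 2.0 * np.pi * b / B
        dy, dx = int(round(R * np.sin(a))), int(round(R * np.cos(a)))
        neighbour = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)   # g_b
        codes |= ((neighbour - img) >= 0).astype(np.int64) << b       # s(g_b - g_c) * 2^b
    return codes

def lbp_histogram(block_codes, B=8):
    """Normalized LBP histogram (2^B bins) of an image block of codes."""
    hist = np.bincount(block_codes.ravel(), minlength=2 ** B).astype(float)
    return hist / hist.sum()
```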
Foreground detection was achieved by comparing an existing histogram of a block with the background histograms for the block. If a match was found, the image block was classified as background, and as foreground otherwise. In [12], the authors modified the classification of foreground to a pixel of the block to extract the shape of the foreground objects more accurately. The interesting fact about this method was that the algorithm was computationally efficient, non-parametric and invariant to gray-scale changes. A summary of the information used for background modeling and the type of approach of the discussed background modeling methods and the proposed algorithm in this paper is given in Table 1. One of the main problems of using the color information based methods in Table 1 is that they rely on the color properties of pixels for determination of background and foreground. In a traffic surveillance problem, color information is minimal during periods of low light like in the night or early morning. Moreover, these methods are not robust to high noise presence in the images. For instance, during rainy days, the methods are unable to perform efficiently in the presence of raindrops and the highly dynamic watery surface of the roads. Furthermore, they are unable to handle situations in the night where the headlights of vehicles alter the illumination of the road surfaces in the scene. The edge information based methods in Table 1, which do not use color properties, use oriented edge filters to extract edges which only pick out edges oriented at particular angles and extract edges of particular strengths. Moreover, the resultant output is composed of disjointed lines (edges) which do not contain much information on the spatial structure of the background or the foreground. Furthermore, the edge-based methods do not examine the relation an edge pixel has with its neighbors. In the block-based approaches for edge-based methods, histograms corresponding to orientation angles are calculated over a block which only describes the location of each edge pixel in the spatial image. In [12], the authors presented an extensive comparison of their method with the state-of-the-art methods discussed in this paper. Motivated by their results, whereby their algorithm gave much better performance for background modeling and foreground detection, we decided to adopt the LBP Texture Measure Modeling for background subtraction. Instead of using color properties for texture modeling, we propose the adoption of unoriented edge-based features for texture modeling instead. The motivation for choosing edges as features is that edge-based features gave better performance compared to using color features [9][10]. In this paper, we propose the usage of an Added Difference of Gaussian (DoG) image and the LBP texture measure to model the background and detect foreground objects in images. The Added DoG image contains an unoriented edge-texture representation of the grayscale spatial image combined from DoG images at different scales. The edge-texture information of the Added DoG image is then modeled using the LBP Texture Measure Modeling method discussed in [12]. Using this method has 3 major advantages. Firstly, reliance on color properties is minimized. Secondly, the Added DoG image contains edges of all orientations, which makes the modeling more accurate, and the edges form textures that support
Table 1. Summary of Background Modeling Methods Discussed

Method                                     Information Used for Modeling   Approach
Single Gaussian Distribution [1]           Color                           Pixel
Mixture of Gaussians [3]                   Color                           Pixel
W4 Framework [4]                           Color                           Pixel
Gaussian Kernel Estimator [6]              Color                           Pixel
Jabri et al.'s Model [7]                   Edge and Color                  Pixel
Canny Edge Model [8]                       Edge                            Pixel
Contour Based Model [9]                    Edge                            Block
Mason et al.'s Model [10]                  Edge or Color                   Block
LBP Texture Measure Model [11]             Color                           Block
Proposed DoG Edge-Texture Based Modeling   Edge                            Block
block-based texture modeling. Thirdly, this method exploits the relationship an edge pixel has with its neighbors, which enables the structural relationship of objects in the background to be studied. The proposed method also handles illumination changes in low light conditions and situations where noise causes the background to be more dynamic.
2 Methodology for Difference of Gaussian Edge-Texture Based Modeling
In this section, we describe our proposed algorithm. The proposed algorithm consists of two parts - the generation of the Added Difference of Gaussian (DoG) image and the Local Binary Pattern (LBP) Texture Measure Modeling for the Added DoG image.

2.1 Added Difference of Gaussian Image - Edge-Based Texture Representation
For traffic surveillance videos, edges are strong features present in the data that can be used to represent the background and detect foreground objects. The background consists of straight roads with lane dividers and markings which contain strong edge presence. However, the edges are not oriented in any particular direction nor are their relative strengths known. Current edge detectors require orientation and edge strength thresholds to be specified which are not sufficient to model all the edges present in the image. Moreover, these edge detectors produce edges around structures but not within the structures which produce ”holes” in the image after thresholding. Therefore, there is not enough texture or spatial (shape) structure contained in the outputs to be used for modeling. DoG edge detectors are not sensitive to orientation of the edges and using different standard deviations yields edges of different strengths. Hence, this motivates us to explore the use of DoG edge detectors for the creation of an
Fig. 1. Three thresholded Difference of Gaussian (DoG) images of σ = 1, 2 and 4 respectively (from top left clockwise) (i = 2). The σ = 1 DoG contains the most edge-texture detail but very little spatial structure information. As σ increases, the edge-texture detail decreases but the spatial structure information of objects increases.
Fig. 2. The thresholded Added Difference of Gaussian (DoG) image on the top left and the unthresholded Added Difference of Gaussian (DoG) image on the top right. LBP histograms for a common point (indicated by the circles) in the images are shown. The thresholded DoG image yields a histogram that is peaked at a particular value (no distinction) while the unthresholded DoG produces a histogram that is more distinctive. The Added DoG image was obtained with i = 2, σ = 1 and K = 6.
edge-texture map. An unthresholded DoG edge detector can be described as follows: D(x, y, σ) = L(x, y, iσ) − L(x, y, σ) ,
(2)
where L(x, y, σ) is a Gaussian-blurred image obtained by convolving a Gaussian filter of standard deviation σ with the grayscale spatial image. i is a user-defined scale factor. The DoG edge detector can only detect the spatial structure of objects and edge strengths which are between the scales of σ and iσ. Hence, using a single DoG edge detector result is not sufficient to model the significant edges in the
image. A balance (Fig. 1) has to be obtained between the texture of the edges and the spatial structure of the objects in the background. We propose a weighted summation of DoG images to resolve the conflict between balancing the edge-texture details and the spatial structure details. Unthresholded DoG images are obtained at various scales. Using weights whose sum equals 1, they are then summed to form an unthresholded Added DoG image (Fig. 2), which is given by:

$$\lambda(x, y) = \sum_{k=0}^{K-2} \omega_k\, D(x, y, i^k \sigma) , \qquad (3)$$

where i is a scale factor, σ is the user-defined standard deviation, K is the number of Gaussian filtered images and D(x, y, σ) has been defined in (2). ω_k is a user-assigned weight for each DoG whose sum is:

$$\sum_{k=0}^{K-2} \omega_k = 1 . \qquad (4)$$
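A possible Python sketch of Eqs. (2)-(4), using Gaussian blurring from SciPy and the parameter values of Table 2 as defaults; the function name and the exact handling of the scale sequence are our own assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def added_dog(gray, sigma=1.0, i=2, K=6, weights=(0.40, 0.30, 0.15, 0.10, 0.05)):
    """Sketch of Eqs. (2)-(4): unthresholded DoG responses at K-1 scales,
    combined into a single Added DoG image by a weighted sum (weights sum to 1)."""
    gray = np.asarray(gray, dtype=float)
    blurred = [gaussian_filter(gray, sigma * (i ** k)) for k in range(K)]     # L(x, y, i^k sigma)
    dogs = [blurred[k + 1] - blurred[k] for k in range(K - 1)]                # D(x, y, i^k sigma)
    assert len(weights) == K - 1 and abs(sum(weights) - 1.0) < 1e-6
    return sum(w * d for w, d in zip(weights, dogs))                          # lambda(x, y)
```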
The motivation for using unthresholded DoG images is to keep some of the grayscale information at different scales, which is useful for shadow removal and to obtain more distinctive LBP histograms to allow for differentiation between background and foreground objects. Once the unthresholded Added DoG image is obtained, LBP Texture Measure Modeling can then be performed on the Added DoG image to model the edge-texture information.

2.2 Background Modeling Using Local Binary Pattern Texture Measure
In the original LBP Texture Measure Modeling method proposed by [11] and [12], the LBP was performed on the spatial images. The grayscale non-negative pixel values were used directly for the computation of the LBP codes. In our method, the unthresholded Added DoG image consists of positive and negative pixel values. The LBP codes are computed as follows:

$$LBP_{B,R}(x_c, y_c) = \sum_{b=0}^{B-1} s(\lambda(x_b, y_b) - \lambda(x_c, y_c))\, 2^b , \quad s(x) = \begin{cases} 1 & x \ge 0 \\ 0 & \text{otherwise,} \end{cases} \qquad (5)$$
where λ(x_c, y_c) corresponds to the unthresholded DoG value of the center pixel (x_c, y_c) and λ(x_b, y_b) to the unthresholded DoG values of the B neighborhood pixels. The Added DoG image is partitioned into N × N overlapping square blocks. Using overlapping blocks helps to extract the shape of moving objects more accurately. In the following, the background model procedure for a pixel is explained; the procedure is identical for every other pixel in the Added DoG image. The feature vector used to represent each square block is a normalized LBP histogram of 2^B bins. We denote the histogram as h_t. For a history of each square block, we model the histograms using M adaptive weighted histograms,
Fig. 3. The left diagram is the background histogram of a pixel. The center diagram is the current histogram of a pixel. The right diagram is the overlapping plot of the 2 histograms. In the right diagram, the circles in red represent the current histogram of the pixel and the ones in blue the histogram of the background of the pixel. Histogram intersection gives a value of 0.69 while Bhattacharyya distance measure gives a value of 0.94. Majority of the overlapping bins have close values indicating similarity.
{m_0, ..., m_{M−1}}. The value of M is user-determined and the sum of the weights is 1. The weight of the mth histogram is denoted by ϕ_m. In order to update the background model for the pixel, it is necessary to compare the current histogram h_t to each of the existing M adaptive weighted histograms and determine the best match. Heikkilä et al. [11][12] suggested using histogram intersection for histogram comparison in their papers. For our algorithm, we use the Bhattacharyya distance for comparison with a threshold of T_D, as the histograms are normalized:

$$D_b = \sum_{j=1}^{2^B} \sqrt{p_j q_j} , \qquad (6)$$
where j is the bin index of the histograms being compared. The reason we did not use histogram intersection is that histogram intersection assumes identical feature matching. Hence, similar features fail to be matched when this method of comparison is used (Fig. 3). The Bhattacharyya distance measure gives a value between 0 and 1 which is indicative of a probability of similarity. If the best matched distance measure falls below the threshold T_D, the background model histogram with the least weight is replaced with h_t. A new low weight is assigned to the histogram and the weights are normalized. If the best matched distance measure satisfies the threshold requirement, the bins of the background model histogram that best matches h_t are updated as follows:

$$m_k = \gamma_b h_t + (1 - \gamma_b) m_k ,$$
(7)
where γb is a user-defined learning rate between 0 and 1. The weights of all the model histograms are also updated as follows: ϕk = γw Nk + (1 − γw )ϕk ,
(8)
where γw is the user-defined learning rate and Nk is 1 for the model histogram that best matches and 0 for the others. Once the learning is complete for the
pixel, the model histograms are sorted in descending order according to their weights and the first L histograms are chosen as the background histograms using the following condition: ϕ0 + ... + ϕL−1 > TL ,
(9)
Fig. 4. Performance of the Developed Algorithm over a period of 24 hrs from early morning to late night. (Top to bottom, left to right) The results after morphological operations of the foreground detection are presented. The region-of-interest for the detection of the foreground objects is the traffic that is flowing from the bottom to the top. The robustness of the algorithm can be seen by comparing the detected foreground objects in the even columns with the original images in the odd columns.
where TL is a user-defined threshold between 0 and 1 for selection. If the background is dynamic in nature, it is preferable to select a high value for TL and low value for static backgrounds. In order to detect the foreground objects, a histogram of an image block of the current Added DoG image is compared against the M background histograms. If the distance measure value for the best match is above TD , the center pixel of the image block is classified as background. Otherwise, it is classified as foreground. The background model is continuously updated using the update model described during the training phase.
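Putting Eqs. (6)-(9) together, a per-pixel model could be maintained along the following lines. This Python sketch is our own paraphrase of the procedure (the class and attribute names, the replacement weight of 0.01, and matching against the selected background histograms are assumptions), not the authors' code.

```python
import numpy as np

class PixelModel:
    """Sketch of the per-pixel model: M adaptive weighted LBP histograms,
    matched with the Bhattacharyya measure and updated as in Eqs. (6)-(9)."""
    def __init__(self, M=5, bins=16, TD=0.77, TL=0.90, gb=0.015, gw=0.015):
        self.h = np.full((M, bins), 1.0 / bins)    # model histograms m_k
        self.w = np.full(M, 1.0 / M)               # weights phi_k
        self.TD, self.TL, self.gb, self.gw = TD, TL, gb, gw

    def update(self, ht):
        db = np.sqrt(self.h * ht).sum(axis=1)      # Eq. (6): Bhattacharyya per model
        k = int(np.argmax(db))
        if db[k] < self.TD:                        # no match: replace lowest-weight model
            k = int(np.argmin(self.w))
            self.h[k] = ht
            self.w[k] = 0.01
        else:                                      # match: Eqs. (7) and (8)
            self.h[k] = self.gb * ht + (1 - self.gb) * self.h[k]
            self.w = (1 - self.gw) * self.w
            self.w[k] += self.gw
        self.w /= self.w.sum()

    def is_background(self, ht):
        order = np.argsort(-self.w)                # Eq. (9): first L models cover T_L of weight
        L = int(np.searchsorted(np.cumsum(self.w[order]), self.TL)) + 1
        db = np.sqrt(self.h[order[:L]] * ht).sum(axis=1)
        return bool(db.max() > self.TD)
```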
3 Results
Fig. 4 shows the qualitative evaluation of the performance of the proposed algorithm at different time intervals over a 24 hr period from early morning to late night. The sample frames are in the odd columns while the corresponding segmentation results are in the even columns. The sample images depict numerous challenging scenarios such as the lack of color information during the early morning and night period, the presence of raindrops in a rainy scenario which contributes to the higher dynamicity of the background, and the presence of strong shadows during the day. In [11] and [12], the authors compared the performance of their proposed LBP texture modeling method with some of the state-of-the-art methods presented in Section 1 of this paper such as MoG and Mason's Model. They highlighted that their algorithm gave much better performance visually and numerically than the current existing popular methods. Results of our proposed method were compared visually against the results of Heikkilä et al.'s method using the same data sets to gauge the performance of our proposed algorithm against theirs. In order for the comparison to be valid, the thresholds and the learning rate values were kept the same for both methods. The values for the parameters used in our experiments are given in Table 2. T_D and ω_k were determined through experimental simulations on the data. For computation of the LBP codes and histograms, the size of the image block was 11 × 11 and the LBP histogram consists of 16 bins. A total of fifty clean images from each data set were used for learning. The image resolution was 640 × 480 pixels. Fig. 5 shows the visual results between our method and the method proposed by Heikkilä et al. In the first 2 rows, an early morning scenario is depicted,

Table 2. Parameter values. The parameter values for the Local Binary Pattern texture measure are the same for both algorithms.

Parameter  Value    Parameter  Value    Parameter  Value
ω0         0.40     K          6        M          5
ω1         0.30     i          2        TD         0.77
ω2         0.15     σ          1        TL         0.90
ω3         0.10     B          4        γb         0.015
ω4         0.05     R          2        γw         0.015
[Fig. 5 panels, left to right: Original Image; Results of Spatial Image Local Binary Pattern Texture Modeling; Results of Difference of Gaussian Edge-Texture Based Modeling]
Fig. 5. Results of Performance of Proposed Algorithm against Heikkilä et al.'s method in Adverse Traffic Conditions
characterized by low color information and a mild watery road surface. The motorcycle in the 1st row is not detected with Heikkilä et al.'s method while it is detected by the proposed algorithm. The lorry in the 2nd row is split into 2 blobs in Heikkilä et al.'s method, which also includes the shadow, leading to a false positive. This is not so in the proposed algorithm. A rain scenario is depicted in the 3rd row. The raindrops appear as noise in the image and cause the background processes to be more dynamic than they should be. The foreground objects cannot be detected using Heikkilä et al.'s method. A night time scenario with low color information and varying illumination throughout the background is presented in the 4th row. The bus cannot be detected completely using Heikkilä et al.'s method, leading to a false classification. The bus is detected completely by the proposed algorithm. Another night time scenario characterized by low color information, strong headlights, and darkness is presented in rows 5 - 7. In row 5, the headlights from the car are detected in Heikkilä et al.'s method, leading to a false positive. In row 6, the headlights from the car in the opposite lanes cause shadows and illumination changes in the region-of-interest, leading to a false detection using Heikkilä et al.'s method. In row 7, the black car on the left side of the image and part of the car in the center lane fail to be detected in Heikkilä et al.'s method. These are all corrected in the proposed algorithm. The data sets consist of night time, early morning and periods of heavy rain videos. The night time and early morning videos were almost devoid of any color information in the background or foreground. The videos also contained a strong presence of camera noise, which was highlighted by the absence of any color information in the scene. The videos of heavy rain conditions contained strong spurious noise in the form of random raindrops. All these scenarios exhibited high dynamicity in the background and had strong noise present. The proposed algorithm in this paper has the ability to work robustly through these conditions, unlike the spatial LBP texture modeling method proposed by Heikkilä et al. The performance of the proposed algorithm could be further improved if the thresholds were tailored for each scenario. However, this is not practical in a 24-hour surveillance network.
4 Conclusion
In this paper, we proposed an alternative form of background modeling and foreground detection using Difference of Gaussian (DoG) edge detectors at different scales summed to form a single Added DoG image as a representation of the edgetexture information contained in the scene. The edge-texture is measured using Local Binary Pattern (LBP) by dividing the Added DoG image into overlapping image blocks and computing the LBP histogram of the LBP codes for the block. Foreground detection is achieved by comparing histograms of the background models against the current histogram. Our method exploits the relationship an edge pixel has with its neighboring pixels, thereby learning the spatial structure of an object in the scene. Moreover, the algorithm is not completely reliant on color information for modeling, detection and classification. Hence, our method works robustly in scenarios where color information is minimal. Furthermore, using DoG edge detectors, edges of all orientations are obtained, giving more meaning to the edge-texture.
Heikkilä et al. had already proven in [11] and [12] that their LBP Texture Measure Modeling method performed better than the current state-of-the-art methods discussed in Section 1. Therefore, we chose to compare our algorithm against theirs for performance benchmarks. The experimental results show that the developed algorithm is robust in dynamic traffic scenarios such as night time conditions or periods of heavy rain compared to Heikkilä et al.'s method. It is tolerant to strong illumination changes such as headlights of cars that project onto the road and is able to model the dynamicity of the background more efficiently, enabling it to work in situations where there is strong noise in the scene. We plan to extend the algorithm to cover multi-camera views of the same scene to handle occlusions and to develop better tracking algorithms to track objects successfully through occlusions.
References 1. Wren, C., Azarbayejani, A., Darell, T., Pentland, A.: Pfinder: Real-time tracking of the human body. IEEE Trans. Pattern Anal. Machine Intell. 19, 780–785 (1997) 2. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: Proceedings of Computer Vision and Pattern Recognition, vol. 2, pp. 246–252 (1999) 3. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Machine Intell. 22, 747–757 (2000) 4. Haritaoglu, I., Harwood, D., Davis, L.: W4 : Real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Machine Intell. 22, 809–830 (2000) 5. Elgammal, A., Harwood, D., Davis, L.: Non-parametric model for background subtraction. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1843, pp. 751–767. Springer, Heidelberg (2000) 6. Elgammal, A., Harwood, D., Davis, L.: Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE 90, 1151–1163 (2002) 7. McKenna, S., Jabri, S., Duric, Z., Wechsler, H.: Tracking interacting people. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 348–353 (2000) 8. Kiratiratanapruk, K., Dubey, P., Siddhichai, S.: A gradient-based foreground detection technique for object tracking in a traffic monitoring system. In: Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 377–381 (2005) 9. Noriega, P., Bernier, O.: Real time illumination invariant background subtraction using local kernel histograms. In: Proceedings of the British Machine Vision Conference, vol. 3, pp. 979–988 (2006) 10. Mason, M., Duric, Z.: Using histograms to detect and track objects in color video. In: Proceedings of the 30th on Applied Imagery Pattern Recognition Workshop, vol. 3, pp. 154–162 (2001) 11. Heikkil¨ a, M., Pietik¨ ainen, M., Heikkil¨ a, J.: A texture based method for detecting moving objects. In: Proceedings of the British Machine Vision Conference, vol. 1, pp. 187–196 (2004) 12. Heikkil¨ a, M., Pietik¨ ainen, M.: A texture-based method for modeling the background and detecting moving objects. IEEE Trans. Pattern Anal. Machine Intell. 28, 657–662 (2006)
A Sketch-Based Approach for Detecting Common Human Actions

Evan A. Suma, Christopher Walton Sinclair, Justin Babbs, and Richard Souvenir

Department of Computer Science, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, NC 28223, U.S.A
{easuma,cwsincla,jlbabbs,souvenir}@uncc.edu
Abstract. We present a method for detecting common human actions in video, common to athletics and surveillance, using intuitive sketches and motion cues. The framework presented in this paper is an automated end-to-end system which (1) interprets the sketch input, (2) generates a query video based on motion cues, and (3) incorporates a new content-based action descriptor for matching. We apply our method to a publicly-available video repository of many common human actions and show that a video matching the concept of the sketch is generally returned in one of the top three query results.
1 Introduction
Automated human activity detection from video could be used for searching archived athletic footage or detecting particular actions in a real-time security setting. However, this type of search is still an open, challenging problem. Commercial solutions (e.g., Google Video) typically employ search methods which do not operate on the content of the video; instead, a text query is matched to metadata of the video such as the title, description, or user comments. The possibility of incomplete or incorrect metadata is a well-known limitation to this approach. This leads to a host of methods that fall under the umbrella of content-based video retrieval (CBVR). The literature on CBVR is extensive; see [1] and [2] for surveys. Multiple taxonomies exist for the classification of CBVR approaches. Most relevant to our work are two broad classes of techniques characterized by the query method: (1) text-based (or concept-based) approaches and (2) example-based approaches. Text-based approaches, such as [3], typically rely on some (semi-supervised or unsupervised) step of grouping videos together based on some concept and refining the search within each cluster to obtain the desired result. Example-based approaches, as done in [4], typically match features of a query video against those in the database and return high-scoring matches. Text-based approaches work when the content of the videos can be described succinctly. Also, a user can attempt to fine-tune search results simply by selecting new keywords to try. However, these methods fail when the query is ambiguous
(e.g., ”driving” for cars versus swinging a golf club). Additionally, secondary objects may be overlooked if classification is based upon the primary object or action in the video. Example-based approaches can overcome the limitation of ambiguous searches because a query video is generally more informative than a text label. However, finding representative videos to use for querying other videos can be difficult. More specifically, if a video strongly matching a search concept were easily obtainable, it might not be necessary to perform the query in the first place. In this paper, we present a method for querying databases of videos of human actions. Our method relies on using sketches of objects and motion cues as queries for video retrieval. We believe that allowing the user to provide the input in this manner combines the best features of both the text- and example-based approaches. The search query is more informative than a text-based approach and does not require that the user explicitly provide an example video. The framework presented in this paper is an automated end-to-end system which (1) interprets the sketch input, (2) generates a query video based on motion cues, and (3) incorporates a new content-based action descriptor for matching. Sketch recognition is a related area where the goal is to infer the semantics of an input sketch. Unlike those methods (e.g., [5]), we are not interested in classifying the action represented in the sketch, nor do we need to collect the gesture information associated with creating the sketch. Our goal is to search a video database or stream for a conceptual match. Searching image and video databases using sketches has previously been explored. In [6], the author presents a method using a sketch-based system to search a large static image database. In [7], the authors present a system which queries videos using sketches of motion cues and mainly provides for queries focusing on the translation of objects in a viewing frame and not the finer-grained articulated motions that our system is capable of matching.

Fig. 1. Example sketches generated using (a) the freehand drawing interface and (b) the human body model interface

The paper is organized as follows. In Section 2, we describe our approach for interpreting sketches. In Section 3, we explain our method for matching a query video against a database. In Section 4, we show the results of testing our method on a publicly-available human motion data set. Finally, in Section 5, we conclude with a discussion of limitations of our current approach and future directions for this work.
2 Interpreting Sketches
The query can be provided as a freehand sketch or generated from a human body model. While the former allows for the motion of arbitrary objects, the latter is more consistent and robust for human motion. In this section, we describe how we interpret the input sketches and infer the motion being described.

2.1 Input Methods
Figure 1(a) shows an example freehand sketch query. Our system includes a sketch drawing application which provides two drawing modes: (1) figure mode and (2) arrow mode. In figure mode, the user has access to various drawing tools, common to popular image editing applications, to specify the structure of the figure. In arrow mode, the user can click and drag on the image to add a motion arrow. For representing human motion, we also provide a point-and-click interface to produce sketches by manipulating a human body model, as shown in Figure 1(b). Here, the user does not draw the figure. Instead, limbs on a human body model are positioned and resized by clicking and dragging the joint locations. For human motion queries, this method is more robust than freehand sketching because the connectivity of the skeleton, as well as the locations of joints, is already known to the system.

2.2 Interpreting Motion Cues
While we are primarily concerned with videos involving human motion, a freehand sketch could potentially contain objects other than human figures. Also, an arrow in these images can either refer to a specific component of an object (e.g., a limb on a human) or the entire object moving as a whole. Thus, we examine the object's skeleton to determine the joint locations which separate the movable components of the figure. We employ a series of basic image processing techniques to generate a skeleton from a sketch. First, the sketch is blurred using a Gaussian kernel and thresholded to produce a binary image. Then, a medial axis transformation algorithm is applied to generate a skeleton of the image. Figure 2 illustrates the process for interpreting the motion cues relative to the skeleton. First, the arrows are projected (in the reverse direction) onto the skeleton. We assume that the point of contact, or origination point, lies on the moving component being referenced by the arrow. To determine the component (which we assume to be a line segment) on which each origination point lies, the Hough transform [8] is used. In our implementation, we restrict this search to a local neighborhood that is defined by tracing n_l pixels along the skeleton from the origination point in the direction towards the center of mass. Empirically, we found that 20 pixels accurately sample the line segment in a typical 300 × 300 pixel sketch. The endpoints of the returned line segment are stored as the endpoints of the component on the skeleton. The endpoint that is closest along the skeleton towards the center of mass is the detected joint location for that component.
Fig. 2. (a) A freehand sketch with arrows indicating movement. (b) The image after skeletonization with arrows projected back onto the skeleton. (c) The detected line segments and joint locations.
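For concreteness, the following minimal Python sketch (ours, not the authors' implementation) illustrates the preprocessing just described: Gaussian blur, thresholding, medial-axis skeletonization, and a local Hough fit around an arrow's origination point. The library choices (OpenCV, scikit-image), the Otsu threshold, and the square-window approximation of the traced neighborhood are assumptions.

```python
import cv2
import numpy as np
from skimage.morphology import medial_axis

def skeletonize_sketch(sketch_gray, sigma=2.0):
    """Blur, threshold, and medial-axis-transform a grayscale sketch
    (dark strokes on a white background, uint8)."""
    blurred = cv2.GaussianBlur(sketch_gray, (0, 0), sigma)
    _, binary = cv2.threshold(blurred, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    skeleton = medial_axis(binary > 0)             # boolean one-pixel-wide skeleton
    return skeleton.astype(np.uint8) * 255

def local_line_segment(skeleton_u8, origin_xy, window=20):
    """Fit the dominant line segment in a small window (a square patch standing
    in for tracing n_l pixels along the skeleton) around an arrow's origination
    point, using the probabilistic Hough transform."""
    x, y = origin_xy
    patch = skeleton_u8[max(y - window, 0):y + window,
                        max(x - window, 0):x + window]
    lines = cv2.HoughLinesP(patch, rho=1, theta=np.pi / 180, threshold=10,
                            minLineLength=window // 2, maxLineGap=3)
    return None if lines is None else lines[0][0]  # (x1, y1, x2, y2) in patch coords
```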
If the line segment contains the center of mass, we interpret this to mean that the arrow motion corresponds to a movement of the entire object. For sketches produced by manipulating the human body model, interpreting the arrows is simplified since the connectivity of the skeleton and the locations of joints are known. The origination points are calculated in a similar manner by projecting each arrow in reverse until the skeleton is contacted.

2.3  Sketch Animation
The next step is to generate a video based on the sketch and motion cues to use as a search query. First, the image is segmented into individual components at the joint locations. This is done so that individual components can be translated and rotated according to the motion cues. At each joint, the image is "cut" along the normal to the skeleton, separating the component from the rest of the object. In complex skeletons such as human figures, components may be the children of other components (e.g., forearm and upper arm). To account for this possibility, the parent of each component, if any, is calculated by tracing from the joint location along the skeleton towards the center of mass to detect other joints. For images generated from a human body model, the parent-child relationship between joints is already known. The angle of component rotation is determined by computing the angular difference between the component vector and the vector formed by the joint location and the arrow end point. For each frame, each component is rotated by n_r degrees, where n_r is the total angular rotation divided by the (user-specified) number of frames. For complex motions, where both a child and a parent component move, we chose to rotate the child component first, then add this to the rotation of the parent component. Figure 3 shows such an example. Though the motion sequence may not be visually pleasing, the imperfections in the generated video will not significantly affect the matching process.
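A minimal sketch of this per-frame rotation is given below; it is our illustration under the assumption that each component has been cut out as an image patch with a known joint location, and that a parent's rotation is passed in as an additional angle to be composed with the child's.

```python
import cv2
import numpy as np

def animate_component(component_img, joint_xy, total_deg, n_frames,
                      parent_total_deg=0.0):
    """Yield the component image rotated about its joint for each frame.
    The per-frame increment is n_r = total rotation / number of frames; a child
    component's rotation is composed with (added to) its parent's rotation."""
    h, w = component_img.shape[:2]
    for k in range(1, n_frames + 1):
        angle = k * (total_deg + parent_total_deg) / n_frames
        M = cv2.getRotationMatrix2D((float(joint_xy[0]), float(joint_xy[1])),
                                    angle, 1.0)
        yield cv2.warpAffine(component_img, M, (w, h))
```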
Fig. 3. (top) A freehand input sketch with multiple arrows and frames from the generated video sequence. (bottom) A sketch generated from a human body model and the corresponding video sequence.
3  Matching Videos
For this problem, it is important to model the content of the video rather than the appearance, since sketches do not share appearance characteristics with real video. We extend a recently developed shape descriptor, the R transform [9], into a motion descriptor. Compared to competing representations, the R transform is computationally efficient and robust to common image transformations. Here, we describe the R transform and our extension for matching video sequences.

3.1  R transform
The R transform was developed as a shape descriptor for object classification from images. The R transform converts a silhouette image to a compact 1D signal using the two-dimensional Radon transform. The Radon transform, like the Hough transform, is commonly used to find lines in images. For an image I(x, y), the Radon transform, g(ρ, θ), using polar coordinates (ρ, θ), is defined as:

g(ρ, θ) = Σ_x Σ_y I(x, y) δ(x cos θ + y sin θ − ρ),   (1)

where δ is the Dirac delta function. Intuitively, g(ρ, θ) is the line integral through image I of the line with parameters (ρ, θ). The R transform extends the Radon transform by calculating the sum of the squared Radon transform values for all lines of the same angle, θ, in an image:

R(θ) = Σ_ρ g²(ρ, θ).   (2)
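As an illustration only (not the authors' code), Eqs. (1)-(2) can be computed with the Radon transform from scikit-image; the optional normalization corresponds to Eq. (3) introduced later.

```python
import numpy as np
from skimage.transform import radon

def r_transform(silhouette, n_angles=180, normalize=True):
    """R transform of a binary silhouette: R(theta) = sum_rho g(rho, theta)^2,
    where g is the Radon transform of Eq. (1); optional scaling follows Eq. (3)."""
    thetas = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    g = radon(silhouette.astype(float), theta=thetas, circle=False)  # (n_rho, n_angles)
    r = np.sum(g ** 2, axis=0)                                       # Eq. (2)
    return r / r.max() if normalize else r
```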
Figure 4 shows an example image, the derived silhouette showing the segmentation between the actor and the background, and the R transform.
Fig. 4. An image (a) is converted into a silhouette (b) to which R (c) is applied
Fig. 5. A set of silhouette keyframes from a video of an actor performing a kick action. The corresponding R transform curve is shown below each keyframe. The graph on the right shows the R transform histogram motion descriptor for the video.
The R transform has several properties that make it particularly useful as a motion descriptor for a sequence of silhouettes. First, the transform is translation-invariant. Translations of the silhouette do not affect the value of the R transform, which allows us to match images of the same action regardless of the position of the actor in the frame. Second, the R transform has been shown to be robust to noisy silhouettes (e.g., holes, disjoint silhouettes). This invariance is useful to our method in that extremely accurate segmentation of the actor (in the real videos) from the background is not necessary. Third, when normalized, the R transform is scale-invariant. Scaling the silhouette image results in an amplitude scaling of R, so we use the normalized transform:

R′(θ) = R(θ) / max_θ′ R(θ′).   (3)

3.2  R transform Histograms
The R transform has been previously extended for use in action recognition. In [10], the authors trained Hidden Markov Models to learn which sets of unordered R transform curves corresponded to which action, and in [11] the authors extend the R transform to include the natural temporal component of actions by concatenating sequential R transform curves into an R transform surface. The motion descriptor presented here combines ideas from these two approaches.
Fig. 6. (a) The R transform histogram from a sketch video of a pointing motion. (b) The R transform histogram from a real video of an actor performing a pointing motion. (c) The R transform histogram from a real video of an actor performing a standing motion.
The R transform can be applied to a single silhouette frame. A set of frames, therefore, can generate a set of R transform curves. The problem of matching action videos then becomes a problem of matching these sets of curves. For our representation, we maintain a 2D histogram of the R transform data. We discretize the 2D space of R transform curves into 180 (angles) × 20 (R′(θ) values) bins. Figure 5 shows the motion descriptor for a video of an actor kicking. The top row shows 4 silhouette keyframes and the bottom row shows the associated R transform for each frame. The graph on the right shows our R transform histogram motion descriptor for this video.
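The 2D histogram descriptor can be sketched as follows; this is our illustration, and the normalization by the number of frames is an assumption rather than a detail stated in the paper.

```python
import numpy as np

def r_transform_histogram(r_curves, n_value_bins=20):
    """Accumulate per-frame normalized R transform curves (each of length 180,
    values in [0, 1]) into the 180 x 20 histogram motion descriptor."""
    r_curves = np.asarray(r_curves)          # shape: (n_frames, n_angles)
    n_frames, n_angles = r_curves.shape
    hist = np.zeros((n_angles, n_value_bins))
    bins = np.minimum((r_curves * n_value_bins).astype(int), n_value_bins - 1)
    for frame_bins in bins:
        hist[np.arange(n_angles), frame_bins] += 1
    return hist / n_frames
```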
3.3  Matching R transform Histograms
Figure 6 shows (a) the R transform histogram for a generated sketch video of a pointing motion, (b) the R transform histogram from a real video of the same action, and (c) a different R transform histogram from a real video of an actor performing a standing motion. On visual inspection, the histograms of the same motions, despite being from both sketches and real videos, appear more similar than the histograms from different motions. To quantify these differences, we employ a histogram-based distance metric. We use the 2D diffusion distance metric [12], which approximates the Earth Mover's Distance [13] between histograms. This computationally efficient metric formulates the problem as a heat diffusion process by estimating the amount of diffusion from one distribution to the other. In Section 4, we demonstrate that this distance metric is robust to individual variations in action videos and can be used as the basis of a simple nearest-neighbor classifier to discriminate between dissimilar actions.
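A hedged sketch of the diffusion distance [12] follows: the L1 norm of the difference histogram is summed over a Gaussian pyramid. The number of layers and the smoothing sigma are our choices, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def diffusion_distance(h1, h2, n_layers=5, sigma=1.0):
    """Diffusion distance between two 2D histograms: sum of L1 norms of the
    difference histogram over a Gaussian pyramid, an efficient approximation
    of the Earth Mover's Distance."""
    d = np.asarray(h1, dtype=float) - np.asarray(h2, dtype=float)
    dist = np.abs(d).sum()
    for _ in range(n_layers):
        d = gaussian_filter(d, sigma)[::2, ::2]   # smooth, then downsample by 2
        dist += np.abs(d).sum()
    return dist
```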
4  Results
We used the Inria XMAS Motion Acquisition Sequences (IXMAS) dataset [14] to test our method. This data contains various actors performing multiple, different actions. We tested the system using 10 actions (check watch, cross arms, sit, stand, wave, punch, kick, point, pick up, and throw) from the set. For each action, we generated a sketch and calculated the matching score to all of the action videos. Figure 7 shows sample results for 3 of these types of queries. Each row shows the input sketch and a keyframe from each of the top 3 closest matching videos from the database. Table 1 summarizes the results. Each cell contains the distance between a user-generated sketch (rows) and a labeled video clip of an actor from the database (columns). (Lower values indicate a closer match.) For 9 out of 10 sketch queries, the intended video was one of the top three scoring matches to our input sketches. Some of the errors (e.g., punch-throw, cross arms-wave) are due simply to the similarity of these actions. For other actions (e.g., check watch), the self-occlusions inherent in the motion lead to ambiguity in the silhouette-based motion descriptor and unpredictable results.

Fig. 7. Sample results from 3 queries on the IXMAS dataset. For each query, the input sketch and keyframes from the 3 top scoring matches are shown (kick: kick, punch, point; wave: wave, check watch, cross arms; punch: throw, point, punch).
Table 1. Query Results. Each cell contains the distance between a user-generated sketch and a labeled video clip from the IXMAS data set.
Sketch   Watch  Cross  Sit    Stand  Wave   Punch  Kick   Point  Take   Throw
Watch    3.35   3.17   3.20   3.62   2.58   2.51   3.00   3.02   3.45   3.82
Cross    2.83   1.79   1.73   3.00   1.71   1.91   2.02   2.10   3.18   3.52
Sit      1.68   3.14   3.17   1.88   2.80   2.14   1.91   2.41   1.72   2.30
Stand    1.84   3.38   3.43   1.96   3.09   2.48   2.18   2.62   1.86   2.26
Wave     2.57   1.39   1.53   2.49   1.32   2.18   1.71   1.82   2.94   3.16
Punch    2.51   3.06   2.98   2.22   2.80   2.08   2.30   2.00   2.25   1.94
Kick     2.02   2.59   2.64   2.03   2.10   1.73   1.67   1.91   2.12   2.44
Point    2.93   2.63   2.76   2.54   2.53   2.58   2.61   2.05   3.08   2.59
Take     2.81   3.22   3.23   3.19   3.21   3.32   3.06   3.38   2.71   3.64
Throw    2.80   3.88   3.90   2.41   3.63   2.98   3.02   3.06   2.44   2.06
5  Discussion and Future Work
We presented a method for the detection of human actions based on sketch input. Sketches are an intuitive input method, but can be flawed or, worse, not representative of the user's intended query. This introduces an additional level of ambiguity compared to text-based approaches because it depends on the ability of the user. For our purposes, it makes the results somewhat harder to interpret, as the residuals may be due to dissimilar actions or to poorly sketched inputs. The method introduced in this paper aims to represent 3D motions with 2D sketches. While a sketch can represent motion that varies with multiple degrees of freedom, this method is limited to succinct, atomic actions and is restricted in viewpoint. It may be possible to overcome some of these limitations by developing more sophisticated (but perhaps less intuitive) representations for common human actions. However, we feel that our approach can still be useful in domains such as athletics and surveillance, where large quantities of video data contain examples which can be succinctly described by simple motion cues. Finally, a technical limitation of our current approach is that it requires binary segmentation of the videos in the search database to generate the motion descriptor. We have also tried this approach using a different descriptor [15] and a matching method [16] which does not require segmentation; however, that alternative is far more computationally intensive than our current approach, and we are exploring ways to optimize both methods for use on real-time video feeds.
References

1. Lew, M.S., Sebe, N., Djeraba, C., Jain, R.: Content-based multimedia information retrieval: State of the art and challenges. ACM Trans. Multimedia Comput. Commun. Appl. 2, 1–19 (2006)
2. Marchand-Maillet, S.: Content-based video retrieval: An overview. Technical Report 00.06, CUI - University of Geneva, Geneva (2000)
3. Naphade, M.R., Huang, T.S.: Semantic video indexing using a probabilistic framework. ICPR 03, 3083 (2000)
4. Taskiran, C., Chen, J.Y., Albiol, A., Torres, L., Bouman, C., Delp, E.: Vibe: a compressed video database structured for active browsing and search. IEEE Transactions on Multimedia 6, 103–118 (2004)
5. Paulson, B., Hammond, T.: Marqs: retrieving sketches learned from a single example using a dual-classifier. Journ. on Multimodal User Interfaces 2, 3–11 (2008)
6. Lew, M.: Next-generation web searches for visual content. Computer 33, 46–53 (2000)
7. Chang, S.F., Chen, W., Meng, H.J., Sundaram, H., Zhong, D.: Videoq: an automated content based video search system using visual cues. In: Fifth ACM Intnl Conference on Multimedia, pp. 313–324. ACM, New York (1997)
8. Duda, R.O., Hart, P.E.: Use of the hough transformation to detect lines and curves in pictures. Commun. ACM 15, 11–15 (1972)
9. Tabbone, S., Wendling, L., Salmon, J.P.: A new shape descriptor defined on the radon transform. Comput. Vis. Image Underst. 102, 42–51 (2006)
10. Wang, Y., Huang, K., Tan, T.: Human activity recognition based on r transform. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
11. Souvenir, R., Babbs, J.: Learning the viewpoint manifold for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7 (2008)
12. Ling, H., Okada, K.: Diffusion distance for histogram comparison. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 246–253 (2006)
13. Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications to image databases. In: Proc. Intnl Conference on Computer Vision, pp. 59–66 (1998)
14. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104, 249–257 (2006)
15. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
16. Leibe, B., Leonardis, A., Schiele, B.: Combined object categorization and segmentation with an implicit shape model. In: ECCV Workshop on Statistical Learning in Computer Vision, Prague, Czech Republic, pp. 17–32 (2004)
Multi-view Video Analysis of Humans and Vehicles in an Unconstrained Environment

D.M. Hansen1, P.T. Duizer1, S. Park2, T.B. Moeslund1, and M.M. Trivedi2

1 Computer Vision and Media Technology, Aalborg University, Denmark
2 Computer Vision and Robotics Research, University of California, San Diego
Abstract. This paper presents an automatic visual analysis system for simultaneously tracking humans and vehicles using multiple cameras in an unconstrained outdoor environment. The system establishes correspondence between views using a principal axis approach for humans and a footage region approach for vehicles. Novel methods for locating humans in groups and solving ambiguity when matching vehicles across views are presented. Foreground segmentation for each view is performed using the codebook method and HSV shadow suppression. The tracking of objects is performed in each view, and occlusion situations are resolved by probabilistic appearance models. The system is tested on hours of video and on three different datasets.
1  Introduction
Visual analysis of motion has received much interest in recent years for its practical importance in various application areas [1,2]. The majority of work on visual analysis has focused on either monitoring humans or vehicles, but the amount of work focusing on monitoring humans and vehicles simultaneously is relatively small. The simultaneous tracking of humans and vehicles is often desirable for enhanced situational awareness in real-world environments. The motions of humans and vehicles may involve different characteristics, which may require different analysis techniques. This paper presents an automatic visual surveillance system for simultaneously tracking humans and vehicles in unconstrained outdoor environments by using multiple cameras. Our system uses overlapping-view cameras to exploit the merits of a multi-view approach and extract view-invariant features of objects (i.e., persons and vehicles), such as the view-invariant size, position, and velocity of the objects in the real-world environment. Given such view-invariant information, it would be possible to develop more enhanced ambient intelligent systems and intelligent infrastructure. For example, visual surveillance systems could capture the movements of vehicles and pedestrians and predict possible collisions, as illustrated in Figure 1. Furthermore, the motion patterns of vehicles and pedestrians could be analyzed to detect abnormal behavior, e.g., drunk driving. Such detections could control the traffic lights or communicate directly with the control of the vehicle. A prerequisite for such systems is robust detection and tracking of humans and vehicles, which is the focus of this paper.
Fig. 1. Collision detection in a multi-view sequence ((a) view 1, (b) view 2, (c) virtual top-view). The trajectories are mapped to a virtual top-view for visualization purposes. A red arrow depicts the extended velocity vectors of the tracked objects, and the yellow crosses show intersections as an indication of a possible collision. The gray area in the right-most image shows the positions or footage region of the objects, calculated as the area on the ground plane covered in both views by the object. Note that only parts of the images are shown in order to increase visibility.
2  Previous Work
Many multi-view systems have been proposed and they can be categorized into two groups: disjoint camera views [3,4] vs. overlapping camera views [5,6]. While the disjoint views are effective for covering wide fields of view, the overlapping views are desirable for efficient handling of severe occlusions which inevitably occur in unconstrained urban environments. The overlapping views furthermore improve the accuracy in estimating the position and size of objects. The object matching between multiple views can be achieved by recognition-based methods [7,8] or geometry-based methods [6,9], or a combination of these [10]. When the views are not overlapping, color histograms can be compared between different views for object matching [4], while the geometry-based methods [11] are preferred when the views are overlapping. Some approaches have used full camera calibration for accuracy [6,9], whereas other, more recent works adopted homography mapping for versatility [12,13]. The full-calibration-based methods fused observations from the multiple views to produce fairly precise matching; however, the calibration is a resource-consuming process and a skill-demanding task in practice [2], especially in outdoor scenes. The homography-based methods [12,13] establish the matching between the multiple views by using the so-called 'principal axis correspondence' and epipolar geometry. Tracking can be enhanced by the particle filter or the Kalman filter [13]. The above systems aimed mainly at human tracking in multiple views. Another homography-based method uses a graph-cut algorithm to segment and track each object [14]. One of the few systems that do multi-view tracking of both humans and vehicles is [15], where objects are tracked using their footage region. Objects occupying a larger area on the ground plane result in a larger footage region in the homography domain. However, since humans are significantly smaller than
vehicles, the footage region is not robust for tracking humans. The works in [14,10] present good results using the footage region for tracking humans, but only in a constrained environment. Other systems that track both humans and vehicles are [16] and [17]. [16] calculates each object's geolocation for camera handoff, but the system requires a three-dimensional site model, which is difficult to obtain in practice. In contrast, [17] does not require camera calibration; however, this system does not use overlapping cameras and does not produce an output suitable for interaction analysis. In conclusion, the systems summarized above do not provide an efficient solution for tracking both humans and vehicles. Our proposed system combines the principal axis and footage region approaches in an integrated manner to establish a better foundation for simultaneously tracking and analyzing the interactions between humans and vehicles in real-world busy traffic scenes.
3  System Overview
The system developed in this work consists of three modules, as depicted in Figure 2. For each synchronized camera input moving objects are segmented using the foreground segmentation module (Section 4). The moving objects are tracked using the single-view tracking module (Section 5). Following this, the tracks are matched across views using the multi-view correspondence (Section 6). Feedback from this module is used to improve the accuracy of the single-view tracking. The output from the system is a trajectory of each detected and tracked object. Since the system is quite comprehensive we have primarily focused on describing the novel contribution of this work, i.e., multi-view correspondence and evaluation on very long and realistic sequences.
Fig. 2. System overview: for each camera (Cam 1, Cam 2), foreground segmentation is followed by single-view tracking; the multi-view correspondence module then produces trajectory information (position, size, velocity, etc.)
4  Foreground Segmentation
The foreground segmentation allows the forthcoming modules to focus their attention on areas containing objects. An unconstrained environment presents many challenges, such as background camouflage, reflections, illumination changes, shadows, dynamic background, and changing background. The foreground segmentation is based on motion segmentation followed by shadow suppression.
Motion segmentation is based on a robust background subtraction method which dynamically updates the background model and adds new background layers. We have in earlier work [18] shown the robustness of this approach on very long and unconstrained video sequences. A problem left unsolved by the motion segmentation is moving cast shadows. Cast shadows result in false shape, size, and appearance of objects and can merge otherwise separate foreground objects. For shadow suppression, an HSV color segmentation approach based on [19] is used. However, since this method is based solely on color segmentation, it is not optimal for separating cast shadow from self shadow, which is a part of the object. We therefore enhance the method with multi-view information to reduce the number of falsely classified shadow pixels. Concretely, we consider the footage region (see Section 6.2) as potential shadow. This improves the segmentation since the footage region in many cases contains only the cast shadow and not the object. For details see [20].
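For illustration, a minimal HSV shadow test in the spirit of [19] is sketched below. It is not this system's implementation, and the threshold values are placeholders rather than the parameters used by the authors.

```python
import cv2
import numpy as np

def shadow_mask(frame_bgr, background_bgr, fg_mask,
                alpha=0.4, beta=0.95, tau_s=60.0, tau_h=50.0):
    """Mark foreground pixels as cast shadow: a darkened value ratio, a small
    saturation difference, and a small hue difference w.r.t. the background."""
    hsv_f = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv_b = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    h_f, s_f, v_f = cv2.split(hsv_f)
    h_b, s_b, v_b = cv2.split(hsv_b)
    ratio = v_f / (v_b + 1e-6)
    dh = np.abs(h_f - h_b)
    dh = np.minimum(dh, 180.0 - dh)               # hue is circular in OpenCV (0-179)
    shadow = ((ratio >= alpha) & (ratio <= beta) &
              (np.abs(s_f - s_b) <= tau_s) & (dh <= tau_h))
    return shadow & (fg_mask > 0)
```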
5  Single-view Tracking
The single-view tracking serves two purposes: 1) to classify objects as either human or vehicle and 2) to track each single object through the scene. We classify an object as either a human or a vehicle based on how steep the vertical projection of the foreground mask is, using a spread measure from [12]. Though this measure is not view invariant, a threshold of 0.08-0.10 is sufficient to separate humans from vehicles. Furthermore, as in [12], humans walking in a group are split using two thresholds to find peaks and valleys in the vertical projection histogram. The detected object is split at each valley point (a code sketch of this split is given below). The tracking of objects is, whenever possible, done using the bounding boxes together with a Kalman filtering approach and some smoothing constraints; see [20] for details. To handle more complicated cases like occlusion, we use an approach based on probabilistic appearance models [21]. Each track has its own probabilistic appearance model, which consists of an RGB color model with an associated probability mask. The use of probabilistic appearance models can be viewed as weighted template matching, where the template is an appearance model and the weights are given by the associated probability mask. The coordinates of the model are normalized to the object centroid. For each new track, a new probabilistic appearance model is created. A track refinement step is applied before updating the model at each match by finding the best fit in a small neighborhood, e.g., 5 × 5 pixels. Track refinement increases the accuracy of the model. When updating, the model usually stabilizes after half a second at 15 fps. Details on building and applying the model can be found in [21].
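A sketch of the vertical-projection split is given below; it is our illustration of the idea from [12], and the two relative thresholds are assumptions, not the values used in this system.

```python
import numpy as np
from scipy.signal import find_peaks

def split_group(fg_mask, peak_frac=0.6, valley_frac=0.3):
    """Return the columns at which a detection's foreground mask should be
    split, based on peaks and valleys of its vertical projection histogram."""
    proj = fg_mask.astype(np.uint8).sum(axis=0).astype(float)  # column-wise counts
    if proj.max() == 0:
        return []
    peaks, _ = find_peaks(proj, height=peak_frac * proj.max())
    cuts = []
    for a, b in zip(peaks[:-1], peaks[1:]):
        valley = a + int(np.argmin(proj[a:b]))
        if proj[valley] < valley_frac * proj.max():
            cuts.append(valley)       # split the blob at this column
    return cuts
```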
6  Multi-view Correspondence
The correspondence module matches objects tracked within the overlapping region of the cameras. Multi-view correspondence is needed to handle difficult occlusion
cases, e.g., during full occlusion or at initialization, where the probabilistic appearance models are not reliable. Since humans and vehicles are very different object types, two separate procedures are followed.

6.1  Correspondence of Humans
A principal axis is a vertical line in the image domain, which is fitted to each tracked human using least median of squares. Assuming a homography between views, the principal axes can be mapped from one view to the other. Analyzing the intersections between the principal axes in one view and the principal axes mapped into this view allows for finding correspondences between views [12]. With this approach two or more human objects are often tracked as one object, especially when people enter the scene as a group, even though they are separable in one view. Other works like [21] and [12] "wait" until a group splits before tracking individuals. However, based on observations, people that enter as a group are most likely to stay as a group while moving through the scene. To enable correspondence of individuals in groups, a novel correspondence algorithm is developed in this work, which is executed after the correspondence algorithm from [12]. The new algorithm locates unpaired tracks in view 1 and tests if they match an already paired track in view 2. If the matching satisfies two distance constraints, the unpaired track in view 1 is paired with the paired track in view 2. The steps of the algorithm are explained in the following, with Figure 3 as illustration.
Fig. 3. Resolving the problem of a group being tracked as a single object in view 2. See text for explanation. Note that only parts of the images are shown in order to increase visibility.
1. A list (θ_1^unpaired) of all unpaired principal axes in view 1 is created, and a list (θ_2^paired) of all paired principal axes in view 2 is created.
2. A principal axis L_1^m in θ_1^unpaired is selected and compared with all principal axes in θ_2^paired, in this case L_2^n. For each comparison two distances are calculated:
   Distance 1: L_2^n is mapped from view 2 into view 1 (L_21^n) and the intersection point q_21^mn is found. The distance from the intersection point to person m's ground point location is found as D1 = |q_21^mn − x_1^m|. The distance is calculated this way for comparison with distance 2; the two distances are summed in step 3.
   Distance 2: L_1^m is mapped from view 1 into view 2 (L_12^m) and the intersection point q_12^nm is found. Furthermore, the ground point location x_1^m is also mapped from view 1 into view 2, giving y_2^m. The distance from the intersection point to the mapped ground point is found as D2 = |q_12^nm − y_2^m|.
3. If D1 < D_T1 and D2 < D_T2, then n, along with the score D1 + D2, is stored in the list Λ_m. D_T1 and D_T2 are selected empirically.
4. Select the person n with the smallest score in Λ_m and label the pair (m, n).
5. Remove L_1^m from θ_1^unpaired. Go to step 2 and repeat the procedure with a different m until θ_1^unpaired is empty.

The algorithm is executed a second time with the order of view 1 and view 2 switched. Compared to [12] this is a greedy algorithm and not a global algorithm. The distance D2 is applied because it is expected that the mapped ground point should be close to the intersection of the principal axes in view 2. Using the new correspondence algorithm it is possible to find the ground point location of a human within a group, even when it is tracked as a single object. This is fed back to the single-view tracking module where it is used to correct the track, especially the lower part of the bounding box and hence the centroid. An example is provided in Figure 4, and a code sketch of the two distance computations is given after the figure.
Fig. 4. Tracking of occluded person. Despite being occluded by the vehicle, the person’s ground point is correctly located. View 1 is zoomed in on the person. A purple vertical line is a principal axis. The black box and the colored box indicate the bounding box location before and after performing correspondence.
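The following Python sketch (ours, not the authors' implementation) shows how D1 and D2 could be computed for one candidate pair, given the principal axes as point pairs, the ground point x_1^m, and the planar homographies between the views. The helper names and the homography naming convention (H21 maps view 2 to view 1, H12 the reverse) are assumptions.

```python
import numpy as np

def map_point(H, p):
    """Map a 2D point (numpy array) with a 3x3 planar homography H."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

def intersect(p1, p2, p3, p4):
    """Intersection of the lines through (p1, p2) and (p3, p4)."""
    d1, d2 = p2 - p1, p4 - p3
    t = np.linalg.solve(np.column_stack([d1, -d2]), p3 - p1)
    return p1 + t[0] * d1

def pair_distances(axis1, ground1, axis2, H21, H12):
    """D1 and D2 for one candidate pair (m, n): axis1 = L_1^m and axis2 = L_2^n
    are point pairs, ground1 = x_1^m."""
    a, b = map_point(H21, axis2[0]), map_point(H21, axis2[1])   # L_21^n
    q21 = intersect(axis1[0], axis1[1], a, b)                   # q_21^mn
    D1 = np.linalg.norm(q21 - ground1)
    c, d = map_point(H12, axis1[0]), map_point(H12, axis1[1])   # L_12^m
    q12 = intersect(axis2[0], axis2[1], c, d)                   # q_12^nm
    D2 = np.linalg.norm(q12 - map_point(H12, ground1))          # compare with y_2^m
    return D1, D2
```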
6.2  Correspondence of Vehicles
The correspondence of vehicles utilizes the footage region, as in the work of [15] and [14]. However, [15] and [14] do not apply single-view tracking; both methods map all foreground pixels into a common domain and perform tracking in this domain.
Fig. 5. Finding overlap between vehicle tracks in view 1 and view 2: (a) input frame; (b) foreground pixels; (c) mapped and overlapping pixels. If there are overlapping pixels (marked by yellow), the tracks are a possible pair.
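A minimal sketch of how such an overlap matrix could be built is given below; it is our illustration under the assumption that binary per-track masks and the view-1-to-view-2 homography H12 are available.

```python
import cv2
import numpy as np

def overlap_matrix(masks_view1, masks_view2, H12, shape2):
    """Count overlapping pixels between every pair of vehicle tracks: each
    track's foreground mask from view 1 is warped into view 2 with the planar
    homography H12 and intersected with the view-2 masks."""
    M = np.zeros((len(masks_view1), len(masks_view2)), dtype=int)
    for i, m1 in enumerate(masks_view1):
        warped = cv2.warpPerspective(m1.astype(np.uint8), H12,
                                     (shape2[1], shape2[0]))   # dsize = (width, height)
        for j, m2 in enumerate(masks_view2):
            M[i, j] = int(np.count_nonzero((warped > 0) & (m2 > 0)))
    return M
```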
By tracking in each view we are able to maintain an object's track even when it is missing in one of the views, and still maintain the benefit of improved accuracy when the object is detected in both views. Our approach has two main steps: first, to create an overlap matrix and, second, to solve any ambiguity expressed in the overlap matrix. The overlap matrix is created by mapping all foreground pixels belonging to a vehicle track from view 1 into view 2 and finding overlap with any foreground pixels of a vehicle track in view 2. The principle is illustrated in Figure 5, where a vehicle is visible in both views. In the overlap matrix three scenarios are possible: one-to-one, many-to-one, and many-to-many. One-to-one is a straightforward situation with no ambiguity, see Figure 5. Many-to-many overlap occurs when two vehicles drive by each other and at some point occlude each other in one view. As a result, the foreground mask of one of the vehicles will wrongly overlap with both vehicles in the other view. Many-to-one overlap (or one-to-many) could be caused by two vehicles being tracked as one in a view, or the view with many vehicle tracks could wrongly track noise classified as a vehicle. The many-to-one and many-to-many overlaps are considered as ambiguity. When solving it, the vehicle track with the most possible overlaps is solved first, etc. In the end, only one-to-one overlaps are left. One of two methods is applied to solve this ambiguity in both the many-to-many and the many-to-one situation. The first method ensures that a historic relationship is preserved between vehicle tracks. If the vehicles have been corresponded correctly before the occlusion, the ambiguity is solved by maintaining the historic correspondence. However, it is not guaranteed that the historic relationship is available, e.g., at track initialization. In these situations a different approach based on a "plausible ground point" is applied. Taking the vehicle's centroid and a vertical line through the centroid, the plausible ground point is located at the lowest foreground pixels on this vertical line, as illustrated in Figure 6. The plausible ground point is located on the ground, but also beneath the vehicle itself.
Fig. 6. The plausible ground point can be used to solve ambiguities in the overlap matrix ((a) view 1, (b) view 2; the markers show the centroid, the plausible ground point, and the warped ground point). The many-to-many overlap occurs in this situation because the vehicle with the red centroid in view 1 occludes the lower part of the vehicle with the green centroid.
Fig. 7. Solving severe occlusion by reconstructing the shape of the occluded vehicle: (a) severe occlusion of the turquoise car; (b) foreground pixels; (c) reconstructed shape. Note that only parts of the images are shown in order to increase visibility.
When mapping this plausible ground point into the other view using the planar homography, the mapped point should therefore also be located beneath the car in this view. The mapped point is used to solve the ambiguity and is illustrated in Figure 6. The plausible ground point is only applied if the history is not reliable. After resolving the ambiguities and pairing the vehicle tracks, it might happen that a pair of vehicle tracks does not overlap even though it is historically expected. The missing pair of vehicle tracks could be caused by the entire bottom portion of the vehicle being occluded in one view. An example of this is shown in Figure 7(a), where the white van is occluding the bottom portion of the turquoise car in view 1. Before occlusion, the turquoise car has been paired correctly between views and a pairing is therefore expected. The foreground pixels assigned to the turquoise car in view 1 are shown in Figure 7(b) during occlusion. To solve this problem, the foreground mask is reconstructed from the probabilistic appearance models, which hold a memory of the shape of the turquoise car, as shown in Figure 7(c). With the reconstructed shape it is again possible to find overlap. The view-invariant representations for four vehicles are shown in Figure 8.
7  System Evaluation

The primary test of this system is performed using our own dataset, since this is much longer than public datasets. The cameras have a resolution of 352 × 240 pixels.
Fig. 8. Correct tracking of four moving vehicles, the white van is not detected since it is parked and is part of the background model. Vehicles are represented by their footage region in the virtual top-view. To avoid “holes” in the footage region caused by imperfect foreground segmentation a convex hull is fitted to the region. The point location of each vehicle is given by the centroid of the convex hull area. The noise detected in view 2 is not tracked in the virtual top-view since there is no corresponding object in view 1. Note that (c) has been cropped in order to increase visibility.
Inter-object occlusion and illumination changes frequently occur in the dataset. To ensure realistic results, the system was run continuously with the same parameters for more than 67 hours. During this time we selected six sequences covering different times of the day (at different days); in total, this gives 385 minutes of test data, containing 1351 humans and 267 vehicles.¹ If a track is missing in both or one view, or is wrongly corresponded or classified for more than 0.5 seconds, it is considered an error. With this definition the percentage of correctly tracked objects is 67.9%, with 2.3% non-existing objects being tracked. However, in most cases where a part of a track is missing, it is maintained correctly in one of the views. This makes this type of error less serious, and by ignoring it the percentage of correctly tracked objects increases to 88.2%. In contrast to single-view tracking systems, this system handles occlusion very well (see Figures 8 and 10) and locates objects with high accuracy. Figure 9 illustrates how the accuracy is improved using correspondence between views. The most significant cause of error in this test is the foreground segmentation. Because the system is tested with long sequences, the quality of the foreground segmentation is not optimal, which causes a drop in performance. This leads to wrong object classification and track initialization, especially for vehicles. The main problem with humans is that they are removed as noise when the misdetection is too severe. The most frequent cause of error in the foreground segmentation is background camouflage, as exemplified in Figure 9(c). Furthermore, the shadow suppression parameters are hard to adjust for segmentation over day-long periods.
¹ Note that the system runs in near real-time on a 3 GHz computer with 2 GB RAM. When many large objects are present the frame rate drops to around 10 Hz.
Fig. 9. (a) Correspondence disabled and (b) correspondence enabled: example of the increased accuracy gained from doing correspondence for humans. (c) Example of two persons with background-colored clothes: the RGB values of the foreground objects are very similar to the background; only the intensity sets them apart, which makes them very hard to segment correctly. Note that only parts of the images are shown in order to increase visibility.
Fig. 10. Example of tracking in the PETS'2001 dataset ((a) view 1, (b) view 2, (c) virtual top-view). Three persons enter the scene as a group. The group correspondence algorithm initializes and tracks all three persons correctly, even during occlusion by the white van. Note that groups are only given one ID, but the system still holds information about all three individuals, as seen by the trajectories in the virtual top-view. Note also that (c) has been cropped in order to increase visibility.
7.1  Test on PETS'2001 Dataset
The system is tested on Datasets 1 and 2 of the PETS'2001 dataset. In Dataset 1, the only error that occurs is a track switch in a group of three people as they walk behind a recently parked car. All remaining humans and vehicles are tracked correctly. The only error that occurs in Dataset 2 is a person that is not detected properly due to his resemblance to the cars and road in the background. Otherwise the tracking is correct. An example from the PETS'2001 dataset is shown in Figure 10. The three persons in view 2 are automatically initialized and tracked (see Figure 10(c)) using the group correspondence algorithm presented in Section 6.1. This would not be possible without the group correspondence, since the three persons are tracked as one as they enter, and the vertical projection cannot be used to separate them. Other systems that track humans using the footage region, like [15] and [14], would not be able to handle this situation, since the inter-object occlusion in both views would result in misshaped and wrongly connected footage regions. In fact this is in general the case for systems tested on this particular part of the PETS'2001 dataset, unless tracking is manually initialized.
8  Conclusion
We have presented a robust system for tracking humans and vehicles through their activities and interactions in an unconstrained outdoor environment using multiple surveillance cameras. To our knowledge, very little work has been done on traffic monitoring of both humans and vehicles, since most related work focuses on either humans or vehicles. The proposed system integrates novel versions of the principal-axis-based approach and the footage-region-based approach for simultaneously establishing the multi-view correspondence of humans and vehicles. The system robustly handles severe occlusions, efficiently locates individual humans in groups, and handles ambiguity in correspondences between vehicles. We conducted extensive experimental evaluations with long-term videos captured from unconstrained outdoor environments at different sites on different days. The experimental evaluation shows the efficacy of the system, and the proposed system can be used as a tracking module for higher-level behavior analysis of persons and vehicles, such as the estimation of collision likelihood between humans and vehicles.
References

1. Moeslund, T.B., Hilton, A., Kruger, V.: A Survey of Advances in Vision-Based Human Motion Capture and Analysis. CVIU 104(2-3), 90–126 (2006)
2. Valera, M., Velastin, S.A.: Intelligent Distributed Surveillance Systems: A Review. Vision, Image, and Signal Processing 152(2), 192–204 (2005)
3. Javed, O., Rasheed, Z., Shafique, K., Shah, M.: Tracking in Multiple Cameras with Disjoint Views. In: ICCV 2003 (2003)
4. Porikli, F., Divakaran, A.: Multi-Camera Calibration, Object Tracking and Query Generation. In: Int. Conference on Multimedia and Expo., Washington, DC (2003)
5. Stauffer, C., Tieu, K.: Automated Multi-Camera Planar Tracking Correspondence Modeling. In: CVPR 2003 (2003)
6. Mittal, A., Davis, L.S.: M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene. IJCV 51(3), 189–203 (2003)
7. Orwell, J., Remagnino, P., Jones, G.A.: Multi-Camera Colour Tracking. In: IEEE Int. Workshop on Visual Surveillance, Fort Collins, CO, USA, vol. 28(4) (1999)
8. Gilbert, A., Bowden, R.: Incremental, Scalable tracking of Objects Inter Camera. CVIU 111(1), 43–58 (2008)
9. Pflugfelder, R., Bischof, H.: People Tracking Across Two Distant Self-Calibrated Cameras. In: IEEE Conference on Advanced Video and Signal based Surveillance, London, UK (2007)
10. Fleuret, F., Berclaz, J., Lengagne, R., Fua, P.: Multi-Camera People Tracking with a Probabilistic Occupancy Map. PAMI 30(2), 267–282 (2008)
11. Calderara, S., Cucchiara, R., Prati, A.: Bayesian-Competitive Consistent Labeling for People Surveillance. PAMI 30(2), 354–360 (2008)
12. Hu, W., Hu, M., Zhou, X., Tan, T., Lou, J., Maybank, S.: Principal Axis-Based Correspondence between Multiple Cameras for People Tracking. PAMI 28(4), 625–634 (2006)
13. Park, S., Trivedi, M.M.: Understanding Human Interactions with Track and Body Synergies (TBS) Captured from Multiple Views. CVIU 111(1), 2–20 (2008)
14. Khan, S.M., Shah, M.: A Multiview Approach to Tracking People in Crowded Scenes using a Planar Homography Constraint. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 133–146. Springer, Heidelberg (2006)
15. Park, S., Trivedi, M.M.: Analysis and Query of Person-Vehicle Interactions in Homography Domain. In: IEEE Conference on Video Surveillance and Sensor Networks, Santa Barbara, CA, USA (2006)
16. Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., Wixson, L.: A System for Video Surveillance and Monitoring, Tech. report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University (May 2000)
17. Shah, M., Javed, O., Shafique, K.: Automated Visual Surveillance in Realistic Scenarios. IEEE MultiMedia 14(1), 30–39 (2007)
18. Fihl, P., Corlin, R., Park, S., Moeslund, T.B., Trivedi, M.M.: Tracking of Individuals in Very Long Video Sequences. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Meenakshisundaram, G., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4291, pp. 60–69. Springer, Heidelberg (2006)
19. Cucchiara, R., Grana, C., Piccardi, M., Prati, A., Sirotti, S.: Improving Shadow Suppression in Moving Object Detection with HSV Color Information. In: IEEE Conference on Intelligent Transportation Systems, Oakland, CA, USA (2001)
20. Hansen, D.M., Duizer, P.T.: Multi-View Video Surveillance of Outdoor Traffic, Master Thesis, Aalborg University, Denmark (2007)
21. Senior, A., Hampapur, A., Tian, Y.-L., Brown, L., Pankanti, S., Bolle, R.: Appearance Models for Occlusion Handling. IVC 24(11), 1233–1243 (2006)
Self-Organizing Maps for the Automatic Interpretation of Crowd Dynamics

B. Zhan, P. Remagnino, N. Monekosso, and S.A. Velastin

Digital Imaging Research Centre, Faculty of Computing, Information Systems and Mathematics, Kingston University, UK
{B.Zhan,P.Remagnino}@kingston.ac.uk
Abstract. This paper introduces the use of self-organizing maps for the visualization of crowd dynamics and to learn models of the dominant motions of crowds in complex scenes. The self-organizing map (SOM) model is a well-known dimensionality reduction method that has been shown to bear resemblance to characteristics of the human brain, representing sensory input by topologically ordered computational maps. This paper proposes algorithms to learn and compare crowd dynamics with the SOM model. Different types of information are employed as input to the SOM. Qualitative and quantitative results are presented in the paper.
1  Introduction
We are interested in devising methods to automatically measure and model the crowd phenomenon. Crowded public places are increasingly monitored by security and safety operators. There are companies (for example LEGION [1]) that have employed large resources to study the phenomenon and generate realistic simulations, for instance to optimize the flow of people in a public space. Computer vision research offers a large number of techniques to extract and combine information from a video sequence acquired to observe a complex scene. The life cycle of a computer vision system includes the acquisition of the monitored scene with one or more homogeneous or heterogeneous cameras, the extraction of features of interest and then the classification of objects, people and their dynamics. In simple scenes the background is extracted with statistical methods and then foreground data and related information are inferred to describe and model the scene. Background is usually defined as stationary data, for instance man-made structure, such as buildings, in a typical video surveillance application, or the indoor structure of a building in a safety application, for instance deployed to monitor and safeguard elderly people in a home. Unfortunately, background modeling becomes rapidly less effective in complex scenes and its usefulness seems to be inversely proportional to the clutter measured in the scene. Figure 1 shows a small experiment testing the effectiveness of background modeling with different types of scenes. Three frames per chosen sequence and the resulting background image built with roughly 1000 frames are illustrated.
Fig. 1. The example frames and the built background images from three different scenes. Left to right: three different scenes; top to bottom, three example frames and the built background images, respectively.
The background modeling works well with the first scene; it fails to recover the background of some regions in the second scene because of the frequent occupancy of these regions; and in the third scene, due to the continuous clutter, the background model can barely be recovered. Although sophisticated methods have been proposed for tracking crowded environments, such as the Particle Filter [2][3] and the Joint Probabilistic Data Association Filter (JPDAF) [4], the state of the art describes only scenes with a limited number of people. In highly crowded scenes, tracking is not a viable option and it is more interesting and valuable to retrieve the global crowd motion instead of individual motion. Crowd motion estimation algorithms have been proposed using local descriptors [5][6]. The overall objective of our research is to model
crowd dynamics. This is normally achieved based on the information extracted from visual data. Andrade [7][8][9] characterizes crowd behavior by observing the optical flow associated with the crowd and uses unsupervised feature extraction to encode normal crowd behavior. Zhan [10][11][12] proposes a crowd model using accumulated motion and foreground (moving objects) information of a crowded scene. In this paper we describe some work carried out applying self-organizing maps to learn the dominant crowd dynamics. SOMs are widely used for mapping multidimensional data onto a low-dimensional map; examples of applications include the analysis of banking data, linguistic data [13] and image classification [14]. In this paper, optical flow and raw images have been used as input to the SOM; the resulting SOMs are then employed to classify different crowded scenes. This paper is organized as follows: in Section 2 some background on the self-organizing map is given; in Section 3 the use of optical flow input is introduced; in Section 4 the use of raw image input is presented; and Section 5 gives the conclusion and a discussion of further work.
2  Self Organizing Map
The most common SOMs have neurons organized as nodes in a one- or two-dimensional lattice. The neurons of a SOM are activated by input patterns in the course of a competitive learning process. At any moment in time, only one output neuron is active, the so-called winning neuron. Input patterns are from an n-dimensional input space and are then mapped to the one- or two-dimensional output space of the SOM. Every neuron has a weight vector which belongs to the input space [15]. There are two phases for tuning the SOM with an input pattern I: competing and updating. In the competing phase every neuron is compared with I; the similarity between I and the weights of all of the neurons is computed, and the neuron N(i_w, j_w) (denoted by the neuron's coordinates on the lattice) with the highest similarity is selected as the winning neuron. In the update phase, for each neuron N(i, j), a distance is calculated as

d² = (i − i_w)² + (j − j_w)²,   (1)

and the topological neighborhood function is then defined as

h(n) = exp(−d² / (2σ²(n))),   (2)

where n denotes the time, which can also be explained as the number of iterations, and σ²(n) decreases with time. The weight of each neuron N(i, j) at time n + 1 is then defined by

w(n + 1) = w(n) + η(n) h(n) (x − w(n)),   (3)
where w(n) and w(n + 1) are the weights of a neuron at time n and n + 1, while η(n) is the learning parameter, which decreases with time.
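A compact implementation of this training procedure is sketched below as an illustration; the grid size, iteration count, and the exponential decay schedules chosen for η(n) and σ(n) are our assumptions, not values from the paper.

```python
import numpy as np

def train_som(data, grid=(10, 10), n_iters=5000, eta0=0.5, sigma0=3.0, seed=0):
    """Train a 2D SOM with the competitive/update phases of Eqs. (1)-(3)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.uniform(data.min(0), data.max(0), size=(h, w, data.shape[1]))
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    for n in range(n_iters):
        x = data[rng.integers(len(data))]
        # Competing phase: the winning neuron has the closest weight vector.
        iw, jw = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)),
                                  (h, w))
        # Update phase.
        sigma = sigma0 * np.exp(-n / n_iters)
        eta = eta0 * np.exp(-n / n_iters)
        d2 = (ii - iw) ** 2 + (jj - jw) ** 2                 # Eq. (1)
        hn = np.exp(-d2 / (2.0 * sigma ** 2))                # Eq. (2)
        weights += eta * hn[..., None] * (x - weights)       # Eq. (3)
    return weights
```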
3  Optical Flow as Input
In this application, a SOM should capture the two major components of the crowd dynamics: spatial occurrence and orientation. Thus a four-dimensional input space is chosen as the weight space of the SOM, which can be represented as f : (x, y, θ, ρ). Each input vector can be interpreted as a location where the crowd moves together with the motion vector, given as an angle θ and a magnitude ρ. The SOM used in this experiment is organized in a two-dimensional space and represented by a regular square lattice.
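Training vectors of this form can be extracted from dense optical flow, for example as in the following sketch; the Farneback flow, the sampling stride, and the magnitude threshold are our assumptions, not parameters from the paper.

```python
import cv2
import numpy as np

def flow_samples(prev_gray, gray, step=8, min_mag=0.5):
    """Turn dense optical flow into (x, y, theta, rho) training vectors for the SOM."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    samples = []
    for y in range(0, flow.shape[0], step):
        for x in range(0, flow.shape[1], step):
            dx, dy = flow[y, x]
            rho = float(np.hypot(dx, dy))
            if rho >= min_mag:                    # keep only moving locations
                samples.append((x, y, float(np.arctan2(dy, dx)), rho))
    return np.array(samples, dtype=np.float32)
```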
3.1  Visualization
Figure 2 illustrates three different video sequences with different dynamics. These video sequences have been input into the system, and Figure 3 shows the output SOMs. In the figure, SOMs are visualized in the input space, i.e., showing the weight vector of each neuron. In the visualization, the colored arrows and their locations come from the weight vectors of the neurons: the locations of the arrows are given by the first two components of the weight vectors (x, y), and the arrows show the second two components, the components of motion (θ, ρ). The different colors of the arrows also indicate the different orientations of the motion. In the first video (the left image in Figure 2) the major crowd is moving from bottom left to top right of the scene. There is another crowd flow from the bottom right of the scene which joins the major flow. In its SOM (the first one in Figure 3) the neurons with green arrows are clearly from the major flow and the ones with red and purple arrows are from the minor flow. In the second video (the middle image in Figure 2) the scene is the entrance area of a public space. Most of the people move from top to bottom in the illustrated scene. The crowd in the upper part of the scene is sparser and moves faster when compared to the crowd in the lower part of the scene. There is also a minor flow, which joins the major flow from the right of the scene. In the built SOM (the second SOM in Figure 3), again the flows are clearly indicated. Furthermore, the SOM takes an umbrella shape, which represents the shape of the flow constrained by the obstacles in the scene. In the third video (the right image in Figure 2) the scene is a large open area
Fig. 2. The example frames from three different scenes: (a) scene A, (b) scene B, (c) scene C
Fig. 3. Three different scenes and their visualization using SOM
with multiple crowd flows. The major flow is moving from right to left; however, there are several minor flows, most of which are in the lower part of the scene. Again the SOM (the third in Figure 3) captures the major dynamics and also some minor flows. From the three examples, it can be concluded that the SOMs not only preserve the dominant motion vectors, but also represent the shape of the regions with dominant motion in the scenes.

3.2  Classification
Visualizations of the SOMs have already provided information about recurrent motion. Scene classification has been carried out using the characteristics captured by the SOMs. To achieve this, comparisons of the SOMs built for different scenes have been carried out. The classification is based on the similarities of the SOMs. The topological structures of the lattices of the SOMs, as well as the weights of the neurons of the SOMs, are used for the comparison. Topological structure is an important feature of a SOM, and a large number of methods have been proposed to
measure it [16]. In this work a C-Measure is used, which is defined as

C = Σ_{(i,j) ∈ A×A, i ≠ j} F_A(i, j) F_V(w_i, w_j),   (4)
where F_A and F_V are the similarities on the output space (i.e., the SOM lattice) and on the input space (i.e., the weight space), respectively. The i and j are the indices of the neurons and w_i and w_j are the weights of the indexed neurons. The similarity of correspondent SOM neurons is calculated as

Sim_w = F_V(w_i, w_j) = Σ_{k=0}^{Dim_w} min(w_i^k, w_j^k) / Σ_{k=0}^{Dim_w} max(w_i^k, w_j^k),   (5)
where Dim_w is the dimension of the weight space, and w_i^k and w_j^k are the k-th elements of the weights w_i and w_j, respectively. An average over the lattice has been calculated. This equation is used for calculating the similarity of the weights of two neurons. The similarity of the structure is calculated as the similarity of the C-Measures:

Sim_c = min(C_i, C_j) / max(C_i, C_j).   (6)

A combination of the two similarities, weight similarity and structure similarity, is calculated by

Sim = Sim_c × Sim_w.   (7)

The correspondence of the neurons is defined by the closeness of the values of their weights. In particular, for neuron i in SOM A, the correspondent neuron in SOM B is the one with the closest weight value. The matching can be asymmetrical: for example, assuming that for neuron i in SOM A the correspondent neuron in SOM B is neuron j, for neuron j the correspondent neuron in SOM A is not necessarily neuron i. This can be caused by the neurons from SOMs A and B not being on the same value scale. In some extreme cases, all the neurons in one SOM could even be matched to the same neuron in the other SOM. As a result, another interesting figure is the number of matched neurons in SOM B, which indicates whether the values of the weights in SOM A are on the same scale as those in SOM B. It is combined with the last measure by

S = (Sim + N_matched / N_total) / 2.   (8)

The fact that the comparison is not symmetric also means that comparing SOM A with SOM B will give a different result from comparing SOM B with SOM A. Consequently, two similarity values are generated for each comparison of two SOMs. This experiment takes three scenes, and two sequences are extracted from each scene, so that there are 6 sequences in total in the experiment. The following confusion matrix illustrates the relative results.
Table 1. Confusion matrix of SOMs from different scenes (Scn abbreviates Scene)
           Scene A-1   Scene A-2   Scene B-1   Scene B-2   Scene C-1   Scene C-2
Scn A-1    1           0.633261    0.330897    0.35838     0.369259    0.366577
Scn A-2    0.653097    1           0.33033     0.332804    0.400326    0.318921
Scn B-1    0.484192    0.372017    1           0.641464    0.443745    0.414272
Scn B-2    0.433234    0.315155    0.645102    1           0.426589    0.429349
Scn C-1    0.468101    0.426438    0.4264      0.467114    1           0.687943
Scn C-2    0.458993    0.400024    0.465297    0.455613    0.715606    1
Fig. 4. Selected SOM neurons built from a single scene
similarity value of a SOM with the other sequences. There are two values for each pair, i.e., the similarity of SOM A compared with SOM B and the similarity of SOM B compared with SOM A. The values above 0.5 all come from pairs of video sequences belonging to the same scene.
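To make the comparison concrete, the following NumPy sketch implements the neuron-level similarity of Eq. (5), the correspondent-neuron matching, and the combinations of Eqs. (6)-(8). It assumes the two C-Measure values of Eq. (4) are already computed, and the lattice averaging follows one plausible reading of the text; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def neuron_similarity(wi, wj):
    """Eq. (5): ratio of summed element-wise minima to maxima of two weight vectors."""
    eps = 1e-12
    return np.minimum(wi, wj).sum() / (np.maximum(wi, wj).sum() + eps)

def som_similarity(weights_a, weights_b, c_a, c_b):
    """Asymmetric comparison of SOM A against SOM B (Eqs. 5-8).
    weights_a, weights_b: (N, D) arrays of neuron weights; c_a, c_b: C-Measure values."""
    matches, sims = [], []
    for wa in weights_a:
        # correspondent neuron in B: the one with the closest weight value
        j = int(np.argmin(np.linalg.norm(weights_b - wa, axis=1)))
        matches.append(j)
        sims.append(neuron_similarity(wa, weights_b[j]))
    sim_w = float(np.mean(sims))                      # average over the lattice
    sim_c = min(c_a, c_b) / max(c_a, c_b)             # Eq. (6)
    sim = sim_c * sim_w                               # Eq. (7)
    n_matched = len(set(matches))                     # distinct neurons hit in B
    return (sim + n_matched / len(weights_b)) / 2.0   # Eq. (8)
```

Swapping the argument order gives the second, generally different, similarity value discussed above.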
4
Raw Image as Input
In this application the whole image is regarded as the input feature for the SOM. The raw data are used as three-channel color images. In other words, the weight of each SOM neuron lies in a W × H × 3 space, where W and H are the width and the height of the image, respectively. The dimensionality of the input video data is thus reduced from W × H × 3 (image space) to n × n (lattice space). The neurons of the SOM retain the different states of the particular scene. Some selected neurons from a SOM constructed from raw images are illustrated in Figure 4. The neurons illustrate the different crowd states of the square, as well as some trajectories of the crowd. The above experiment is carried out on a video sequence which contains only a single crowded scene. In the following experiment, the SOM is built from video sequences consisting of more than one crowded scene. Figure 5 shows two neurons from the SOM built from a video sequence which contains two different crowded scenes. The built SOM has modeled the two different scenes (example frames from the two scenes can be found in Figure 2(a) and 2(b)).
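As a rough illustration of how a SOM reduces W × H × 3 frames to an n × n lattice, the following minimal NumPy sketch trains a small SOM directly on flattened RGB frames. The lattice size and the learning-rate and neighbourhood schedules are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def train_som_on_frames(frames, n=3, iters=5000, lr0=0.5, sigma0=1.0, seed=0):
    """Train an n x n SOM whose neuron weights live in the W*H*3 image space.
    frames: (num_frames, W*H*3) array of flattened (possibly downsampled) RGB frames."""
    rng = np.random.default_rng(seed)
    dim = frames.shape[1]
    weights = rng.random((n, n, dim))
    grid = np.stack(np.meshgrid(np.arange(n), np.arange(n), indexing="ij"), axis=-1)
    for t in range(iters):
        x = frames[rng.integers(len(frames))]
        lr = lr0 * np.exp(-t / iters)
        sigma = sigma0 * np.exp(-t / iters)
        # winning neuron = neuron with the closest weight vector
        d = np.linalg.norm(weights - x, axis=2)
        winner = np.unravel_index(np.argmin(d), d.shape)
        # neighbourhood-weighted update pulls nearby neurons toward the frame
        h = np.exp(-np.sum((grid - np.array(winner)) ** 2, axis=2) / (2 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights
```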
Fig. 5. Two neurons from SOM built by a video sequence which contains scene A and B
[Figure 6 panels: (a) Train video sequence; (b) Test video sequence from scene A; (c) Test video sequence from scene B. Each panel plots the position of the winning neuron on the SOM lattice against the frame index.]
Fig. 6. Tracking of the winning neuron
Different neurons of the SOM represent different dynamics in the video sequence. The tracking of the winning neurons indicates the transition between dynamics. Figure 6(a) shows the changes of the winning neurons on the SOM lattice when using the training video sequence (the lattice coordinates are shown on the left-side vertical plane, and the axis with numbers from 0 to 5000 is the time line). There is an obvious transition between the winning neurons in the middle of the time line, which corresponds to the change of scenes. New image sequences from the two scenes are used as input to the SOM to test its ability to classify scenes. Figures 6(b) and 6(c) show the result of tracking the winning neuron
over time. The winning neurons of the first scene are on the same plane, and for the second scene the winning neurons never get to the previous plane.
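The winner-tracking step can be sketched as follows. It assumes the SOM weights from the previous sketch and a hypothetical neuron_labels array that assigns a scene id to each lattice position; that labeling is an assumption introduced for illustration, not something defined in the paper.

```python
import numpy as np

def winner_trajectory(frames, weights):
    """Map each frame to its winning neuron; this lattice trajectory over time
    is what Figure 6 visualizes."""
    traj = []
    for x in frames:
        d = np.linalg.norm(weights - x, axis=2)
        traj.append(np.unravel_index(np.argmin(d), d.shape))
    return np.array(traj)                     # shape (num_frames, 2)

def classify_sequence(frames, weights, neuron_labels):
    """Label a test sequence by majority vote over the scene labels of its
    winning neurons (neuron_labels: n x n array of scene ids)."""
    traj = winner_trajectory(frames, weights)
    votes = neuron_labels[traj[:, 0], traj[:, 1]]
    return int(np.bincount(votes).argmax())
```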
5
Conclusion and Future Works
This paper presented crowd analysis work using Self-Organizing Maps. Experiments were carried out testing optical flow and raw images to train the SOM. In the first case the SOM built for each crowded scene shows its capability of capturing the major dynamics. Scene classification is carried out by quantitatively comparing the built SOMs. In the second case, the SOM can capture major dynamics from more than one scene. The experiments show that frames from different scenes activate neurons at different locations of the lattice, so that they can be labeled and classified. This work is, to our knowledge, the first attempt to employ SOM in crowd analysis applications. It reveals the great potential of SOM in handling this problem. More experiments, for example with different input features, can be carried out. A deeper analysis of the relationships between neurons could also be used to build a better model of the dynamics.
References 1. Legion: (Legion group plc), http://www.legion.biz/about/index.html 2. Venegas, S., Knebel, S., Thiran, J.: Multi-object tracking using particle filter algorithm on the top-view plan. Technical report, LTS-REPORT-2004-003, EPFL (2004), http://infoscience.epfl.ch/getfile.py?mode=best&recid=87041 3. Cai, Y., de Freitas, N., Little, J.J.: Robust visual tracking for multiple targets. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 107–118. Springer, Heidelberg (2006) 4. Karlsson, R., Gustafsson, F.: Monte Carlo data association for multiple target tracking. Target Tracking: Algorithms and Applications (Ref. No. 2001/174). IEE 1 (2001) 5. Zhan, B., Remagnino, P., Velastin, S., Bremond, F., Thonnat, M.: Matching gradient descriptors with topological constraints to characterise the crowd dynamics. In: IET International Conference on Visual Information Engineering, VIE 2006, pp. 441–446 (2006) ISSN: 0537-9989, ISBN: 978-0-86341-671-2 6. Zhan, B., Remagnino, P., Velastin, S.A., Monekosso, N., Xu, L.Q.: Motion estimation with edge continuity constraint for crowd scene analysis. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Meenakshisundaram, G., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4292, pp. 861–869. Springer, Heidelberg (2006) 7. Andrade, E., Fisher, R.: Modelling crowd scenes for event detection. In: Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), vol. 01, pp. 175–178. IEEE Computer Society, Washington (2006) 8. Andrade, E., Fisher, R.: Hidden Markov models for optical flow analysis in crowds. In: Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), vol. 01, pp. 460–463. IEEE Computer Society, Washington (2006) 9. Andrade, E.L., Blunsden, S., Fisher, R.B.: Performance analysis of event detection models in crowded scenes. In: Proc. Workshop on Towards Robust Visual Surveillance Techniques and Systems at Visual Information Engineering 2006, Bangalore, India, pp. 427–432 (2006)
10. Zhan, B., Remagnino, P., Velastin, S.: Analysing Crowd Intelligence. In: Second AIxIA Workshop on Ambient Intelligence (2005) 11. Zhan, B., Remagnino, P., Velastin, S.: Visual analysis of crowded pedestrain scenes. In: XLIII Congresso Annuale AICA, pp. 549–555 (2005) 12. Zhan, B., Remagnino, P., Velastin, S.: Mining paths of complex crowd scenes. In: Bebis, G., Boyle, R., Koracin, D., Parvin, B. (eds.) ISVC 2005. LNCS, vol. 3804, pp. 126–133. Springer, Heidelberg (2005) 13. Kirt, T., Vainik, E., Võhandu, L.: A method for comparing self-organizing maps: case studies of banking and linguistic data. In: Eleventh East-European Conference on Advances in Databases and Information Systems ADBIS, Varna, Bulgaria, Technical University of Varna, pp. 107–115 (2007) 14. Lefebvre, G., Laurent, C., Ros, J., Garcia, C.: Supervised Image Classification by SOM Activity Map Comparison. In: Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), vol. 02, pp. 728–731 (2006) 15. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River (1994) 16. Polani, D.: Measures for the organization of self-organizing maps. Self-Organizing neural networks: recent advances and applications, 13–44 (2002)
A Visual Tracking Framework for Intent Recognition in Videos Alireza Tavakkoli, Richard Kelley, Christopher King, Mircea Nicolescu, Monica Nicolescu, and George Bebis Department of Computer Science and Engineering, University of Nevada, Reno {tavakkol,rkelley,cjking,mircea,monica,bebis}@cse.unr.edu
Abstract. To function in the real world, a robot must be able to understand human intentions. This capability depends on accurate and reliable detection and tracking of trajectories of agents in the scene. We propose a visual tracking framework to generate and maintain trajectory information for all agents of interest in a complex scene. We employ this framework in an intent recognition system that uses spatio-temporal contextual information to recognize the intentions of agents acting in different scenes, comparing our system with the state of the art.
1
Introduction
Recognizing the intentions of others is an important part of human cognition, especially in activities in which collaboration or competition play an important role. To achieve autonomy, robots will have to have similar capabilities. This will be especially critical in tasks requiring substantial human-robot interaction. Any system that uses visual clues to infer the intent of active agents in a scene requires a robust mechanism to track these agents at pixel level, to accurately generate and maintain their trajectories at world level, and to pass this world-level information to higher level stages of the process. In this paper, we propose a system that performs these tasks and is robust to illumination changes, background clutter, and occlusion. In order to detect humans in videos several methods have been proposed. Dalal and Triggs in [1] proposed a method to detect human bodies in images by employing Histogram of Oriented Gradients (HoG) in conjunction with a linear SVM trained for the class of human bodies. Ramanan et al. in [2] proposed a geometric and probabilistic approach to detect human bodies in video sequences. However, these methods are very time consuming, which makes them undesirable for applications with real-time constraints. We demonstrate our vision system both in isolation and as a part of a larger intent recognition system. That system uses hidden Markov models of activities in a Bayesian framework that uses context to improve the system’s performance. G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 450–459, 2008. c Springer-Verlag Berlin Heidelberg 2008
2
Object Tracking Module
We provide a set of vision-based perceptual capabilities for our robotic system that facilitate the modeling and recognition of actions carried out by other agents. As the appearance of these agents is generally not known a priori, the only visual cue that can be used for detecting and tracking them is image motion. Although it is possible to perform segmentation from an image sequence that contains global motion, such approaches, typically based on optical flow estimation [3], are not very robust and are time consuming. Therefore, our approach uses more efficient and reliable techniques from real-time surveillance, based on background modeling and segmentation:
– During the activity modeling stage, the robot is moving while performing various activities. The appearance models of other mobile agents, necessary for tracking, are built in a separate, prior process where the static robot observes each agent that will be used for action learning. The robot uses an enhanced mean-shift tracking method to track the foreground object.
– During the intent recognition stage, the static robot observes the actions carried out by other agents. This allows the use of a foreground-background segmentation technique to build appearance models on-line, and to improve the speed and robustness of the tracker. The robot is stationary for efficiency reasons. If the robot moves during intent recognition we can use the approach from the modeling stage.
Fig. 1 shows the block diagram of the proposed object tracking frameworks.
2.1 Intent Recognition Visual Tracking Module
We propose an efficient Spatio-Spectral Tracking module (SST) to detect objects of interest and track them in the video sequence. The major assumption is that
Fig. 1. The two object tracking frameworks for (a) activity modeling using a modified mean-shift tracker and (b) intent recognition using a spatio-spectral tracker
the observer robot is static. However, we do not make any further restrictions on the background composition, thus allowing for local changes in the background such as fluctuating lights, water fountains, waving tree branches, etc. The proposed system models the background pixel changes using an Incremental Support Vector Data Description module. The background model is then used to detect foreground regions in new frames. The foreground regions are processed further by employing a connected component processing in conjunction with a blob detection module to find objects of interest. These objects are tracked by their corresponding statistical models which are built from the objects’ spectral (color) information. A laser-based range finder is used to extract the objects’ trajectories and relative angles from their 2-D tracking trajectories and their depth in the scene. However, the spatio-spectral coherency of tracked objects may be violated in cases when two or more objects occlude each other. A collision resolution mechanism is devised to address the issue of occlusion of objects of interest. This mechanism uses the spatial object properties such as their size, the relative location of their center of mass, and their relative orientations to predict the occlusion (collision). Incremental Support Vector Data Description. Background modeling is one of the most effective and widely used techniques to detect moving objects in videos with a quasi-stationary background. In these scenarios, despite the presence of a static camera, the background is not completely stationary due to inherent changes, such as water fountains, waving flags, etc. Statistical modeling approaches estimate the probability density function of the background pixel values. Parametric density estimation methods, such as Mixture of Gaussians techniques (MoG) are proposed in [4]. If the data is not drawn from a mixture of normal distributions the parametric density estimation techniques may not be useful. As an alternative, non-parametric density estimation approaches can be used to estimate the probability of a given sample belonging to the same distribution function as the data set [5]. However, the memory requirements of the non-parametric approach and its computational costs are high since they require the evaluation of a kernel function for all data samples. Support Vector Data Description (SVDD) is a technique which uses support vectors in order to model a data set [6]. The SVDD represents one class of known data samples in such a way that for a given test sample it can be recognized as known, or rejected as novel. Training of SVDDs is a quadratic programming optimization problem. This optimization converges by optimizing only on two data points with a specific condition [7] which requires at least one of the data points to violate the KKT conditions – the conditions by which the classification requirements are satisfied– [8]. Our experimental results show that our SVDD training achieves higher speed and require less memory than the online [9] and the canonical training [6]. A normal data description gives a closed boundary around the data which can be represented by a hyper-sphere (i.e. F (R, a)) with center a and radius R, whose volume should be minimized. To allow the possibility of outliers, slack variables i ≥ 0 are introduced. The error function to be minimized is:
F(R, a) = R² + C Σ_i ξ_i    (1)

subject to

‖x_i − a‖² ≤ R² + ξ_i,  ∀i.    (2)
Kernel functions K(x_i, x_j) = Φ(x_i) · Φ(x_j) are used to achieve more general descriptions. After applying the kernel and using Lagrange optimization the SVDD function becomes

L = Σ_i α_i K(x_i, x_i) − Σ_{i,j} α_i α_j K(x_i, x_j),   ∀α_i : 0 ≤ α_i ≤ C    (3)
Only data points with non-zero α_i are needed in the description of the data set, therefore they are called support vectors of the description. After optimizing (3) the Lagrange multipliers should satisfy the normalization constraint Σ_i α_i = 1. Our incremental training algorithm is based on the theorem proposed by Osuna et al. in [8]. According to Osuna, a large QP problem can be broken into a series of smaller sub-problems. The optimization converges as long as at least one sample violates the KKT conditions. In the incremental learning scheme, at each step we add one sample to the training working set consisting of only support vectors. Assume we have a working set which minimizes the current SVDD objective function for the current data set. The KKT conditions do not hold for samples which do not belong to the description. Thus, the SVDD converges only for the set which includes a sample outside the description boundary. The smallest possible sub-problem consists of only two samples [7]. Since only the new sample violates the KKT conditions at every step, our algorithm chooses one sample from the working set along with the new sample and solves the optimization on them. Solving the QP problem for two Lagrange multipliers can be done analytically. Because there are only two multipliers at each step, the minimization constraint can be displayed in 2-D. The two Lagrange multipliers should satisfy the inequality in (3) and the linear equality in the normalization constraint. We first compute the constraints on each of the two multipliers. The two Lagrange multipliers should lie on a diagonal line in 2-D (equality constraint) within a rectangular box (inequality constraint). Without loss of generality, the algorithm starts by finding the upper and lower bounds on α_2, which are H = min(C, α_1^old + α_2^old) and L = max(0, α_1^old + α_2^old − C), respectively. The new value α_2^new is computed by finding the maximum along the direction given by the linear equality constraint. If the new value α_2^new exceeds the bounds it will be clipped. Finally, the new value of α_1 is computed using the linear equality constraint:

α_1^new = α_1^old + α_2^old − α_2^new    (4)
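A minimal sketch of this analytic two-multiplier update is given below. It assumes the unconstrained optimum for α_2 along the equality line has already been computed (its derivation from the kernel terms is omitted), and it uses the standard SMO-style box bounds for the clipping step.

```python
def update_pair(alpha1_old, alpha2_old, alpha2_unclipped, C):
    """Clip the analytic solution for alpha_2 to its box constraints and recover
    alpha_1 from the equality constraint (Eq. 4).  alpha2_unclipped is the
    unconstrained optimum along the equality line (derivation omitted)."""
    gamma = alpha1_old + alpha2_old          # preserved by the equality constraint
    H = min(C, gamma)                        # upper bound on alpha_2
    L = max(0.0, gamma - C)                  # assumed standard lower box bound
    alpha2_new = min(max(alpha2_unclipped, L), H)
    alpha1_new = gamma - alpha2_new          # Eq. (4)
    return alpha1_new, alpha2_new
```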
The algorithm for background modeling using the incremental SVDD is composed of two modules. In the background modeling module the incremental
Fig. 2. Comparison of methods in presence of irregular motion in water surface video: (a) Original frame. (b) MoG results. (c) AKDE results. (d) INCSVDD results.
learning scheme uses the pixel values in each frame to train their corresponding SVDDs. In the background subtraction module the pixel value in the current frame is used in the trained SVDD to label it as a known (background) or novel (foreground) pixel. Background Modeling Results. We compare the results of background modeling and foreground detection using our method with the AKDE [5] and MoG [4] on the water surface video. The results of MoG, the AKDE and our technique are shown in Figure 2 (b), (c), and (d), respectively.
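The two-module structure (per-pixel background modeling followed by background subtraction) can be sketched as follows. Since the paper's incremental SVDD is not available in standard libraries, scikit-learn's OneClassSVM is used here purely as a stand-in one-class description; fitting a model for every pixel is only illustrative and would be slow at full resolution.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def fit_background_models(training_frames, nu=0.1, gamma=0.5):
    """training_frames: (T, H, W, 3) float array in [0, 1].
    Returns one one-class colour model per pixel (stand-in for the per-pixel SVDD)."""
    T, H, W, _ = training_frames.shape
    models = np.empty((H, W), dtype=object)
    for y in range(H):
        for x in range(W):
            samples = training_frames[:, y, x, :]      # T colour samples for this pixel
            models[y, x] = OneClassSVM(nu=nu, gamma=gamma).fit(samples)
    return models

def subtract_background(frame, models):
    """Label each pixel of frame ((H, W, 3)) as foreground (True) when its colour
    is rejected as novel by that pixel's background description."""
    H, W, _ = frame.shape
    mask = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            mask[y, x] = models[y, x].predict(frame[y, x][None, :])[0] == -1
    return mask
```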
Fig. 3. Background modeling using the proposed method
Figure 3 shows the results of foreground detection in videos using our method. The fountain, waving branches, and flickering lights in Figure 3(a), (b), and (c) pose significant challenges. Blob Detection and Object Localization. In the blob detection module, the system uses a spatial connected component processing to label foreground regions from the previous stage. However, to label objects of interest a blob refinement framework is used to compensate for inaccuracies in physical appearance of the detected blobs due to unintended region split and merge, inaccurate foreground detection, and small foreground regions. A list of objects of interest corresponding to each detected blob is created and maintained to further
process and track each object individually. This raw list of blobs corresponding to objects of interest is called the spatial connected component list. Spatial properties of each blob, such as its center and size, are kept in the spatial connected component list. The list does not incorporate individual objects' appearances and thus is not solely useful for tracking purposes. The process of tracking individual objects based on their appearance in conjunction with their corresponding spatial features is carried out in the spatio-spectral tracking mechanism. Spatio-Spectral Tracking Mechanism. A system that can track moving objects (i.e., humans) requires a model for individual objects. These appearance models are employed to search for correspondences among the pool of objects detected in new frames. Once the target for each individual has been found in the new frame they are assigned a unique ID. In the update stage the new location, geometric and photometric information for each visible individual are updated. This helps recognize the objects and recover their new location in future frames. Our proposed appearance modeling module represents an object with two sets of histograms, for the lower and upper half of the body. In the spatio-spectral tracking module a list of known objects of interest is maintained. This list represents each individual object and its corresponding spatial and color information along with its unique ID. During the tracking process the system uses the raw spatial connected component list as the list of observed objects and uses a statistical correspondence matching to maintain the ordered objects list and track each object individually. The tracking module is composed of three components:
– Appearance modeling. For each object in the raw connected component list a model is generated which contains the object center of mass, its height and width, the upper and lower section foreground masks, and the multivariate Gaussian distribution models of its upper and lower section pixels.
– Correspondence matching. The pixels in the upper and lower sections of each object in the raw list are matched against each model in the ordered list of tracked objects. The winning model's ID is then used to represent the object.
– Model update. Once the tracking is performed the models are updated. Any unseen object in the raw list is assigned a new ID and its model is updated accordingly.
Collision Resolution. When individuals get too close so that one occludes the other, the models for the occluded individual may not be reliable for tracking purposes. Our method uses the distance between detected objects as a means of detecting a collision. After a collision is detected we match each of the individual models with their corresponding representatives. The one with the smallest matching score is considered to be occluded. The occluded object's model will not be updated but its new position is predicted by a Kalman filter. The position of the occluding agent is updated and tracked by a well-known mean-shift algorithm. After the collision is over the spatio-spectral tracker resumes its normal process for these objects.
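A possible sketch of the appearance modeling and correspondence matching steps is shown below, using mean/covariance colour models for the upper and lower body sections and a Mahalanobis-distance matching rule; the exact statistical matching used by the authors is not specified, so this rule is an assumption.

```python
import numpy as np

def appearance_model(pixels_upper, pixels_lower):
    """Mean/covariance colour models for the upper and lower halves of a blob
    (each input is an (N, 3) array of RGB values inside the blob mask)."""
    return {
        "upper": (pixels_upper.mean(0), np.cov(pixels_upper.T) + 1e-6 * np.eye(3)),
        "lower": (pixels_lower.mean(0), np.cov(pixels_lower.T) + 1e-6 * np.eye(3)),
    }

def match_score(model, pixels_upper, pixels_lower):
    """Smaller is better: mean Mahalanobis distance of observed pixels to the
    stored upper/lower models."""
    score = 0.0
    for key, pixels in (("upper", pixels_upper), ("lower", pixels_lower)):
        mu, cov = model[key]
        inv = np.linalg.inv(cov)
        diff = pixels - mu
        score += np.sqrt(np.einsum("ij,jk,ik->i", diff, inv, diff)).mean()
    return score / 2.0

def assign_ids(tracked_models, detections):
    """detections: list of (pixels_upper, pixels_lower) pairs; returns the id of
    the best-matching stored model for each detection."""
    ids = []
    for up, lo in detections:
        scores = {oid: match_score(m, up, lo) for oid, m in tracked_models.items()}
        ids.append(min(scores, key=scores.get))
    return ids
```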
Fig. 4. Detection results in four frames of a video sequence: Top row: detected foreground masks. Bottom row: Tracked objects.
Tracking Results. The proposed tracking algorithm is tested on several video sequences in which people would appear in the scene, pass each other and reappear in the scene after leaving it. Fig. 4 shows the detection and tracking results of the algorithm. As it can be seen from Fig. 4(b) the system resolves the occlusion and recovers very quickly once the occluded object reappears, Fig. 4(c).
3
Context-Based Intent Recognition
The information from the vision system is used by the intent recognition module. This module has two parts: a set of activity models and a set of context models. The overall architecture is inspired by Theory of Mind [10].
3.1 Hidden Markov Models for Representing Activities
Our activity modeling approach uses hidden Markov models (HMMs). In our system, a single HMM models a single basic activity ("two people approach one another"). The hidden states of such a model correspond to "small-scale" intentions or components of a complex activity. The visible symbols of our models correspond to changes in the parameters produced as the output of the tracking module. The system is trained using the approach described above during the activity modeling stage and the Baum-Welch algorithm.
3.2 Context Models for Inferring Intent
Recent neuroimaging work [11] suggests that humans use the context in which an activity is performed to infer the intent of an agent. Our system mimics this capability to improve intent recognition. In our system, an intention consists of a name and an activity model, and a context consists of a name and a probability distribution over the possible
intentions that may occur in the current context. Our goal is to determine the most likely intention of an agent, given a sequence of observations of the agent's trajectory and the current context. Letting s be an intention, v be a sequence of visible symbols, and c be a context, we want to find arg max_s p(s|v, c).
Applying Bayes' rule and simplifying, this is equivalent to finding s that maximizes p(v|s, c)p(s|c). Moreover, the above product can be further simplified to p(v|s)p(s|c). This follows from our definition of a context. Since we assume that a context only provides a distribution over intention names, knowing the current context tells us only about the relative likelihood of the possible intentions. A context says nothing about the hidden or observable states of any activity model, and so does not affect the independence assumptions of the underlying HMM. Thus, to find the most likely intention given a context c and observation sequence v, compute p(v|s)p(s|c) for each intention s, and select the one whose probability is largest. The value of p(v|s) can be calculated using the HMM that belongs to s, and p(s|c) is assumed available in the definition of the system's context.
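The selection rule argmax_s p(v|s)p(s|c) can be sketched with a scaled forward-algorithm likelihood for each intention's HMM and a context prior, as below; the HMM parameter layout (A, B, pi) and the dictionary structures are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_forward(A, B, pi, obs):
    """Scaled forward algorithm: log p(v|s) of a discrete observation sequence
    under an HMM with transition matrix A, emission matrix B and initial pi."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    log_p = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()
        log_p += np.log(c)
        alpha = alpha / c
    return log_p

def most_likely_intention(models, context_prior, obs):
    """Pick argmax_s p(v|s) p(s|c).  models maps intention name -> (A, B, pi);
    context_prior maps intention name -> p(s|c) for the current context."""
    best, best_score = None, -np.inf
    for name, (A, B, pi) in models.items():
        score = log_forward(A, B, pi, obs) + np.log(context_prior.get(name, 1e-12))
        if score > best_score:
            best, best_score = name, score
    return best
```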
3.3 Intention-Based Control
Our system is also able to dispatch behaviors on the basis of an observed agent’s intentions. To do this, the system performs inference as above, and examines both the current context and decoded internal state of the activity model. On the basis of this state, the robot performs a predefined action or set of actions.
4
Experimental Results
To test our system, we had a robot observe interactions among multiple human agents and multiple static objects in three sets of experiments. To provide a quantitative evaluation of intent recognition performance, we use two measures: – Accuracy rate = the ratio of the number of observation sequences, of which the winning intentional state matches the ground truth, to the total number of test sequences. – Correct Duration = C/T , where C is the total time during which the intentional state with the highest probability matches the ground truth and T is the number of observations.
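A small sketch of the two measures, assuming per-observation predicted and ground-truth intention labels are available (the label encoding is an assumption):

```python
import numpy as np

def correct_duration(predicted, ground_truth):
    """C/T: fraction of observations where the top-scoring intention matches."""
    predicted, ground_truth = np.asarray(predicted), np.asarray(ground_truth)
    return float((predicted == ground_truth).mean())

def accuracy_rate(final_predictions, true_intentions):
    """Fraction of test sequences whose winning intention matches the ground truth."""
    hits = sum(p == t for p, t in zip(final_predictions, true_intentions))
    return hits / len(true_intentions)
```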
The accuracy rate of our system is 100%: the system ultimately chose the correct intention in all of the scenarios in which it was tested. In the first set of experiments, the same footage was given to the system several times, each with a different context, to determine whether the system could use the context alone to disambiguate agents' intentions. We considered three pairs of scenarios: leaving the building on a normal day/evacuating the building, getting a drink from a vending machine/repairing a vending machine, and going to a movie during the day/going to clean the theater at night. Quantitative results can be seen in Table 1. The second set of experiments was performed in a lobby, and had agents meeting each other and passing each other both with and without contextual information about which of these two activities is more likely. Results for the two agents involved are given in Table 1 for both the context-free meet scenario and the context-assisted meet scenario.

Table 1. Quantitative Evaluation

Scenario (with Context)        Correct Duration [%]
Leave building (Normal)        96.2
Leave building (Evacuation)    96.4
Theater (Cleanup)              87.9
Theater (Movie)                90.9
Vending (Getting Drink)        91.1
Vending (Repair)               91.4
Meet (Unbiased) - Agent 1      65.8
Meet (Unbiased) - Agent 2      72.4
Meet (Biased) - Agent 1        97.8
Meet (Biased) - Agent 2        100.0
We tested intention-based control in two scenarios. In the first, the robot observes a human dropping a bag in hallway, recognizes this as a suspicious activity, and follows the human. In the second scenario, an agent enters an office and sets his bag on the table. Another agent enters the room and steals the bag. The observer robot recognizes the theft and signals a robot in the hallway. The second robot then finds the thief, and follows him. In both cases, the observer robot was able to correctly use intentional information to dispatch the contextappropriate behavior. Additionally, by using context to cut down on the number of intentions the robot has to consider in its maximum likelihood calculation, we improve the efficiency of our system.
5
Conclusion
In this paper, we proposed a vision-based framework for detecting and tracking humans in video. The trajectories of these entities are passed to an intent
recognition module that allows the system to infer the intentions of all of the tracked agents based on the system’s spatio-temporal context. We validated our system through several experiments, and used agents’ trajectories to correctly infer their intentions in several scenarios. Lastly, we showed that inferred intentions can be used by a robot for decision-making and control.
Acknowledgments This work was supported in part by the Office of Naval Research under grant number N00014-06-1-0611 and by the National Science Foundation Award IIS0546876.
References 1. Dalal, N., Triggs, B.: Histogram of Oriented Gradients for Human Detection. In: International Conference on Pattern Recognition, pp. 886–893 (2005) 2. Ramanan, D., Forsyth, D., Zisserman, A.: Tracking People by Learning Their Appearances. IEEE PAMI 29, 65–81 (2007) 3. Efros, A., Berg, A., Mori, G., Malik, J.: Recognizing action at a distance. In: Intl. Conference on Computer Vision (2003) 4. Stauffer, C., Grimson, W.: Learning Patterns of Activity using Real-Time Tracking. IEEE Transactions on PAMI 22, 747–757 (2000) 5. Tavakkoli, A., Nicolescu, M., Bebis, G.: Automatic Statistical Object Detection for Visual Surveillance. In: Proceedings of IEEE Southwest Symposium on Image Analysis and Interpretation, pp. 144–148 (2006) 6. Tax, D., Duin, R.: Support Vector Data Description. Machine Learning 54, 45–66 (2004) 7. Platt, J.: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: Advances in Kernel Methods - Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1998) 8. Osuna, E., Freund, R., Girosi, F.: Improved Training Algorithm for Support Vector Machines. In: Proc. Neural Networks in Signal Processing (1997) 9. Tax, D., Laskov, P.: Online SVM Learning: from Classification and Data Description and Back. Neural Networks and Signal Processing, 499–508 (2003) 10. Premack, D., Woodruff, G.: Does the chimpanzee have a theory of mind? In: Behavioral and Brain Sciences, vol. 1, pp. 515–526 (1978) 11. Iacoboni, M., Molnar-Szakacs, I., Gallese, V., Buccino, G., Mazziotta, J., Rizzolatti, G.: Grasping the Intentions of Others with One's Own Mirror Neuron System. PLoS Biol. 3(3), 79 (2005)
Unsupervised Video Shot Segmentation Using Global Color and Texture Information
Yuchou Chang1, Dah-Jye Lee1, Yi Hong2, and James Archibald1
1 Dept. of Electrical and Computer Eng., Brigham Young University, Provo, UT USA
2 Dept. of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
Abstract. This paper presents an effective algorithm to segment color video into shots for video indexing or retrieval applications. This work adds global texture information to our previous work, which extended the scale-invariant feature transform (SIFT) to color global texture SIFT (CGSIFT). Fibonacci lattice-quantization is used to quantize the image and extract five color features for each region of the image using a symmetrical template. Then, in each region of the image partitioned by the template, the entropy and energy of a cooccurrence matrix are calculated as the texture features. With these global color and texture features, we adopt clustering ensembles to segment video shots. Experimental results show that the additional texture features allow the proposed CGTSIFT algorithm to outperform our previous work, fuzzy-c means, and SOM-based shot detection methods.
1 Introduction Due to the rapid growth of multimedia databases and the increasing demand to provide online access to these databases, automated indexing of a large number of videos is a key problem in content-based video retrieval (CBVR). The efficiency of the retrieval process depends on how video shots, scenes, and events are organized in the video database [1]. Several papers in the literature [2-6] have proposed strategies for segmenting video into separate video shots around which a database can be organized. Shot detection or segmentation can generally be categorized into five classes: pixel-based, histogram-based, feature-based, statistics-based, and transform-based methods [7]. Features that help a user or machine judge if a particular frame belongs to a shot are critical for shot segmentation. Many visual features have been proposed for describing the contents of an image. Scale-invariant feature transform (SIFT) has been shown to be the most robust, invariant descriptor of local features [9, 10]. However, SIFT operates on grayscale images rather than color images that make up the vast majority of recorded videos. In this paper, we propose a clustering-based hard-cut shot segmentation algorithm for video indexing, which extends the SIFT approach by adding global, color and texture aspects. In our previous work [8], we modified SIFT to incorporate color and global space, and we obtained better results than methods based on the self organizing map (SOM) [4] and fuzzy c-means [5]. In this paper, we extend our work by including texture information as another constraint on SIFT. G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 460–467, 2008. © Springer-Verlag Berlin Heidelberg 2008
Generally, clustering-based shot detection methods use only one clustering algorithm to categorize frames into corresponding shots. Berkhin [16] classified clustering algorithms into 8 groups such as hierarchical methods, partitioning methods, gridbased methods, constraint-based clustering, and so forth. Although many clustering algorithms have been described in the literature, no one clustering method can outperform all other methods over all data sets. Each clustering method has its own advantages and disadvantages that result in different performance for different data sets. No one single clustering algorithm is the best for all data sets. Considering the success of clustering ensembles [11, 12] in machine learning in recent years, we propose a novel video segmentation method suitable for indexing at the shot level that uses clustering ensembles and incorporates texture features. Compared with existing approaches such as our previous work [8], an SOM-based method [4], and fuzzy c-means [5], the proposed method is shown to be more effective in detecting video shots.
2 The Proposed Method 2.1 Color Global SIFT SIFT uses a one-dimensional (1-D) vector of scalar values associated with each pixel as a local feature descriptor, but it cannot be used to process color images which generally consist of three-dimensional (3-D) vector values. The main difficulty of applying SIFT to color images is that no color space is able to use 1-D scalar values to represent colors. Although there are many color space conversion methods that transform 3-D RGB color values to other color spaces such as HSV, CIE Lab and so forth, the transformed color spaces still represent colors in 3-D. 24-bit color images have three color components, red, green, and blue, which are combined to generate over 16 million unique colors. Compared to a 256-level grayscale image, a color image can convey much more visual information, providing the human perceptual system with much more detail about the scene. However, humans cannot distinguish between colors that are very similar, so only a subset of the 16 million colors is useful. In order to describe color compactly yet accurately, we use Fibonacci latticequantization [13] to obtain one dimensional color indices to represent colors. Unlike three-dimensional color spaces (RGB, HSV, CIE Lab, etc.), this representation uses a single scalar value to typify each color, and the indices can be used to denote visual “distance” between their corresponding colors. From Fig. 1, it can be seen that distinct colors can appear very similar in a grayscale image (Fig. 1(b)). However, the image quantized by the Fibonacci lattice (Fig. 1(c)) displays robust discriminative colors despite using fewer colors (185) than the 256 levels of gray in Fig. 1(b). Figure 1(d) shows the color indices as a grayscale image. Based on this accurate representation of colors, we perform the normalization process in Eq. (1) on quantized indices to obtain SIFT keypoint descriptors.
I_N(x, y) = (q(x, y) − q_min) / (q_max − q_min) × 255    (1)
In Eq. (1), IN(x, y) is the normalized value at the pixel location (x, y) in the image, q(x, y) is the color index value at that pixel, and qmax and qmin are maximum and minimum symbol values within the image.
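Eq. (1) amounts to a simple rescaling of the quantization indices; a minimal NumPy sketch (assuming q_max > q_min so the division is well defined) is:

```python
import numpy as np

def normalize_indices(q):
    """Eq. (1): rescale Fibonacci-lattice colour indices to the 0-255 range."""
    q = np.asarray(q, dtype=np.float64)
    qmin, qmax = q.min(), q.max()
    return (q - qmin) / (qmax - qmin) * 255.0
```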
Fig. 1. Color image displayed as a grayscale image, its quantized indices, and quantized colors
In order to extend SIFT to global features which can better describe the contents of the whole frame, we partition the image frame into symmetric regions to extract new global features. Assume that, after performing SIFT based on Fibonacci latticequantization, one image has NI keypoints, each of which is a 128-dimension vector. We construct a template shown in Figure 2 to gather position information for constructing CGSIFT. This template consists of 24 distinct regions that increase in size as their distance from the center of the image increases. Generally, objects in the center of an image attract more attention than surrounding objects, which are generally considered to be background or trivial details. Each partition is assigned a unique identifying label. The eight regions nearest the center are labeled 1 to 8, the eight intermediate regions are labeled 9 to 16, and outermost regions are labeled 17 to 24. In each region of these 24 regions, a mean color value is calculated as follows:
Fig. 2. A new space description template for constructing CGSIFT
V_ColorMean_i = ( Σ_{(x,y)∈R_i} q(x, y) ) / NumP_i ,  where i = 1, 2, ···, 24    (2)
In Eq. (2), Ri represents partition i in the template, NumPi is the number of pixels in Ri, and q(x, y) is the color index of pixel (x, y). In a similar manner, we calculate the color variance in each region:
V_ColorVar_i = ( Σ_{(x,y)∈R_i} (q(x, y) − V_ColorMean_i)² ) / NumP_i ,  i = 1, 2, ···, 24    (3)
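A sketch of the per-region statistics of Eqs. (2)-(3) is shown below. It assumes a region_map that encodes the 24-region template of Fig. 2 as a per-pixel label; building that template itself is not shown and the variable names are illustrative.

```python
import numpy as np

def region_color_stats(indices, region_map, n_regions=24):
    """Eqs. (2)-(3): per-region mean and variance of the quantization indices.
    indices: (H, W) normalized colour indices; region_map: (H, W) labels in 1..24."""
    means = np.zeros(n_regions)
    variances = np.zeros(n_regions)
    for i in range(1, n_regions + 1):
        vals = indices[region_map == i]
        if vals.size:
            means[i - 1] = vals.mean()
            variances[i - 1] = vals.var()
    return means, variances
```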
The third component of CGSIFT is the number of keypoints in each region, denoted VNumKeypoints_i, i=1,2,···,24. Since keypoints can reflect the salient information within the image, if one region has a higher number of keypoints, it should naturally be considered as a more important part of the image frame. The next two components of CGSIFT are the mean and variance of the orientation of keypoints identified by the original SIFT in the region. These two components are calculated according to Eqs. (4) and (5):
V_OrientMean_i = ( Σ_{j=1}^{NumKey_i} o(x, y) ) / NumKey_i ,  i = 1, 2, ···, 24    (4)
NumKeyi is the number of keypoints in region i, and o(x, y) is the orientation of the keypoint within current region i. Variances of the keypoints in each region are obtained as follows:
V_OrientVar_i = ( Σ_{j=1}^{NumKey_i} (o(x, y) − V_OrientMean_i)² ) / NumKey_i ,  i = 1, 2, ···, 24    (5)
Although a SIFT keypoint is rotation invariant for each pixel, average orientation of keypoints in each region is needed to provide global information. Mean and variance of the orientation of all keypoints in each region are calculated as measures of global information. These five components of the CGSIFT (VColorMean_i, VColorVar_i, VNumKeypoints_i, VOrientMean_i, and VOrientVar_i.) are used to construct a 5×24=120-dimension feature vector of CGSIFT. Thus, CGSIFT combines color, salient points, and orientation information simultaneously, resulting in more robust operation than can be obtained using single local SIFT grayscale feature. Moreover, CGSIFT can be used as the basis for color video shot detection. 2.2 Join Texture Information with Other Features In each region of the image as shown in Fig. 2, we also extract texture information to supplement the five global features mentioned in the previous section. In order to describe the contents of the image, we employ a co-occurrence matrix [14] – a classic texture descriptor. Other texture descriptors such as wavelet based technique can also be used here. Recall that each pixel in each of the 24 regions has been assigned a Fibonacci lattice-index normalized to 0-255 using Eq. (1). Because none of the 24
regions is a regular rectangle, the indices of the void pixels are filled with zero to form a regular rectangle for each region for the calculation of a co-occurrence matrix. The co-occurrence matrix of each region is calculated as follows:

c(i, j) = Σ_{p=1}^{n} Σ_{q=1}^{m} { 1  if I(p, q) = i and I(p + Δx, q + Δy) = j;  0  otherwise }    (6)
[Δx, Δy] is set to be [0,1], [-1,1], [-1,0], and [-1,-1] respectively. These four combinations of Δx and Δy provide the rotation invariance at 0, 45, 90, and 135 degrees. After obtaining the co-occurrence matrix for each region, their entropy and energy are calculated according to Eqs. (7) and (8), producing features that describe texture information within each region:
Entropy = −Σ_{i=0}^{255} Σ_{j=0}^{255} g(i, j) log(g(i, j))    (7)

Energy = Σ_{i=0}^{255} Σ_{j=0}^{255} g(i, j)²    (8)
where g(i, j) is the normalized Fibonacci lattice quantization index in each region. Therefore, in addition to the five color features in each region, two scalar values of entropy and energy are added to the 120-dimension feature vector to form a new 7×24=168-dimension feature vector, which includes both global color and texture information for the detection of salient feature points. With the addition of texture information, we name this new feature descriptor color global texture SIFT (CGTSIFT).

2.3 Clustering Ensembles

As noted in Section 1, many different clustering methods have been used for shot detection. We use a novel clustering strategy with a clustering ensemble for shot detection. Rather than using a single clustering method, a clustering ensemble focuses on knowledge reuse [11] of the existing clustering groups so as to achieve a reasonable and accurate final partition result. In order to depict the essential CGTSIFT feature distribution as accurately as possible, we adopt random initial clustering centroids that generate different results depending on the initial centroids selected. The procedure of using a k-means single-clustering algorithm for processing a color frame consists of the following steps:
1. Determine the numbers of clusters K1, K2, ···, KM for M k-means clusterings to form M clustering results on CGTSIFT features for a set of frames (K1, K2, ···, KM are generally selected to be slightly larger than the final number of clusters).
2. For each single k-means clustering, randomly select Ki (i=1,2,···,M) CGTSIFT features of M frames as the initial clustering centroids.
3. For each frame, assign it to the group that has the closest centroid based on the Euclidean distance measure.
4. After all frames have been assigned to a group, recalculate the positions of the current clustering Ki (i=1,2,···,M) centroids.
5. Repeat Steps (3) and (4) until the centroids no longer move, then go to Step (6).
6. Repeat Steps (2), (3), (4), and (5) until M separate k-means clusterings have been created.
Using the clustering groups generated by the repeated application of the k-means single-clustering method, the ensemble committee is constructed for the next ensemble step. We use the clustering-based similarity partition algorithm (CSPA) [11] as the consensus function to yield a combined clustering. (Complete details about CSPA can be found in reference [11].) The combined clustering is used as the final partition of the video shots.
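The ensemble step can be sketched as follows. Note that CSPA proper partitions a similarity hypergraph built from the individual clusterings (Strehl and Ghosh use graph partitioning for this); the co-association matrix plus average-linkage clustering below is a simplified stand-in for that consensus function, not the authors' exact implementation, and the parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ensemble_shot_clustering(features, ks, final_k, seed=0):
    """Run several k-means partitions with different K and random initial centroids,
    then combine them through a co-association (pairwise co-membership) matrix."""
    n = len(features)
    coassoc = np.zeros((n, n))
    for m, k in enumerate(ks):
        labels = KMeans(n_clusters=k, n_init=1, random_state=seed + m).fit_predict(features)
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= len(ks)
    # treat 1 - co-association as a distance and cut the dendrogram at final_k groups
    dist = squareform(1.0 - coassoc, checks=False)
    Z = linkage(dist, method="average")
    return fcluster(Z, t=final_k, criterion="maxclust") - 1
```

Each returned label is then interpreted as a shot id for the corresponding frame.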
3 Experimental Results

The proposed feature extraction algorithm CGTSIFT was tested using three videos: "Grand Canyon", "The Future of Energy Gases", and "Campus". "Grand Canyon" and "The Future of Energy Gases" are test videos listed in TRECVID 2001. Both were downloaded from the Open Video Project [15]. "Campus" was captured on the campus of Brigham Young University. We manually segmented videos into groups of the hard-cut shots, which were used as evaluation ground truth. Fig. 3 shows six of the key frames of the "campus" video, belonging to six of the ten shots contained in the video. Table 1 summarizes relevant information about the three videos.
Fig. 3. Six key frames of the "campus" video

Table 1. Video Data Used for Comparison

Video                         # of Frames   # of Shots
Grand Canyon                  600           30
The Future of Energy Gases    220           11
Campus                        100           10
We used recall and precision as measures of performance. These are defined in the equations below:
Recall = D / (D + D_M),    Precision = D / (D + D_F)
where D is the number of shot transitions correctly detected by the algorithm, DM is the number of missed detections, and DF is the number of false detections (transitions that were detected but should not have been). A study of the number of initial clusters is beyond the scope of this paper. The number of initial clusters is specified based on prior knowledge. We set the number of cluster centroids in each componential k-means single-clustering to be 35, 13 and 12 for "Grand Canyon", "The Future of Energy Gases" and "Campus" respectively, and the final partition numbers to be 30, 11 and 10. It can be seen in Table 2 that recall and precision measures are better for the proposed method than for fuzzy c-means [5], for SOM [4], and for our previous work [8], a clustering ensemble on CGSIFT without texture information. Texture information included in CGTSIFT increases the robustness of the feature, which allows CGTSIFT to discriminate categories more accurately. Since we only incorporated texture information into CGSIFT to form the proposed CGTSIFT, the performance improvement is not obvious. More descriptors can be incorporated into the template shown in Figure 2 to further improve shot detection accuracy.

Table 2. Performance evaluation of the clustering ensemble on CGTSIFT with texture information, fuzzy c-means, SOM and our previous work, the clustering ensemble on CGSIFT without texture information
                             Clustering Ensemble on      Fuzzy C-Means        SOM                  Clustering Ensemble on
                             CGTSIFT with texture                                                  CGSIFT w/o texture
Video                        Recall    Precision         Recall    Precision  Recall    Precision  Recall    Precision
Grand Canyon                 100.0%    90.6%             66.7%     13.8%      89.7%     82.9%      96.6%     90.6%
The Future of Energy Gases   100.0%    58.8%             76.9%     17.5%      76.9%     55.6%      100.0%    58.8%
Campus                       100.0%    90.0%             69.2%     56.3%      100.0%    81.8%      100.0%    90.0%
4 Conclusion In this paper, we presented a new method of feature extraction for shot detection that adds texture information to our previous work. The proposed CGTSIFT algorithm avoids the limitations of traditional SIFT by incorporating color and texture information in global space. Evaluation using three videos shows that the proposed feature extraction and clustering ensemble algorithm outperforms our previous work and methods based on fuzzy-c means and SOM.
Unsupervised Video Shot Segmentation Using Global Color and Texture Information
467
Mikolajczyk et al. made a performance comparison of descriptors for local interest regions. We extended the local descriptor SIFT to become a global descriptor. Furthermore, we incorporated color and texture information into it. How to use graphbased techniques to extend local descriptors to global space will be a focus of our future work.
References 1. Li, Y., Kuo, C.J.: Video Content Analysis Using Multimodal Information: For Movie Content Extraction. In: Indexing and Representation. Kluwer Academic Publishers, Dordrecht (2003) 2. Liu, T., Katpelly, R.: Content-Adaptive Video Summarization Combining Queueing and Clustering. In: IEEE International Conference on Image Processing, pp. 145–148 (2006) 3. Mezaris, V., Kompatsiaris, I., Boulgouris, N.V., Strintzis, M.G.: Real-Time CompressedDomain Spatiotemporal Segmentation and Ontologies for Video Indexing and Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 14(5), 606–621 (2004) 4. Koskela, M., Smeaton, A.F.: Clustering-Based Analysis of Semantic Concept Models for Video Shots. In: International Conference on Multimedia and Expo., pp. 45–48 (2006) 5. Lo, C.C., Wang, S.J.: Video Segmentation Using A Histogram-Based Fuzzy C-Means Clustering Algorithm. In: International Conference on Fuzzy Systems, vol. 2, pp. 920–923 (2001) 6. O’Connor, N., Cziriek, C., Deasy, S., Marlow, S., Murphy, N., Smeaton, A.: News Story Segmentation in the Fischlar Video Indexing System. In: IEEE International Conference on Image Processing (2001) 7. Rui, Y., Huang, T.S., Mehrotra, S.: Constructing Table-of-Content for Videos. Multimedia Systems 7, 359–368 (1999) 8. Chang, Y.C., Lee, D.J., Hong, Y., Archibald, J.K.: Unsupervised Video Shot Detection Using Clustering Ensemble with a Color Global Scale-Invariant Feature Transform Descriptor. Color in Image and Video Processing of the EURASIP Journal on Image and Video Processing 2008, Article ID 860743, 10 pages (February 2008) 9. Lowe, D.G.: Distinctive Image Feature from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 91–110 (2004) 10. Lowe, D.G.: Object Recognition from Local Scale-Invariant Features. In: International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999) 11. Strehl, A., Ghosh, J.: Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research (3), 583–617 (2002) 12. Kuncheva, L.I., Vetrov, D.P.: Evaluation of Stability of K-Means Cluster Ensembles with Respect to Random Initialization. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(11), 1798–1808 (2006) 13. Mojsilovic, A., Soljanin, E.: Color Quantization and Processing by Fibonacci Lattices. IEEE Transactions on Image Processing 10, 1712–1725 (2001) 14. Haralick, R.M., Shanmugam, K., Distein, I.: Textural Features for Image Classification. IEEE Transactions on Systems, Man and Cybernetics 3(6), 610–621 (1973) 15. Open Video Project, http://www.open-video.org/index.php 16. Berkhin, P.: Survey of Clustering Mining Techniques, Accrue Software Inc., Technical Report (2002) 17. Mikolajczyk, K., Schmid, C.: A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), 1615–1630 (2005)
Progressive Focusing: A Top Down Attentional Vision System Roland Chapuis, Frederic Chausse, and Noel Trujillo LASMEA/Clermont Universit´e, 24 av des Landais, 63177 Aubiere Cedex, France
Abstract. The principle of a vision system based on a progressive focus of attention principle is presented. This approach considers the visual recognition strategy as an estimation problem for which the purpose is to estimate both precisely and reliably the parameters of the object to be recognized. The object is constituted of parts statistically dependent one each other thanks to a statistical model. The reliability is calculated within a bayesian framework. The case of lane sides detection for driving assistance is given as an illustration.
1
Introduction
A vision system combines an image capture device (i.e. a camera), hardware and software for image acquisition, storage and processing in order to reach a visual goal. It defines a relationship linking image raw data, sometimes called stimuli, to a synthetic representation constituting a model of the scene at a particular time. Its is characterized by the strategy implemented to define this link. Perception strategies have been defined during time within the context of images analysis. Computer Vision introduced by D. Marr [1] is a well known example. Information is processed on three levels (low, mid and high) of abstraction. This process links the image to a representation, a model, of the observed scene, organized on elementary surfaces. A lower rank level transmits its output information to the input of the upper rank level. There is no other link between two consecutive levels except this hierarchical organization. Computer vision is thus an open-loop system: the output does not react on the input. This leaves to the low level module the huge responsibility not to be mistaken while extracting useful features. With active perception ([2], [3], [4], [5]) is introduced the first closed-loop system for which a feedback strategy depending on a priori knowledge is built to control the perception. To the modular approaches, is clearly opposed a systemic approach that requires the definition of a global perception strategy. Finally, the idea of active perception suggests, through the necessity of a perception control process, the requirement for orientating the information capture to the important parts of the scene according the the goal to be reached. This suggests the necessity of a focus of attention process in the perception problem. Attentional perception requires a control process implementing a strategy of information management (choosing models, operators, data, . . . ). Three kinds of mechanisms do exist: G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 468–477, 2008. c Springer-Verlag Berlin Heidelberg 2008
1. Bottom-up control: the attentional process is independent of the goal to reach. It is guided by salient features extracted from raw sensor data. These features are used by a high level process (recognition, analysis, etc.) which infers the information required by the application.
2. Top-down control: the focus of attention is guided by the goal to be reached, which is expressed in terms related to the synthetic representation of the perceived scene. Top-down control is particularly marked by the definition of Regions Of Interest (ROIs) in the sensors data. Interesting features are more likely to be found inside these ROIs.
3. Hybrid control: both of the preceding systems interact in the perception system.
J.K. Tsotsos [6] considers the focus of attention as a main characteristic of a vision system. The main objective is to reduce the combinatorial complexity of methods scanning exhaustively the sensors data. He defines the focus of attention at several levels:
– Focus within the geometrical sensor space: control of the sensor's field of view, which means control of its pose as in active perception,
– Focus in the feature space: control of the sensor data analysis in terms of ROIs, control of the tuning (resolution, threshold, etc.) of feature extraction operators,
– Focus in the model space: control of the kind of representation of the scene,
– Focus in the decision space: control of events, of objects, of tasks to do in order for the robot to reach its goal.
Focus of attention means for the perception system to make different kinds of choices, making it possible to select:
– The feature to be observed (what),
– The ROI in which this feature is to be observed (where),
– The operator the most able to extract the feature inside the ROI (how).
The choice depends on relevance, facility, capability of the operators and likelihood or completeness of the features. Finally, an attentional process is linked to an adaptive mechanism making it possible to reconsider one stage of the perceptive analysis in order to maintain its coherency. This short presentation of well known vision systems and in particular of attentional systems intends to expose the formal framework of the solution presented in this communication. We propose a top-down system, so silence is voluntarily kept on bottom-up attentional systems such as the principle of saliency maps introduced by Koch and Ullman in the mid eighties [7], which has been largely popularized 20 years later by Itti [8] or Christensen [9]. The proposed vision system is strongly based upon these remarks and works. It is a top-down attentional system including a focus of attention in the feature space as defined by J.K. Tsotsos. It rests on a control process using the current state of the perception to orient the analysis of sensors' raw data. The objective is to extract at best useful information in order to complete the visual task.
We consider that this visual task is mainly a recognition task. Thus we propose a recursive recognition strategy called progressive focusing. The principle of this algorithm is detailed in the next section. It is illustrated in the last section by the road recognition application for which it was initially designed.
2 The Principle of Progressive Focusing
2.1 Introduction
The visual goal is considered as a recognition task aiming at extracting a representation of the object to be recognized. For that purpose, an object is assumed to be represented by a vector of n parameters denoted by X. It must be noted that the elements of X are the object's characteristic parameters (size, color, appearance) completed with its scene and/or image position. Recognition is considered here as the problem of estimating the elements of X. For that, a recursive recognition is implemented in which the iterator k denotes the recognition level or recognition depth.
2.2 Representation of an Object
For a given recognition level k, the object state is denoted by Ek and is composed of:
– a parameter vector Xk = (x1, x2, ..., xn)^T, which is an estimate of the unknown vector X at depth k,
– the covariance matrix Cxk of Xk,
– the recognition probability P(Ok), which represents the integrity of the estimation made at level k. It is thus the probability for the true vector X to fit the assumed Gaussian probability density function represented by Xk and Cxk.
Thus, for a given object, the state is the set Ek = {Xk, Cxk, P(Ok)}. This means an object is represented by an unknown vector X assumed to follow a normal law centered on Xk with covariance Cxk, with probability P(Ok). The final goal is to find the optimal state Ek minimizing the imprecision of the estimation of X while maximizing P(Ok).
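For readers who prefer a concrete notation, the following minimal Python sketch shows one possible way to hold the object state in code; the field names and the example values are our own illustrative choices, not part of the original method.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectState:
    """Object state Ek = {Xk, Cxk, P(Ok)} at recognition depth k."""
    X: np.ndarray        # estimate Xk of the n unknown parameters
    Cx: np.ndarray       # covariance matrix Cxk of that estimate
    p_object: float      # recognition probability P(Ok)

# Example: a hypothetical initial state E0 with broad uncertainty.
E0 = ObjectState(X=np.zeros(3), Cx=np.diag([1.0, 1.0, 1.0]), p_object=0.9)
```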
2.3 Initialization
Initialization aims at giving an initial value E0 to the state of the object. This is an off-line process giving E0 = {X0, Cx0, P(O0)}. P(O0) defines the probability for the unknown vector X to be inside the 1σ ellipsoid of the Gaussian law N(X0, Cx0).
2.4 Object's Parts
An object is constituted of Np parts. Part i is defined by a vector of mi parameters χi = (λ1, λ2, ..., λmi)^T and by the associated covariance matrix Σi.
For a particular part, the vector χi is the image of the vector X through a focusing function fi such that χi = fi(X) + ξi, with fi(X) = (fi1(x1, x2, ..., xn), fi2(x1, x2, ..., xn), ..., fimi(x1, x2, ..., xn))^T. Thus the different parts of an object depend on each other through the parameters of the object. ξi represents the part independence, making it possible to add some tolerance on a part with respect to the object: the ideal model represented by the function fi can then differ from reality with little risk to the recognition performance. The covariance matrix of χi is Σi = Ji Cx Ji^T, with Ji the Jacobian matrix of the function fi. The parts must be detected to achieve the recognition process. It must be considered that, at a particular level of the recognition, not all parts have the same "ability" to improve the recognition. Considering the event di corresponding to a successful detection of part i, this ability is characterized by:
– P(di|O): the probability to detect the part (or something compatible with the considered part, in a given region of interest (ROI)) knowing the object is present,
– P(di|Ō): the probability to detect the part (in the same ROI) knowing the object is not present.
These two probabilities depend on the size of the ROI, which is deduced from χi and Σi. The state of part i is then defined by ηi = {χi, Σi, P(di|O), P(di|Ō), di}, where the boolean di is true if the part is correctly detected and false otherwise. However, the state ηi of a part i depends on the state of the object. Here lies the focusing behavior of the method: the state Ek of the object at recognition level k constrains the state ηi of the part. Consequently, k cannot be greater than the number of parts Np.
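As an illustration of how a part state can be predicted from the object state, the sketch below propagates Xk and Cxk through a focusing function fi using a numerically estimated Jacobian. The finite-difference Jacobian is our own simplification (the paper does not say how Ji is obtained), and the example fi is purely hypothetical.

```python
import numpy as np

def numerical_jacobian(f, X, eps=1e-6):
    """Finite-difference Jacobian of f at X (illustrative; an analytic Ji can be used instead)."""
    y0 = np.atleast_1d(f(X))
    J = np.zeros((y0.size, X.size))
    for j in range(X.size):
        dX = np.zeros_like(X)
        dX[j] = eps
        J[:, j] = (np.atleast_1d(f(X + dX)) - y0) / eps
    return J

def predict_part(f, X, Cx, Xi_cov=None):
    """Predicted part state: chi_i = f_i(X) and Sigma_i = Ji Cx Ji^T (plus an optional
    covariance for the independence term xi_i)."""
    chi = np.atleast_1d(f(X))
    J = numerical_jacobian(f, X)
    Sigma = J @ Cx @ J.T
    if Xi_cov is not None:
        Sigma = Sigma + Xi_cov
    return chi, Sigma, J

# Hypothetical 2-parameter object and a part that observes the sum and difference of its parameters.
f_i = lambda X: np.array([X[0] + X[1], X[0] - X[1]])
chi, Sigma, J = predict_part(f_i, np.array([1.0, 2.0]), np.diag([0.1, 0.2]))
```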
2.5 Evolution of the Recognition: Progressive Focus of Attention
Let us recall that the recognition is considered as an estimation process of X. Thus the algorithm continuously pursues two objectives:
– obtaining a good precision of the estimation,
– keeping the integrity of the estimation.
Figure 1 shows the flow chart of the progressive focusing algorithm. The different stages are:
– Initialization: definition of the initial object state for depth k = 0, E0 = {X0, Cx0, P(O)0},
– Part selection: parts are detected sequentially, one per stage. The selection determines the part that "best" improves the recognition,
[Flow chart of Fig. 1: Initialization produces E0; then, for each recognition depth k, Part selection (ηi, k|k) is followed by Focusing (ηi, k+1|k), Detection (ηi, k+1|k+1) and State update (Ek+1); if the recognition criterion is met the loop ends, otherwise k is incremented and the loop returns to Part selection.]
Fig. 1. Chart describing the progressive focusing principle
– Focusing: for the selected part, optimization of its internal state such that the detection is applied within a probable interval of its state. Focusing on a part consists in making the hypothesis that the object is situated in a smaller zone than the one in which it was previously assumed to be,
– Detection: once the focusing is finished, the process tries to detect the part. The detection can succeed or not. In the positive case, the detection provides an observation making it possible to update the vector χi of the part as well as its covariance Σi,
– State update: if the detection is successful, the probability P(O|di) of the object being present, as well as its state vector and covariance matrix, are updated. Only P(O|di) is computed if the detection fails,
– Object recognition: the sequence Part Selection → Focusing → Detection → Update stops when a recognition criterion is reached; this criterion is application dependent (a schematic sketch of this loop is given below).
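The following Python pseudocode is a minimal sketch of the control loop described above. The five callables stand for the application-dependent stages detailed in Sect. 2.6; they are placeholders of our own, not the authors' implementation.

```python
def progressive_focusing(E0, parts, select_part, focus, detect, update_state,
                         recognized, max_depth=None):
    """Generic progressive-focusing loop (stages as in Fig. 1)."""
    E = E0                                   # object state Ek
    max_depth = max_depth or len(parts)      # k cannot exceed the number of parts Np
    for k in range(max_depth):
        i = select_part(E, parts)            # part "best" improving the recognition
        eta_pred = focus(E, parts[i])        # predicted part state eta(k+1|k)
        eta = detect(eta_pred)               # attempt detection inside the ROI
        E = update_state(E, eta)             # Bayesian update of Ek -> Ek+1
        if recognized(E):                    # application-dependent stopping criterion
            return E, True
    return E, False
```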
2.6 Detailed Description of the Progressive Focusing Stages
Part Selection. The purpose is to choose the part to be detected. In an ideal situation, the recognition should be precise (good estimation of X) and sure (high P(O)). A successful detection should increase the precision as well as the probability for the estimated value of the vector X to really be inside the interval of the state space of X corresponding to that precision. This is a compromise between precision and integrity, the ideal being to be sure that the estimated X lies in a narrow interval. The part selection criterion has to deal with both of these aspects. The criterion should also take into account the probability of the part being detected: there is no need to spend time trying to detect a part that has little chance of being detected. Finally, the selected part i is the one maximizing the criterion Ci = P(O|di) P(di) Pp,
with:
P(O|di) = P(di|O) P(O) / P(di), where P(di) is the probability to detect part i,
Pp = ∫I N(Xk+1, Cxk+1) dX, the requested precision of the estimation in the corresponding interval I of the part state space. This is a tolerance interval corresponding to prior functional specifications.
P(O) being the same for all parts, the criterion reduces to Ci = P(di|O) · Pp. P(di|O) is determined in the focusing stage (detailed further). To calculate Pp, two stages are necessary:
1. determine the covariance matrix Cxk+1 that would result once the part is detected:
Cxk+1 = (In×n − Kk+1 J) Cxk,   with Kk+1 = Cxk J^T (J Cxk J^T + Σk+1|k+1)^(−1),
2. deduce the probability for the vector Xk+1 to be inside the interval I:
Pp = ∫ from Xk+1 − I/2 to Xk+1 + I/2 of N(Xk+1, Cxk+1) dX.
This last computation is not straightforward except in the case of a rectangular interval (independent variables). To circumvent this problem, the interval I is considered as Gaussian rather than uniform:
I(X) = exp( −(1/2) (X − Xk+1)^T CI^(−1) (X − Xk+1) ),
with CI the covariance matrix defining the precision to reach in the state space of X. Finally,
Pp = 1 / sqrt( |CI^(−1) Cxk+1 + In×n| ),
with In×n the identity matrix of dimension n.
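A minimal sketch of this selection criterion is given below, assuming the Gaussian-interval approximation above; the dictionary keys used for the candidate parts (J, Sigma, p_d_given_O) are our own naming, not the paper's.

```python
import numpy as np

def precision_gain(Cx_pred: np.ndarray, C_I: np.ndarray) -> float:
    """Pp = 1 / sqrt(|CI^-1 Cx_{k+1} + I|): score that the estimate falls
    inside the (Gaussian-shaped) tolerance interval defined by C_I."""
    n = Cx_pred.shape[0]
    return 1.0 / np.sqrt(np.linalg.det(np.linalg.inv(C_I) @ Cx_pred + np.eye(n)))

def predicted_covariance(Cx: np.ndarray, J: np.ndarray, Sigma: np.ndarray) -> np.ndarray:
    """Covariance of X after a hypothetical successful detection of a part
    whose focusing function has Jacobian J and measurement covariance Sigma."""
    K = Cx @ J.T @ np.linalg.inv(J @ Cx @ J.T + Sigma)
    return (np.eye(Cx.shape[0]) - K @ J) @ Cx

def select_part(Cx, C_I, parts):
    """Pick the part maximizing Ci = P(di|O) * Pp."""
    scores = [p["p_d_given_O"] *
              precision_gain(predicted_covariance(Cx, p["J"], p["Sigma"]), C_I)
              for p in parts]
    return int(np.argmax(scores))
```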
Focusing on Parts. Considering the current state of the recognition Ek = {Xk, Cxk, P(Ok)}, focusing consists in predicting the state ηk+1|k for each part:
ηk+1|k = {χk+1|k, Σk+1|k, P(d|O)k+1|k, P(d|Ō)k+1|k, d}.
This predicts the state of the part from the current state of the object (the part index i is omitted here for clarity):
– χk+1|k = f(Xk),
– Σk+1|k = Jk Cxk Jk^T,
– P(d|O)k+1|k and P(d|Ō)k+1|k: these two probabilities are specific to each part and depend on the analyzed interval (defined around χk+1|k thanks to the covariance matrix Σk+1|k). P(d|O)k and P(d|Ō)k are assumed to be known for a given interval of fixed size (they can be obtained from dedicated training phases). The main effect is to reduce the sizes of the future ROIs, which is the most visible characteristic of the progressive focusing (Figure 4),
– d = 0: the part is not detected yet.
Parts Detection. After focusing on a part, its detection is attempted. Detection consists in applying a low-level processing (i.e., a detector) in the ROI of the part. The detection sets the boolean d of the part if the detection succeeds or
clears it if not. In the successful case (d = true), the state vector χ and its covariance are updated. After detection, the part's state becomes
ηk+1|k+1 = {χk+1|k+1, Σk+1|k+1, P(d|O)k+1, P(d|Ō)k+1, d}
with:
1. If the detection succeeds:
– χk+1|k+1: new values provided by the detector,
– Σk+1|k+1: their covariance matrix,
– P(d|O)k+1 = P(d|O)k: unchanged,
– P(d|Ō)k+1 = P(d|Ō)k: unchanged,
– d = 1.
2. If the detection fails:
– χk+1|k+1 = χk+1|k: unchanged,
– Σk+1|k+1 = Σk+1|k: unchanged,
– P(d|O)k+1 = P(d|O)k: unchanged,
– P(d|Ō)k+1 = P(d|Ō)k: unchanged,
– d = 0.
Update of the Object's State. As soon as a part is processed (detected or not), the object's state Ek+1 is updated. Two cases are possible:
1. If the last part detection is successful (d = true), then:
Xk+1 = Xk + Kk+1 (χk+1|k+1 − f(Xk)),
Kk+1 = Cxk Jk^T (Jk Cxk Jk^T + Σk+1|k+1)^(−1),
Cxk+1 = (In×n − Kk+1 Jk) Cxk,
P(Ok+1) = P(d|Ok) · P(Ok) / [ P(d|Ok) · P(Ok) + P(d|Ōk) · (1 − P(Ok)) ].
2. If the last part detection has failed (d = false), then:
Xk+1 = Xk,   Cxk+1 = Cxk,
P(Ok+1) = (1 − P(d|Ok)) · P(Ok) / [ (1 − P(d|Ok)) · P(Ok) + (1 − P(d|Ōk)) · (1 − P(Ok)) ].
End of the Recognition. Deciding whether an object is recognized or not is not so simple. An object could be considered as recognized if all of its parts have been successfully detected, but detectors are weak: false positives can happen at any time. That is why the same kind of criterion as in the part selection is used. The object is considered as recognized if there exists a recognition depth k such that P(O|d) Pp > Sr, with:
Pp = ∫I N(Xk, Cxk) dX,
where I is the interval of the state space in which the values of the object's parameters must lie with respect to the objective of the application, and Sr is the threshold associated with these applicative specifications.
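The update equations above can be summarized in a short Python sketch; the default detection probabilities are illustrative values only, and the observation quantities (chi, Sigma, J, f) are required only when a detection succeeded.

```python
import numpy as np

def update_object_state(X, Cx, P_O, detected,
                        chi=None, Sigma=None, J=None, f=None,
                        p_d_O=0.9, p_d_notO=0.05):
    """Bayesian update of the object state Ek = {X, Cx, P(O)} after one part has been processed."""
    if detected:
        # Kalman-style correction using the detected part observation chi
        S = J @ Cx @ J.T + Sigma
        K = Cx @ J.T @ np.linalg.inv(S)
        X = X + K @ (chi - f(X))
        Cx = (np.eye(len(X)) - K @ J) @ Cx
        P_O = p_d_O * P_O / (p_d_O * P_O + p_d_notO * (1.0 - P_O))
    else:
        # state and covariance unchanged; only the recognition probability is revised downward
        P_O = (1 - p_d_O) * P_O / ((1 - p_d_O) * P_O + (1 - p_d_notO) * (1.0 - P_O))
    return X, Cx, P_O
```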
3 Example of the Evolution of the Progressive Focusing for a Particular Application
Lane side detection, necessary for current applications in driving assistance systems, is considered here as an example to illustrate the progressive focusing algorithm. The purpose is to estimate the lateral position X0 of the vehicle on its lane as well as its yaw angle ψ. Some other parameters, such as the lane width L, the curvature Ch and the camera inclination angle α, are also estimated (Figure 2).
Fig. 2. Road/vehicle model: the parameters to be estimated
Thus for that application, the vector X is defined as X = (X0 , ψ, α, Ch , L)T . The parts are constituted of lane side abscissas ui on the image plane corresponding to fixed rows vi (figure 3).
Fig. 3. The parts: lane sides points on the image plane
The parts are linked to the object's parameters (i.e., the elements of the vector X) through the focusing function of equation (1):
ui = eu [ ev z0 Ch / (2 (vi − ev α)) − (vi − ev α) (X0 + βL) / (ev z0) + ψ ]     (1)
with β = 1 for the right lane side and β = 0 for the left side; eu and ev are the camera intrinsic parameters.
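A minimal sketch of this focusing function is given below; the intrinsic parameters eu, ev, the constant z0 (the camera height in this kind of road model) and the example state values are placeholders of our own, not the paper's settings.

```python
import numpy as np

# Hypothetical camera intrinsics and camera height, for illustration only.
e_u, e_v, z0 = 800.0, 800.0, 1.5     # pixels, pixels, metres

def lane_column(v, X, side_right: bool) -> float:
    """Focusing function of eq. (1): image column u of a lane side at image row v,
    given the road/vehicle state X = (X0, psi, alpha, Ch, L)."""
    X0, psi, alpha, Ch, L = X
    beta = 1.0 if side_right else 0.0
    d = v - e_v * alpha
    return e_u * (e_v * z0 * Ch / (2.0 * d) - d * (X0 + beta * L) / (e_v * z0) + psi)

# Example: predicted columns of the right lane marking for a few fixed rows v_i.
X = np.array([1.8, 0.0, 0.05, 1e-4, 3.5])   # X0 [m], psi [rad], alpha [rad], Ch [1/m], L [m] (made-up values)
cols = [lane_column(v, X, side_right=True) for v in (300, 350, 400, 450)]
```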
Fig. 4. Illustration of the progressive focusing on the analysis of a single image of a road scene for lane sides detection application and for 3 consecutive recognition depths (k = 0 . . . 3)
So, for that application, χ = (u1, u2, ..., u2n). The covariance matrix Σ of χ is calculated using the Jacobian matrix of equation (1) with respect to X0, ψ, α and L. The initialization is simply done by considering average values for X0, ψ, α and L and their acceptable (realistic) variations. This section is dedicated to the illustration of progressive focusing; thus, for simplicity, the initial probability P(O)0 is taken equal to 1. The part detection is achieved using classical edge detection. The edge points are fitted to a straight line using the median least squares method (a simplified sketch of such a robust fit is given at the end of this section). For this very basic detector, the probability P(d|O) depends on the category of road: with white lines, the detection succeeds very often if the line is not occluded. Actually, this depends mainly on the size of the ROIs: progressive focusing involves ROIs that become narrower and narrower, which steadily increases the signal-to-noise ratio. This is one important benefit of the method. For P(d|Ō), it must be considered that the detector practically never returns anything if there is no lane side; this probability is therefore fixed to a value close to zero. The evolution of the progressive focusing on an image taken by an on-board camera in a highway context is presented in Figure 4 for recognition depths k = 0 (initialization) to k = 3. Many experiments with progressive focusing have been conducted for this application, including a real-time implementation and tests in real traffic conditions ([10]).
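The sketch below is a generic random-sampling, least-median-of-squares style line fit of the kind that could implement the robust fit mentioned above; it is a simplified stand-in and not the authors' exact detector.

```python
import numpy as np

def lms_line_fit(u, v, n_trials=200, rng=None):
    """Robust fit of a line u = a*v + b to edge points inside an ROI by minimizing
    the median of squared residuals over randomly sampled point pairs."""
    rng = np.random.default_rng(rng)
    u, v = np.asarray(u, float), np.asarray(v, float)
    best, best_med = None, np.inf
    for _ in range(n_trials):
        i, j = rng.choice(len(u), size=2, replace=False)
        if v[i] == v[j]:
            continue
        a = (u[i] - u[j]) / (v[i] - v[j])
        b = u[i] - a * v[i]
        med = np.median((u - (a * v + b)) ** 2)   # median of squared residuals
        if med < best_med:
            best, best_med = (a, b), med
    return best

# Example on synthetic edge points around u = 0.5*v + 20, with one outlier.
v = np.arange(200, 260, 5)
u = 0.5 * v + 20.0
u[3] += 40.0
a, b = lms_line_fit(u, v, rng=0)
```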
4 Conclusion
The progressive focusing vision system has been presented here. It is a visual recognition strategy aiming at estimating the parameters defining an object while optimizing both the precision and the reliability of the estimation. The example of road side detection in an image illustrates a focus of attention that progresses iteratively. The object model is constituted of several parameters linked to several parts through a focusing function. The current evolution of the method concerns the part selection stage, which is being formalized within the same Bayesian framework as the rest of the strategy.
References
1. Marr, D.: Vision. W.H. Freeman and Company, New York (1982)
2. Bajcsy, R.: Active perception. Proc. IEEE 76, 996–1005 (1988)
3. Ballard, D.H., Brown, C.M.: Principles of animate vision. CVGIP 56, 3–21 (1992)
4. Ballard, D.H.: Animate vision. Artif. Intell. 48, 57–86 (1991)
5. Davison, A., Murray, D.: Simultaneous localization and map-building using active vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 865–880 (2002)
6. Tsotsos, J., Culhane, S., Wai, W.Y.K., Lai, Y., Davis, N., Nuflo, F.: Modeling visual attention via selective tuning. Artificial Intelligence 78, 507–545 (1995)
7. Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology 4, 227–229 (1985)
8. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1254–1259 (1998)
9. Ramström, O., Christensen, H.: Object detection using background context. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004). IEEE Computer Society, Los Alamitos (2004)
10. Aufrère, R., Chapuis, R., Chausse, F.: A model-driven approach for real-time road recognition. Machine Vision and Applications 13, 95–107 (2001)
The Benefits of Co-located Collaboration and Immersion on Assembly Modeling in Virtual Environments
David d’Angelo, Gerold Wesche, Maxim Foursa, and Manfred Bogen
Fraunhofer IAIS, Virtual Environments, Sankt Augustin, Germany
{david.d-angelo,gerold.wesche,maxim.foursa,manfred.bogen}@iais.fraunhofer.de
Abstract. In this paper we present a quantitative assessment of user performance in an assembly modeling application. The purpose of the evaluation is to identify the influence of immersion and collaboration on the performance in assembly and manipulation tasks in a virtual environment (VE). The environment used in this study is a projection-based system that supports 6DOF head and hand tracking for two independent users. In four experiments we compare the performance in stereoscopic and monoscopic viewing modes and in collaborative and single-user interaction modes. The user study is based on a realistic, application-oriented modeling scenario from the manufacturing industry. The task completion time is used as the performance metric. The results of this study show that collaborative interaction and stereoscopic viewing are not only preferred by users, but also have a significant effect on their performance.
1 Introduction
The use of immersive Virtual Environments for design-engineering applications is still not very common. One of the main reasons is that the effort needed to implement an immersive system with dedicated 3-D user interfaces and rich functionality is enormous when compared to the implementation of desktop systems. However, simple assembly modeling scenarios can be realized with moderate effort by keeping user interfaces intuitive and by assuming the availability of primitive parts. The task of constructing a composite model from selectable predefined parts naturally maps to manual, direct 3-D interaction that resembles the way humans work in the physical world. Despite the fact that people perform manual labor routinely in a collaborative way, most 3-D modeling environments support only single-user interaction. In order to assess the advantages of co-located collaborative interaction, we implemented a two-user assembly modeling system which allows constructing complex models from pre-defined parts. Previous studies rated the importance of stereopsis ambiguously. According to McMahan [1], stereoscopic vision is not essential for object manipulation tasks and does not provide any significant benefit. In contrast to this, Narayan et al. [2] reported that for a collaborative manipulation task, users were more efficient
with stereoscopic vision than with monoscopic vision. As a result of the ambiguity of these conclusions, we tested our assembly task with single- and two-user interaction and with monoscopic and stereoscopic vision. Our hypothesis was that stereoscopic vision and co-located collaborative two-user interaction are better for solving a complex compound assembly task, which requires users to precisely locate, transform and align objects at specific target positions in space. In the next section we present an overview of related work and directly compare it to our work. In section 3 we describe the objectives of the conducted user study, the test setup, and the experiment procedure. The statistical analysis along with a discussion of the obtained results is given in section 4.
2 Related Work
The main contribution of this paper is a user study investigating simultaneous interaction of two co-located users performing a compound assembly task in a 3-D projection-based environment with and without stereoscopic vision. Only a few papers present user studies involving cooperative manipulation. Pinho et al. [3] performed a study of cooperative object manipulation based on rules. They found out that cooperative manipulation can provide increased usability and performance for complex tasks. However, these results were preliminary, because they did not involve any objective performance measurements. Compared to this work, the users in our environment separate tasks on the basis of verbal agreement due to the fact that they are co-located, i.e. they can directly see each other and communicate in a face-to-face manner. This kind of collaboration support was first introduced by Agrawala et al. [4]. Narayan et al. [2] investigated the effects of immersion on team performance in a collaborative environment for spatially separated users. One user wore a monoscopic head-mounted display (HMD), whereas the other user was interacting in a CAVETM [5] environment, in which immersive factors, such as head-tracking and stereoscopic vision, could be separately enabled or disabled. Narayan et al. also varied the levels of immersion for a collaborative task. In addition to this approach, we tried to find out if co-located collaboration is more effective than single-user interaction. Narayan et al. also noticed that stereoscopic vision has a significant effect on team performance, unlike head tracking, which has no notable effect on the performance at all. Apart from the mentioned works we did not find any other comparative user studies of assembly modeling tasks in collaborative virtual environments. A plausible reason for this could be that true multi-user projection-based systems are still rare due to technical difficulties or high costs. For example, the following projection-based setups provide true multi-viewer support: The two-user Responsive Workbench [4], the multi-viewer display system of the University of Weimar [6], and the TwoView display system [7] used in our study. There are many user studies that compare the performance of different display systems for very simple object manipulation tasks. The results achieved so far are inconsistent and seem to depend on the kind of task to be performed. Swan II
et al. [8] designed several task sets, which had to be performed at a workbench, in a CAVETM , at a standard desktop computer and at a projection wall. In their user study, the desktop-based system performed best, followed by the CAVETM and the projection wall. The workbench was ranked as the worst system, although the authors admitted that this is most probably a result of the poor projector they had used to drive the workbench. Furthermore, most tasks in this user study were text-based, so it seems natural that a high-resolution desktop screen can be superior to projection-based systems. Other studies indicate the superiority of immersive displays for manipulative tasks. Figueroa et al. [9] compared the user performance of placement tasks in different VR settings: a standard desktop PC, a large 2D smart board, an HMD and a space-mouse-driven desktop PC. The HMD performance was the best, whereas the PC scored worst, although the difference was not significant. In [10], Ni et al. explored the effect of large-sized high-resolution displays on the user task performance of navigation and wayfinding tasks. They showed that large displays are beneficial, because they facilitate both spatial perception and information gathering.
3 System Design
3.1 System Overview
For the user study, we used an immersive assembly-based modeling system called VEAM (Virtual Environment Assembly Modeler), which was developed by Foursa et al. [11]. VEAM provides rapid prototyping and product customization functionalities, which allow designing a broad range of products from a limited set of basic mature part models. The assembly algorithms of VEAM do not rely on the geometric representation of parts as the criterion for possible part connections, since polygonal geometric models are only approximate and therefore do not represent the real parts precisely. Instead, VEAM uses a grammar-based approach to encode the part connectivity information. The basis of this approach is the so-called handle concept. A handle is a part attribute which specifies its connection possibilities in conjunction with a particular connection semantic. Different types of handles are implemented in VEAM to provide distinct connection semantics. Matching relations between handles are specified by a grammar module, which can be modified at runtime. In general, handles are used to define the set of models that can be constructed out of a set of parts. Moreover, they assist a user during the actual process of assembling these parts. VEAM is completely integrated into the INT-MANUS [12] manufacturing process chain. As a visual display, we used the TwoView display system [7] developed at the Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme (IAIS). It is a projection-based Virtual Environment that is capable of displaying two active stereoscopic image pairs at the same time, providing the possibility of co-located two-user collaboration. Each of the two stereoscopic image pairs is generated by a single active-stereo DLP projector equipped with polarization filters for the
separation of the two image pairs. The shutter glasses have corresponding filters so that each user sees his/her own stereo projection. We used a 6DOF infrared tracking system in an outside-in configuration for tracking the users' heads and input devices. VEAM supports a variety of input devices; however, we only used stylus-type devices, which we equipped with the passive retroreflective markers needed for the 6DOF tracking.
3.2 Experiment Objectives
We proposed two hypotheses for increasing the efficiency of assembly tasks in virtual environments: first, we assumed that a given task can be accomplished faster by two collaborating users than by a single user, and second, we assumed that stereoscopic viewing is superior to monoscopic viewing with respect to user task performance. A compound assembly task is expected to benefit from collaboration, since a single user can only handle one part at a time, whereas two users can solve the task by assembling parts in parallel. Furthermore, two users can communicate about the best strategy to reach and align the parts properly, as in reality, where two persons perform manual activities together. Theoretically, stereoscopic viewing should also help users precisely locate the correct positions when connecting parts, particularly in the depth dimension of the environment. This should help users drag parts to their target positions more quickly, without the need to rotate the model to resolve depth perception problems.
3.3 Experimental Task Description
All participants had to assemble a table out of a table plate and four table legs and place it at a specific position on a floor plate. This involves five docking tasks: four tasks are required for the assembly of the table legs and the table plate and the fifth task corresponds to the placement of the assembled table onto the floor. These docking tasks consist of the positioning and the alignment of an object at a specified target location. They are frequently used to evaluate the performance of three-dimensional interaction techniques and input devices [13,14,15]. However, in many cases only one atomic docking task without any relation to a concrete application is used. The advantage of this approach is that such tasks can be designed to increase the exactness of measurements and statistical methods. The disadvantage is that only the performance of isolated tasks is measured, whereas VE applications typically require successive execution of several distinct tasks. In order to establish more realistic, application-oriented conditions for performance measurements, we used a more complex compound assembly task in our experiment. The assembly task requires the users to visually locate the position of parts, identify their target locations, select individual parts and move them to the target positions while continuously relying on the spatial perception to visually check the results of actions.
Fig. 1. Image taken during the training phase of the collaborative two-user modeling task with monoscopic vision
3.4 Experiment Procedure
The user study was carried out in the display laboratory of Fraunhofer IAIS and involved twenty users. The users were research scientists and student assistants or trainees. Sixteen of the participants were male and four female, aged from 20 to 42. Their proficiency with Virtual Reality applications ranged from novice to expert. Each user had to complete the experimental task in four modes: in single-user and collaborative two-user modes, with stereoscopic and monoscopic vision for each mode. Head tracking was active in all four modes. In the collaborative mode, each user could perceive the correct perspective image, supported by individual, simultaneous head tracking. To create equal conditions for all users, the pairs of participants working collaboratively were chosen at random. All users had to do ten repetitions of all four modes, and the sequence in which the users had to perform these tasks was randomly selected. In each assembly task, the modeling parts were randomly positioned in space, while the sum of the inter-object distances was kept constant for all initial configurations. The timer clock, which measured and logged the task completion time, was built into the application. It started and stopped automatically with the first selection and the final assembly operation, respectively. Prior to each of the four test runs, the users had the possibility to practice the task without time limits. To introduce as little distraction as possible, only the evaluator and the participants were present in the room during the entire procedure.
Table 1. Minimal, maximal and average task completion times (in seconds), their standard deviations, and the relative speed-up factors of each assembly mode

Assembly mode                   Min. time  Max. time  Average time  Std. dev.  Relative speed-up factors
                                                                               (C2)   (C1)   (C3)   (C4)
Single-user monoscopic (C2)       10.52      76.94        26.45       10.64    1.00   1.41   1.63   2.32
Single-user stereoscopic (C1)      6.96      40.23        18.72        5.67    0.71   1.00   1.15   1.63
Two-user monoscopic (C3)           8.30      34.10        16.22        5.51    0.61   0.87   1.00   1.43
Two-user stereoscopic (C4)         5.30      25.47        11.40        3.81    0.43   0.61   0.70   1.00
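To make the relationship between the columns explicit, the short snippet below reproduces the relative speed-up factors as ratios of the average completion times, which is how we read the table (row order as in Table 1).

```python
import numpy as np

# Average task completion times in seconds, in the row order of Table 1.
avg_times = np.array([26.45, 18.72, 16.22, 11.40])

# speedup[i, j] = t_i / t_j: how much faster mode j is compared to mode i.
speedup = avg_times[:, None] / avg_times[None, :]
print(np.round(speedup, 2))   # reproduces the speed-up block of Table 1 (up to rounding)
```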
After the completion of the experimental tasks, the users had to fill out a two-page questionnaire containing questions about their subjective impressions and preferences. In addition, the questionnaire contained a free-text annotation field where a user could make suggestions for further improvement and express criticism. The duration of the entire user test, including the time needed for the completion of the questionnaire, was about 20-30 minutes per user. Figure 1 shows two users during the training phase of the collaborative assembly task with monoscopic vision.
4 Analysis and Discussion
In this user study, we collected two different data types. The first type is the user task performance measured in seconds, which we analyzed statistically. The second type is the subjective user preferences we obtained after processing the questionnaires.
4.1 Task Performance
Our first hypothesis was that collaborative two-user modeling is more efficient than single-user modeling. Using a one-way analysis of variance (ANOVA) with an alpha level of 0.05, we detected a statistically significant effect (F(1,598) = 151.78, p < 0.00001) of single- versus two-user modeling on the task completion time. On average, users were able to solve the assembly task approximately 1.6 times faster with collaboration than without. Table 1 shows the minimal, maximal and average times as well as the standard deviations of the task completion times and the relative speed-up factors of each assembly mode. Four box plots of the measured task completion times are shown in Figure 2. These plots make it clear that the users solved the assembly task with the highest efficiency when using collaborative two-user interaction with stereoscopic vision (C3), while single-user interaction with monoscopic vision performed the worst (C2). Another interesting observation with respect to collaborative interaction is that the standard deviation of the measured task completion times of the two-user tasks is smaller than that of the single-user tasks (as shown in Table 1). This indicates that users were
Fig. 2. Box plots illustrating the measured task completion times. The median value of each configuration is shown as a dotted line; mild outliers are drawn as “+” and extreme outliers as “o”. C1 denotes the single-user stereoscopic, C2 the single-user monoscopic, C3 the two-user stereoscopic and C4 the two-user monoscopic modeling mode.
able to perform the task in a more stable way in collaboration. According to our observations, this is directly connected to the possibility of communicating with the other user, discussing decisions, and aiding one another in difficult situations. Our second hypothesis was that stereoscopic viewing is beneficial for solving the assembly task. Stereoscopic viewing had a statistically significant effect on the task completion times in the single-user case (F(1,358) = 70.27, p < 0.00001) as well as in the two-user case (F(1,178) = 42.69, p < 0.00001). In both the single-user and the two-user case, the users solved the assembly task on average approximately 1.4 times faster with stereoscopic vision than with monoscopic vision. This shows that stereoscopic vision is beneficial for complex docking tasks in Virtual Environments. A two-way ANOVA of the data did not show any statistically significant interaction effect between the collaborative/single-user and stereo/mono dimensions (F(1,279) = 4.99, p < 0.026).
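The following sketch only illustrates how such a one-way ANOVA on logged completion times could be computed with SciPy. The data are synthetic stand-ins, and the 400/200 split is assumed purely so that the total of 600 trials matches the reported degrees of freedom F(1,598); it is not taken from the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic completion times (seconds) for single-user and two-user trials.
single_user = rng.normal(loc=22.5, scale=8.5, size=400).clip(min=5.0)
two_user = rng.normal(loc=13.8, scale=5.0, size=200).clip(min=5.0)

F, p = stats.f_oneway(single_user, two_user)      # one-way ANOVA, alpha = 0.05
df_within = len(single_user) + len(two_user) - 2
print(f"F(1,{df_within}) = {F:.2f}, p = {p:.3g}")
print("speed-up:", single_user.mean() / two_user.mean())
```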
4.2 User Preferences
The subjective user preferences were collected by means of the questionnaire. All answers were given as grades on a five-point scale (strongly agree, agree, neutral, disagree, strongly disagree). In one of the questions, users were asked to judge how well and how intuitively the task could be solved with the four different techniques and configurations. Most of the users rated collaborative two-user
modeling as the best (mean of all answers = 1.28, variance of all answers = 0.61) and single-user monoscopic as the worst (mean = 3.28, variance = 0.73). When asked whether stereoscopic vision provided a benefit for solving the task, they stated that stereopsis was very helpful for visually locating the target position (mean = 1.51, variance = 0.71) and during the translation and orientation of the part in space (mean = 1.57, variance = 0.70). The results of the statistical analysis and the conclusions drawn from them are completely consistent with the collected subjective user preferences. In particular, the statistical finding that the task could be solved with the highest efficiency using collaborative two-user interaction with stereoscopic vision clearly conforms to the user preferences, since most users (19 out of 20) preferred this configuration.
4.3 User Behavior
We also observed the user behavior during the execution of the experiment. We were especially interested in examining whether users tend to apply different assembly strategies in different test runs and assembly modes. A noteworthy observation was that most of the users applied the same assembly sequence for solving the task with monoscopic viewing, regardless of the initial position of the modeling parts. In contrast to this, the users tended to vary their assembly sequences depending on the initial position of the parts when they were provided with stereoscopic vision. This indicates that stereoscopic vision helped the users perceive the spatial context and therefore increased the efficiency. We found that in most of the collaborative two-user tests, the users agreed on a specific strategy to solve a task beforehand. In case this strategy later turned out to be suboptimal, most users immediately discussed new tactics on how to improve the efficiency, unlike in the single-user tasks, where the users did not tend to alter their strategy as frequently in order to find the most efficient solutions. This approach of continuously improving the efficiency and exploring new strategies resulted directly from the opportunity to work collaboratively with another user. Thus, co-located collaboration is a helpful technique for increasing the performance and exploring new ways of solving problems.
5 Conclusion and Future Work
In this paper we presented a comparative evaluation of immersive single-user and collaborative two-user assembly techniques and determined the impact of stereoscopic and monoscopic viewing on user task performance. We described the exact task, the test procedure, and the evaluation results in detail. The results clearly showed that collaborative interaction can provide an average speed-up factor of approximately 1.6 as an additional benefit for immersive assembly modeling systems. Our hypothesis that stereoscopic viewing is important for users while working on a complex assembly task has been confirmed, not only by the measured
task completion times, but also by the subjective user preferences collected with the questionnaire. Most users favored stereoscopic viewing for solving the modeling task, because it helped them perceive the three-dimensional modeling space, which is especially important for assembly tasks. In the future, we plan to further improve the efficiency of the collaborative two-user modeling functionality of the VEAM system. A first promising approach would be the integration of techniques and mechanisms to split complex manipulation tasks into small subtasks, which can be accomplished by each of the involved users independently. This splitting could be based on the separation of the degrees of freedom of the manipulation, as proposed in [3], or on a specific sequencing of the subtasks. Co-located collaboration can be further enhanced by supporting specialized views for each of the two users and by assigning specific tasks to these views. Collaborative techniques such as handing over selected parts to the other user would establish a tighter coupling of both users during interaction, which could be very useful. Locking of parts and collaborative, simultaneous manipulation could be beneficial as well. Currently, all participants of the user study used the same kind of stylus-type input device for the interaction with the VEAM system. We are planning to conduct another user study to evaluate the effects of different input devices, such as the NOYO [14], on user task performance.
Acknowledgments
We would like to thank all subjects in the experiment for their time and effort. Part of this work was done within the Intelligent Networked Manufacturing System (INT-MANUS) project, which is partly funded by the European Commission under the joint priority IST-NMP (NMP2-CT-2005-016550).
References
1. McMahan, R.P., Gorton, D., Gresock, J., McConnell, W., Bowman, D.A.: Separating the effects of level of immersion and 3D interaction techniques. In: VRST 2006: Proceedings of the ACM symposium on Virtual reality software and technology, pp. 108–111. ACM, New York (2006)
2. Narayan, M., Waugh, L., Zhang, X., Bafna, P., Bowman, D.: Quantifying the benefits of immersion for collaboration in virtual environments. In: VRST 2005: Proceedings of the ACM symposium on Virtual reality software and technology, pp. 78–81. ACM, New York (2005)
3. Pinho, M.S., Bowman, D.A., Freitas, C.M.: Cooperative object manipulation in immersive virtual environments: framework and techniques. In: VRST 2002: Proceedings of the ACM symposium on Virtual reality software and technology, pp. 171–178. ACM, New York (2002)
4. Agrawala, M., Beers, A.C., McDowall, I., Fröhlich, B., Bolas, M., Hanrahan, P.: The two-user responsive workbench: support for collaboration through individual views of a shared space. In: SIGGRAPH 1997: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pp. 327–332. ACM Press/Addison-Wesley Publishing Co., New York (1997)
5. Cruz-Neira, C., Sandin, D.J., DeFanti, T.A.: Surround-screen projection-based virtual reality: the design and implementation of the CAVE. In: SIGGRAPH 1993: Proceedings of the 20th annual conference on Computer graphics and interactive techniques, pp. 135–142. ACM Press, New York (1993)
6. Fröhlich, B., Blach, R., Stefani, O., Hochstrate, J., Bues, M., Hoffmann, J., Krüger, K.: Implementing multi-viewer stereo displays. In: Proceedings of WSCG 2005, Plzen, Czech Republic (2005)
7. Riege, K., Holtkämper, T., Wesche, G., Fröhlich, B.: The bent pick ray: an extended pointing technique for multi-user interaction. In: 3DUI 2006: Proceedings of the 3D User Interfaces, Washington, DC, USA, pp. 62–65. IEEE Computer Society Press, Los Alamitos (2006)
8. Swan II, J.E., Gabbard, J.L., Hix, D., Schulman, R.S., Kim, K.P.: A comparative study of user performance in a map-based virtual environment. In: VR 2003: Proceedings of the IEEE Virtual Reality 2003, Washington, DC, USA, p. 259. IEEE Computer Society Press, Los Alamitos (2003)
9. Figueroa, P., Bischof, W.F., Boulanger, P., Hoover, H.J.: Efficient comparison of platform alternatives in interactive virtual reality applications. Int. J. Hum.-Comput. Stud., 73–103 (2005)
10. Ni, T., Bowman, D.A., Chen, J.: Increased display size and resolution improve task performance in information-rich virtual environments. In: GI 2006: Proceedings of Graphics Interface 2006, Toronto, Ont., Canada, Canadian Information Processing Society, pp. 139–146 (2006)
11. Foursa, M., Wesche, G., d’Angelo, D., Bogen, M., Herpers, R.: A two-user virtual environment for rapid assembly of product models within an integrated process chain. In: Proceedings of X Symposium on Virtual and Augmented Reality, João Pessoa, Brazil, pp. 143–150 (2008)
12. Foursa, M., Schlegel, T., Meo, F., Praturlon, A.H., Ibarbia, J., Kopácsi, S., Mezgár, I., Sallé, D., Hasenbrink, F.: INT-MANUS: revolutionary controlling of production processes. In: SIGGRAPH 2006: ACM SIGGRAPH 2006 Research posters, p. 161. ACM, New York (2006)
13. Zhai, S.: User performance in relation to 3D input device design. In: SIGGRAPH 1998: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, vol. 4, pp. 50–54. ACM Press, New York (1998)
14. Simon, A., Doulis, M.: NOYO: 6DOF elastic rate control for virtual environments. In: VRST 2004: Proceedings of the ACM symposium on Virtual reality software and technology, pp. 178–181. ACM, New York (2004)
15. Fröhlich, B., Hochstrate, J., Skuk, V., Huckauf, A.: The globefish and the globemouse: two new six degree of freedom input devices for graphics applications. In: CHI 2006: Proceedings of the SIGCHI conference on Human Factors in computing systems, pp. 191–199. ACM, New York (2006)
Simple Feedforward Control for Responsive Motion Capture-Driven Simulations
Rubens F. Nunes 1,2, Creto A. Vidal 1, Joaquim B. Cavalcante-Neto 1, and Victor B. Zordan 2
1 CRAb – Federal University of Ceará, Brazil
2 rgl – University of California, Riverside, USA
{rubens,cvidal,joaquim}@lia.ufc.br, [email protected]
Abstract. Combining physically based simulation and motion capture data for animation is becoming a popular alternative to large motion databases for rich character motion. In this paper, our focus is on adapting motion-captured sequences for character response to external perturbations. Our technique is similar to approaches presented in the literature, but we propose a novel, straightforward way of computing feedforward control. While alternatives such as inverse dynamics and feedback error learning (FEL) exist, they are more complicated and require offline processing in contrast to our method which uses an auxiliary dynamic simulation to compute feedforward torques. Our method is simple, general, efficient, and can be performed at runtime. These claims are demonstrated through various experimental results of simulated impacts.
1 Introduction
Demand for realistic animation of characters is increasing in the game and entertainment industries, as well as in various other areas such as biomechanics, robotics and rehabilitation, and important advances have occurred in recent decades [1]. Motion capture is a popular tool for generating realistic animation. However, adapting the data easily to unpredicted interactions with the environment without introducing artifacts is still a challenge. In contrast, dynamic simulation produces interactive responses to perturbations caused by the environment directly, by applying collision forces to the model. Nevertheless, the difficulty of designing controllers for complex models, such as humanoids, that are capable of producing motions as realistic as the ones obtained by motion capture limits its applicability. A growing body of work proposes methods for combining these two techniques [2,3,4,5,6,7] (among others). In 2002, Zordan and Hodgins [2] introduced feedback controllers for reactive characters capable of tracking captured motions. In their technique, proportional-derivative (PD) feedback controllers calculate torques for tracking the data while also allowing the simulations to react to external stimuli. One issue with this technique is that stiff feedback controllers are required for good tracking. According to [3], biological systems use feedforward as the predominant control input and only use low-gain feedback to correct small
deviations from the desired trajectory. They show that low-gain feedback control is capable of tracking the captured motions faithfully when a feedforward term is added to the control equation. Thus, during impacts, reactions are simulated with low stiffness without the need to update the gains (as done in [2]). In this paper, our focus is on adapting motion-captured sequences for character response to external perturbations. Our solution is based on the framework proposed by [3], but we replace the method of computing the feedforward term with a simple and efficient method which can be performed at runtime. Yin et al. [3] calculate their feedforward term by using inverse dynamics in a preprocessing stage. Our method calculates the feedforward term through the use of high-gain feedback controllers applied to an ‘auxiliary’ simulation executed in parallel with the main simulation of the character. Unexpected impacts are considered only in the main simulation, which uses low-gain feedback controllers. Our method is effective for handling small external disturbances that do not lead to changes in balance, and it does not depend on the balance strategy employed. In addition, we introduce several novel contributions throughout our investigation, e.g., an automatic method for scaling control gains, the addition of a purposeful delay to produce more natural reactions, and the introduction of a forward internal model (as described in [8]) to aid in correcting disturbances from expected impacts.
2 Related Work
Several researchers have proposed methods for combining dynamics and motion capture. Shapiro et al. [4] created a framework based on [9] able to manage both dynamic and kinematic controllers. Zordan et al. [5] used dynamic simulation to produce transitions between captured motions in order to simulate reactions to impacts. Recent papers have built controllers for tracking captured motions while maintaining balance [6,10,7]. As in [2,3], discussed in the introduction, some other works also proposed strategies to handle the specific problem addressed in this article. Pollard and Zordan [11] use an approach similar to that proposed by Yin et al. to control hand grasps. A feedforward term is pre-computed by inverse dynamics to offset the influence of the arm movement and gravity. In 2007, Yin et al. [6] proposed to calculate the feedforward term by using a variation of Feedback Error Learning (FEL) which is specific to cyclic movements such as walking, but did not address more general movements such as the non-cyclic ones we use in our example. In contrast to our approach, general FEL requires multiple passes to accumulate a stable feedback error value and it is therefore less conducive to online motion generation tasks. Yin et al. get around this by exploiting the repetition of cyclic motions. Da Silva et al. [7] calculate a feedforward term by formulating and solving a Quadratic Programming (QP) problem. Because of the complexity of the optimization, they update the feedforward term one to two orders of magnitude more slowly than the simulation dynamics. In contrast, our feedforward term is computed in lockstep with the simulation at interactive rates.
Some related works offer alternatives to selecting gains for torque calculations. In order to track the captured motions, Wrotek et al. [12] propose applying external feedback torques directly at the bodies, based on world-space error, instead of joint-space error. They report that with this technique the gains could be defined in an easier and more flexible way. However, the approach requires the gains to be updated using the same strategy presented in [2] in order to address reactions to impacts. Allen et al. [13] presented an automatic method to calculate the gains, by considering time information (synchronization) provided by the user. However, when an impact occurs, their approach to guarantee timing yields a character that changes its stiffness indirectly, based on the size of the disturbance. This effect is not readily observed in biological systems and thus we opt to maintain stiffness in lieu of guaranteeing the length of the resulting response during impact. Kry and Pai [14] presented a technique in which contact forces also are captured along with the movement. This information is used to estimate the tension on the joints. In contrast, our method does not require access to such information.
3 Approach Overview
The proposed method in this paper estimates the feedforward control present in biological systems through an auxiliary dynamic simulation performed simultaneously with the main simulation. The auxiliary simulation uses high-gain feedback controllers in order to produce torques that make the virtual human track the captured motions faithfully. This process corresponds to the pre-processing done by biological systems when performing well-trained motions. In a sense, the auxiliary simulation acts as an internal model (see [8]) which perfectly matches the dynamics of the character. The feedforward terms used in the main simulation are exactly the torques produced in the auxiliary simulation. Thus, the torques applied to the simulated character during the main simulation consist of the sum of the feedforward terms and the torques produced by the low-gain feedback controllers used in the main simulation. The general scheme shown in Figure 1 illustrates the steps of the proposed method:
1) obtaining the model's desired state, qd(t), from the captured motion;
2) calculation, using high-gain feedback controllers, of the torques τaux used in the auxiliary simulation to correct for the error between the current state, qaux(t), and the desired state, qd(t);
3) calculation, using low-gain feedback controllers, of the torques τ used in the main simulation to correct for the error between the current state, q(t), and the desired state, qd(t);
4) application of the torques τaux to the model in the auxiliary simulation and sending τaux (the feedforward torques, ff) to the main simulation;
5) application of the sum τ + ff to the model in the main simulation;
6) integration in time of the two simulations, considering the external perturbations, and updating the current states of the model.
Note that the high-gain feedback controllers only have access to the current state of the auxiliary simulation. Therefore, the resulting torques calculated by
Fig. 1. Overview of the proposed method
these controllers are feedback torques for the auxiliary simulation, but feedforward torques for the main simulation, since these controllers have no access to the state of the main simulation.
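As a rough illustration of this control flow, the Python sketch below runs the two simulations in lockstep for a toy 1-DOF joint with unit inertia (the real system advances a full ODE rigid-body simulation instead). The gain ratio between the two controllers follows the one reported later in the paper (ks_main = 0.05·ks_aux, kd_main = kd_aux); the specific gain values, time step, trajectory and impact are invented for the toy.

```python
import numpy as np

def pd(q, qdot, q_des, qdot_des, ks, kd):
    """Proportional-derivative tracking torque."""
    return ks * (q_des - q) + kd * (qdot_des - qdot)

def integrate(q, qdot, torque, ext=0.0, dt=1e-3, inertia=1.0):
    """Explicit Euler step of a single 1-DOF joint (toy stand-in for the rigid-body simulator)."""
    qddot = (torque + ext) / inertia
    return q + qdot * dt, qdot + qddot * dt

ks_aux, kd_aux = 400.0, 40.0                  # high-gain auxiliary controller (made-up values)
ks_main, kd_main = 0.05 * ks_aux, kd_aux      # low-gain main controller (ratio from Sect. 6)

q_aux = qd_aux = q_main = qd_main = 0.0
for i in range(2000):
    q_des = np.sin(i * 1e-3)                  # stand-in for the mocap joint trajectory
    # steps 2 and 4: auxiliary simulation, high-gain feedback -> feedforward torque
    ff = pd(q_aux, qd_aux, q_des, 0.0, ks_aux, kd_aux)
    q_aux, qd_aux = integrate(q_aux, qd_aux, ff)
    # steps 3 and 5: main simulation, low-gain feedback + feedforward; the desired
    # velocity is read from the auxiliary simulation, and impacts hit only this simulation.
    fb = pd(q_main, qd_main, q_des, qd_aux, ks_main, kd_main)
    push = 50.0 if 500 <= i < 520 else 0.0    # an unexpected external perturbation
    q_main, qd_main = integrate(q_main, qd_main, fb + ff, ext=push)
```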
4 Feedforward and Feedback Torque Calculations
Non-linear PD controllers are used both in the auxiliary and in the main simulation to correct for the error between the current state and the desired state of the model, through the following expression:
τ = I [ ks f(θe) (θd − θ) + kd (θ̇d − θ̇) ]    (1)
where θ and θd are the current and desired joint angles; θ̇ and θ̇d are the current and desired joint velocities; ks and kd are the proportional (spring) and derivative (damper) gains; I is the inertia matrix of the outboard bodies (that is, the body chain affected by the joint) for each joint; θe = (θd − θ); and f(θe) is a non-linear factor defined as a function of θe. To limit the strength of the character, τ is capped during the dynamic simulation. The values of θd used in both simulations are obtained directly from the captured motions. The values of θ̇d can be estimated by finite differences, also from the captured data. However, in the auxiliary simulation, due to the high-gain feedback, these values can be reduced to zero without reducing the quality with which the auxiliary model tracks the captured motions. On the other hand, in
the main simulation, we found that, due to the low-gain feedback, the information about the desired velocity is important. In our implementation, desired velocity information for the main simulation is obtained conveniently and directly from the auxiliary simulation, since the auxiliary model tracks the captured data faithfully. Therefore, the values of θ̇d taken from the auxiliary simulation are less sensitive to noise than finite-difference estimates obtained from the captured data. Selecting and scaling gain values. Using the inertia scaling introduced by Zordan and Hodgins [2] allows only one pair of gains (ks and kd) to be specified for the whole body, eliminating the laborious process of specifying gain values for each joint. Allen et al. [13] describe the details of the calculation of this factor, which must be updated at each simulation step. With this scaling, one pair of gains (ks and kd) has to be defined by the animator for each simulation (auxiliary and main). As in [13], we note the influence of the ratio kd : ks on the simulated motions: 1) if kd : ks is low, the model tends to oscillate in the neighborhood of the desired configuration; and 2) if kd : ks is high, the model tends to be overdamped and to move slowly, as if in water. Considering these observations, it is ideal for the ratio kd : ks to be high in the neighborhood of the desired configuration, to prevent oscillations, and low when the model is far from the desired configuration, to allow it to return to tight tracking in a timely fashion. To accomplish these two goals simultaneously, we add the scalar f(θe) to the error calculation. Fattal and Lischinski [15] use non-linear PD controllers in which the ratio is updated by modifying kd. We update the ratio kd : ks by scaling the term ks instead of kd, which intuitively has the desired effect of increasing the stiffness as the error increases, as when an impact occurs. To scale ks, we employ the term f(θe). In addition, we use this term to introduce a purposeful delay in the reaction. After receiving an impact, biological systems have a small delay in reacting, due to the neural flow of information and the delay of the synapses [3]. The delay is included in the definition of f(θe),
f(θe) = 1 if |θe| ≤ λ;   |θe|/λ if λ < |θe| ≤ Mλ;   M otherwise.    (2)
where λ corresponds to the delay in the reaction and M is the maximum allowable value of the factor. θe is measured in radians, λ must be positive and M must be greater than 1. Analyzing the function, we have: f (θe ) ≥ 1; f (θe ) is linear, for λ ≤ |θe | ≤ M λ; |θe |/λ = 1, when |θe | = λ; and |θe |/λ = M , when |θe | = M λ. In all tests, λ = 0.1 and M = 5.
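The sketch below implements equations (1) and (2) for a single joint; λ = 0.1 and M = 5 are the values used in the paper's tests, while the gains, the scalar inertia and the torque cap are illustrative placeholders.

```python
import numpy as np

def f_nonlinear(theta_e: float, lam: float = 0.1, M: float = 5.0) -> float:
    """Non-linear scale factor of eq. (2): 1 inside the dead band |theta_e| <= lambda,
    then linear in |theta_e|/lambda up to the cap M."""
    a = abs(theta_e)
    if a <= lam:
        return 1.0
    return min(a / lam, M)

def pd_torque(theta, theta_dot, theta_des, theta_dot_des,
              inertia=1.0, ks=400.0, kd=40.0, tau_max=200.0) -> float:
    """Non-linear PD controller of eq. (1), with the torque capped to limit the
    character's strength; 'inertia' stands in for the outboard-chain inertia scaling."""
    theta_e = theta_des - theta
    tau = inertia * (ks * f_nonlinear(theta_e) * theta_e + kd * (theta_dot_des - theta_dot))
    return float(np.clip(tau, -tau_max, tau_max))
```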
5 Forward Internal Model for Expected Reactions
Based on concepts introduced by Kawato in [8], we propose an additional use of the auxiliary simulation. Kawato discussed the existence of two forms of internal model, an inverse internal model which computes motor commands (in our case torques) and a forward internal model which predicts the effect of those motor
commands based on an approximation of the dynamic state. We can draw interesting parallels between our auxiliary simulation and Kawato’s internal models, but one insight that comes from this breakdown is that if the character knows about the dynamics of an impact, an internal forward model could be used for predicting and correcting for errors resulting from the disturbance. Following this perspective, we introduce the notion of an expected impact which is “anticipated” by applying the collision to the auxiliary model. For an expected reaction, as when a person braces for a known impending impact, our auxiliary simulation accounts for the disturbance by simulating the impact and incorporating its correction into the feedforward terms, τaux . The effect in the main simulation is an automatic, anticipatory response of specifically the joints neighboring the impact (which see an increase in error in the auxiliary simulation.) Of course there are other forms of anticipation and trained reactions (for example, see [16]) but this simple adjustment allows us to automatically generate an important aspect of anticipation to expected interactions. Without such a strategy, the problem of tuning the character (as done in [2]) poses a number of difficult issues: which joints should have their gains increased?; how much should the increase of each gain be?; and at which instant should the gain values return to the original low values? Consider, in practice, there could be situations where both unexpected and expected impacts occur simultaneously, and, it would be laborious to derive answers to these questions in order to handle all cases properly for those types of situations. In the method proposed here, the impacts expected by the model result in reactions that are automatically, intentionally more rigid in the region of the disturbance, without the need of modifying the gains.
6 Implementation
Our system uses the Open Dynamics Engine (ODE) [17] to perform the dynamic simulation and collision detection. Our humanoid model consists of 14 rigid bodies connected with ball joints, totaling 39 internal degrees of freedom plus 6 global degrees of freedom at the model’s root (the pelvis). The animation obtained from the system achieves interactive rates using OpenGL and hardware rendering. On a 1.8 GHz PC with 512 MB of RAM and an NVIDIA FX 5200 video card, the average frame rate is 10 fps when rendering one frame for each thirtieth of a second of simulation. The character’s perceived compliance and motion capture tracking quality both depend on the choice of gain values (ks and kd) assigned to the two simulations. Although manually defined, the choice of the small number of gain values (four in total) is not difficult and need only be done once. The values are fixed in all test results in this paper. We found useful gain ratios relating the feedback and feedforward torque controllers to be $k_s^{main} = 0.05\,k_s^{aux}$ and $k_d^{main} = k_d^{aux}$. Note that the gain values in the auxiliary simulation can be defined first, since the gains in the main simulation do not affect that choice.
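To illustrate how the two simulations interact, a minimal per-step torque computation is sketched below under the gain ratios quoted above; the absolute auxiliary gains, array layout and function names are placeholder assumptions rather than the paper's ODE-based implementation:

```python
import numpy as np

# Illustrative gains: the 0.05 ratio and equal kd come from the text;
# the absolute auxiliary values are placeholders, not the paper's numbers.
KS_AUX, KD_AUX = 3000.0, 300.0
KS_MAIN, KD_MAIN = 0.05 * KS_AUX, KD_AUX

def pd(theta, theta_dot, theta_d, theta_dot_d, ks, kd):
    """Joint-space PD torque over arrays of joint angles/velocities."""
    return ks * (theta_d - theta) + kd * (theta_dot_d - theta_dot)

def torques_for_step(q_main, qd_main, q_aux, qd_aux, q_mocap, qd_mocap_est):
    # Feedforward term: the torque the high-gain auxiliary controller applies
    # while tracking the captured motion.
    tau_aux = pd(q_aux, qd_aux, q_mocap, qd_mocap_est, KS_AUX, KD_AUX)
    # Low-gain feedback in the main simulation; desired velocities are taken
    # from the auxiliary simulation rather than differentiating the mocap data.
    tau_fb = pd(q_main, qd_main, q_mocap, qd_aux, KS_MAIN, KD_MAIN)
    return tau_aux + tau_fb
```

Expected impacts would additionally be applied to the auxiliary simulation before its PD torques are computed, so that the feedforward term already contains the anticipatory correction.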
To maintain the model’s balance in the performed tests, external forces and torques are applied to the model’s root (pelvis) and feet (when they contact the floor) to track the captured motions. These forces and torques are generated by high-gain feedback controllers, which are used in both simulations. Although this strategy is not physically correct, the proposed method does not depend on the balance control approach that is used and therefore allows for other strategies as well. Yin et al. [6] suggest a promising balance strategy for locomotion. However, maintaining balance during highly dynamic (athletic) motions like those presented in this paper remains an open problem.
7 Results
In order to demonstrate the effectiveness of the proposed method, we perform several experiments. For our examples, we use fighting motions, such as punch and kick motions. As the character tracks these captured motions, we create disturbances by throwing balls that collide with the character’s body parts, causing physically simulated impacts – the impacts yield corresponding external forces on the model. This type of interaction is very common in games. The accompanying videos allow a visual assessment of the movements and of the method’s quality. In our first test, several objects are collided with the model while it tracks the captured motions. The model reacts to the external disturbances and returns to tracking the captured motion. Figure 2 shows an example where the model reacts to an impact on the arm and then another on the head. The auxiliary model does not receive the impacts and can be used as a reference for comparison. The original captured motion is shown in the accompanying videos. Our next result compares the capability of the low-gain feedback controllers to track the captured motions without using the feedforward term (Figure 3a) and using the proposed feedforward term (Figure 3b), obtained from the auxiliary simulation, which uses high-gain feedback controllers. The feedforward term computed using the proposed method is effective and allows low-gain feedback controllers to track the captured motions faithfully, with quality similar to the tracking done by the high-gain feedback controllers used in the auxiliary simulation. Finally, we show that the proposed method can handle expected external disturbances easily. Figure 4a illustrates a situation where the model receives an unexpected impact on the back of the head and, simultaneously, an expected impact on the leg. The expected impact on the leg is also applied to the auxiliary model, as illustrated in the figure. In Figure 4b, the same two impacts are both treated as unexpected for purposes of comparison. In the first case, the reaction of the leg is rigid, as desired, while in the second case the knee acts compliant. The proposed method has produced effective and visually pleasing reactions to impacts on all parts of the model, even when more than one impact occurs simultaneously or several impacts occur in sequence. Moreover, the new method handled expected external disturbances in a simple and effective manner.
Fig. 2. The model is unexpectedly hit on the arm and then on the head, reacting to the impacts and returning to tracking the captured motion in a flexible and natural way
Fig. 3. (a) Low-gain feedback control without the feedforward term; (b) low-gain feedback control with the feedforward term; (c) original captured motion
Fig. 4. The model receives, simultaneously, an unexpected impact behind the head and an (a – expected; b – unexpected) impact on the leg
8 Discussion and Conclusions
This paper addresses the problem of simulating realistic reactions to external disturbances in captured motions in an easy and automatic fashion. The proposed method employs an auxiliary simulation which uses high-gain feedback control to determine the feedforward torque for the character (main) simulation. The controller allows the feedforward term to be obtained at runtime in a general way. The proposed method handles natural reactions to both unexpected and expected external disturbances.
Advantages of the proposed technique. Our method presents some absolute advantages and some advantages shared with certain previous methods. Advantages of our method can be listed as follows:
– Simple computation of the feedforward term. The computation done by the proposed method is exactly the same as the computation of the feedback torques, but in an auxiliary model; thus very minimal implementation is required beyond feedback control alone.
– Tracking general trajectories. As opposed to inverse dynamics, feedback controllers produce appropriate torques for general trajectories, including trajectories that do not respect the laws of physics – such as transitions between motions [18,9,2,6] and keyframe inputs (e.g. pose control) [19,6,13]. Using inverse dynamics on these trajectories can lead to unnaturally large or undefined torque values.
– Online feedforward control. The proposed method is carried out at runtime. This feature allows the system to handle situations in which the trajectories cannot be predicted, such as trajectories synthesized by higher-level controllers based on sensors or user input [9]. These trajectories cannot be pre-processed in order to obtain feedforward torques using offline techniques.
– Reactions to expected impacts. The proposed method handles expected reactions in a unified manner that is consistent with findings in motor control. By considering expected impacts in the auxiliary simulation, the feedforward term automatically anticipates them with a behavior visually similar to bracing for an impact.
There are several exciting areas of future work for this project. It would be interesting to test the proposed method on synthesized motion, for example, derived from simple transitions between captured motions or generated by traversing motion graphs [20]. Also, it would be interesting to define a general criterion for determining whether or not an impact should be expected (noticed in advance) by the model. Similarly, another direction would be to allow the model to have deliberate anticipatory reactions as in [16], to prevent impacts to vulnerable parts of the body.
References 1. Magnenat-Thalmann, N., Thalmann, D.: Virtual humans: thirty years of research, what next? The Visual Computer 21, 997–1015 (2005) 2. Zordan, V.B., Hodgins, J.K.: Motion capture-driven simulations that hit and react. In: Proceedings of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, San Antonio, Texas, USA, pp. 89–96 (2002) 3. Yin, K., Cline, M.B., Pai, D.K.: Motion perturbation based on simple neuromotor control models. In: Proceedings of IEEE Pacific Conference on Computer Graphics and Applications, Canmore, Alberta, CAN, pp. 445–449 (2003) 4. Shapiro, A., Pighin, F., Faloutsos, P.: Hybrid control for interactive character animation. In: Proceedings of IEEE Pacific Conference on Computer Graphics and Applications, Canmore, Alberta, CAN, pp. 455–461 (2003) 5. Zordan, V.B., Majkowska, A., Chiu, B., Fast, M.: Dynamic response for motion capture animation. ACM Transactions on Graphics 24, 697–701 (2005) 6. Yin, K., Loken, K., van de Panne, M.: Simbicon: Simple biped locomotion control. ACM Transactions on Graphics 26 article 105 (2007) 7. Da Silva, M., Abe, Y., Popovic, J.: Simulation of human motion data using shorthorizon model-predictive control. Computer Graphics Forum 27 (2008) 8. Kawato, M.: Internal models for motor control and trajectory planning. Current Opinion in Neurobiology 9, 718–727 (1999) 9. Faloutsos, P., van de Panne, M., Terzopoulos, D.: Composable controllers for physics-based character animation. In: Proceedings of ACM SIGGRAPH, Los Angeles, CA, USA, pp. 251–260 (2001) 10. Sok, K.W., Kim, M., Lee, J.: Simulating biped behaviors from human motion data. ACM Transactions on Graphics 26 article 107 (2007) 11. Pollard, N.S., Zordan, V.B.: Physically based grasping control from example. In: Proceedings of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, New York, NY, USA, pp. 311–318 (2005) 12. Wrotek, P., Jenkins, O.C., McGuire, M.: Dynamo: Dynamic data-driven character control with adjustable balance. In: Proceedings of ACM SIGGRAPH Video Games Symposium, Boston, USA, pp. 61–70 (2006) 13. Allen, B., Chu, D., Shapiro, A., Faloutsos, P.: On the beat! timing and tension for dynamic characters. In: Proceedings of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, San Diego, CA, USA, pp. 239–247 (2007) 14. Kry, P.G., Pai, D.K.: Interaction capture and synthesis. ACM Transactions on Graphics 25, 872–880 (2006) 15. Fattal, R., Lischinski, D.: Pose controlled physically based motion. Computer Graphics Forum 25, 777–787 (2006) 16. Zordan, V.B., Macchietto, A., Medina, J., Soriano, M., Wu, C., Metoyer, R., Rose, R.: Anticipation from example. In: Proceedings of ACM Virtual Reality Software and Technology, Newport Beach, CA, USA, pp. 81–84 (2007) 17. ODE: Open dynamics engine (2008), http://www.ode.org/ 18. Wooten, W.L., Hodgins, J.K.: Simulation of leaping, tumbling, landing, and balancing humans. In: Proceedings of IEEE International Conference on Robotics and Automation, San Francisco, USA, pp. 656–662 (2000) 19. Van de Panne, M.: Parameterized gait synthesis. IEEE Computer Graphics and Applications 16, 40–49 (1996) 20. Kovar, L., Gleicher, M., Pighin, F.: Motion graphs. ACM Transactions on Graphics 21, 473–482 (2002)
Markerless Vision-Based Tracking of Partially Known 3D Scenes for Outdoor Augmented Reality Applications Fakhreddine Ababsa, Jean-Yves Didier, Imane Zendjebil, and Malik Mallem IBISC Laboratory – CNRS FRE 3190. University of Evry Val d’Essonne 40, rue du Pelvoux 91020 Evry, France {ababsa,didier,zendjebil,mallem}@iup.univ-evry.fr
Abstract. This paper presents a new robust and reliable markerless camera tracking system for outdoor augmented reality using only a mobile handheld camera. The proposed method is particularly efficient for partially known 3D scenes where only an incomplete 3D model of the outdoor environment is available. Indeed, the system combines an edge-based tracker with a sparse 3D reconstruction of the real-world environment to continually perform the camera tracking even if the model-based tracker fails. Experiments on real data were carried out and demonstrate the robustness of our approach to occlusions and scene changes.
1 Introduction Augmented Reality Systems (ARS) attempt to enhance humans’ perception of their indoor and outdoor working and living environments by complementing their senses with virtual input. Tracking computation refers to the problem of estimating the position and orientation of the ARS user's viewpoint, assuming the user carries a wearable camera. Tracking computation is crucial in order to display the composed images properly and maintain correct registration of the real and virtual worlds. Generally, outdoor augmented reality systems use GPS and inertial sensors to measure, respectively, the position and the orientation of the camera’s viewpoint. However, in reality several events can degrade the quality of the GPS signal, such as shadowing and occlusions from buildings, self-occlusions (hand, body) and multiple signal reflections. In the same way, inertial sensors can drift and are disturbed by local magnetic fields present in the working area. Vision-based localization provides an interesting alternative because of its accuracy and robustness against these environmental influences. Several vision-based markerless tracking approaches have been developed in recent years. Among them, the model-based tracking methods are the most interesting for ARS and give the best results [1],[2],[3]. The main idea of the model-based techniques is to identify, in the images, features from the object model (points, lines, etc.). One can note two main types of approaches for model-based tracking algorithms, namely the edge-based and the texture-based trackers. Both have complementary advantages and drawbacks. An interesting idea is then to combine both approaches in the same process [4],[5]. For example, in Vacchetti et al. [6] the proposed method combines edge feature matching and the use of keyframes to handle any kind of camera
displacement. The model information is used to track every aspect of the target object. A set of reference keyframes is created off-line and, if there are too few of them, new frames can be automatically added online. In the same way, the framework presented in Pressigout et al. [7] fuses a classical model-based approach based on edge extraction and a temporal matching relying on texture analysis into a single nonlinear objective function that then has to be minimized. Tracking is formulated in terms of a full-scale non-linear optimization. All these approaches require that the natural points to be tracked belong to the textured area of the 3D model. However, in outdoor applications, the user needs to move freely around a large-scale environment for which it is impossible to have a complete 3D model, so classical hybrid approaches do not work in this case. Moreover, large variations in lighting, scaling and occlusions often occur in such environments. So, developing a markerless tracking system robust against these problems is a very challenging task. In this paper, we propose an original solution combining two complementary tracking methods: an edge-based tracker and dynamic sparse 3D key point tracking. In order to improve the accuracy and robustness of the camera pose estimation, we have integrated an M-estimator into the optimization process. The resulting pose computation algorithm is thus able to deal efficiently with incorrectly tracked features. The sparse key points are constructed dynamically during the motion and tracked in the current frame in order to maintain an estimate of the camera pose even if the 3D model of the scene becomes non-visible in the image. We propose to use the SIFT algorithm [8] in order to extract a stable set of natural key points which are invariant to image scaling and rotation, and partially invariant to changes in illumination and viewpoint. The remainder of this paper is organized as follows. In Section 2, we give the details of our real-time 3D line and 3D sparse key point trackers. Section 3 presents the robust formulation of the camera pose estimation problem when using point and line features. System performance and results on real data are discussed in Section 4, followed by the conclusion in Section 5.
2 Description of Our Tracking Framework Our markerless tracking framework is composed of two natural feature trackers: a 3D line tracker and a sparse 3D key point tracker. 2.1 3D Line Tracker Our line tracking approach is based on the moving edges (ME) algorithm [9]. The ME method is very interesting for our application because edge extraction is not required; only point coordinates and image intensities are manipulated. This leads to real-time computation. As described in Figure 1, the line model is projected with the initial camera pose into the current image plane. Let Pi be the sampled points along the 3D line model. The Pi are projected into the 2D image points pi. Then, for each point pi, a local search is performed on a search line in the direction of the normal to the reprojected contour. We attribute to pi one correspondent qi, which is the local extremum of the gradient along the search line.
Fig. 1. Line tracking approach (sample points along the projected model line, search lines along the contour normal, and the edge correspondents found on them)
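A sketch of this 1-D search is given below (our own illustrative code, not the authors' implementation; the search length of ±10 pixels and the use of a precomputed gradient-magnitude image are assumptions):

```python
import numpy as np

def search_edge_correspondent(gradient_mag, p, normal, half_length=10):
    """1-D search along the contour normal for the local gradient extremum.

    gradient_mag : 2-D array of gradient magnitudes of the current image
    p            : (x, y) projected sample point on the model line
    normal       : unit 2-D vector normal to the reprojected contour
    """
    h, w = gradient_mag.shape
    best, best_val = None, -np.inf
    for t in range(-half_length, half_length + 1):
        x = int(round(p[0] + t * normal[0]))
        y = int(round(p[1] + t * normal[1]))
        if 0 <= x < w and 0 <= y < h and gradient_mag[y, x] > best_val:
            best_val, best = gradient_mag[y, x], (x, y)
    return best
```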
In our approach we only need to find two image points (q1 and q2) belonging to the object line. This property makes our line tracking algorithm robust to severe occlusion. So, when partial occlusion occurs, it is sufficient to detect small parts of the model edges to estimate the camera pose. 2.2 3D Sparse Key Points Tracker The goal of this method is to incrementally construct a sparse set of 3D points of the viewed scene. These points are then tracked in the current frame and used in the camera pose estimation process. The algorithm begins by determining the camera poses of the first two images. Local point features are extracted and matched from these two calibrated frames, and 3D key points are constructed by triangulation using the computed camera parameters. These 3D points are used to initialize the sparse key points tracker. In addition, as the system operates in an incremental way, the new matched points at the current frame are automatically triangulated and their corresponding 3D points are added to the tracking. The effectiveness of this method depends on the robustness of the feature point extraction and matching algorithms. So, we propose to use stable natural features generated by the SIFT algorithm [8]. The strength of
Fig. 2. SIFT key points extracted from the Saumur Castle image
these features is their invariance to image scaling and rotation, and their partial invariance to changes in illumination and viewpoint, which makes them suitable for outdoor environments. Moreover, SIFT features are present in abundance (see Figure 2) over a large range of image scales, which allows efficient feature tracking in the presence of occlusions. The SIFT algorithm assigns to the extracted key points an image location, a scale, an orientation (in the range [-π,+π]) and a feature descriptor vector. Feature matching is performed using the Euclidean distance criterion. Thus, two features extracted from two successive images are matched if the Euclidean distance between their descriptor vectors is minimal. At this point, we establish several 2D-3D point correspondences which are used to estimate the camera pose at the current frame.
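For reference, the following sketch performs SIFT extraction and nearest-neighbour descriptor matching with a distance-ratio test using today's OpenCV Python bindings; the exact API calls postdate the paper, and the 0.4 ratio (mentioned in Section 4) is used here purely for illustration, so treat this as a sketch rather than the authors' code:

```python
import cv2

def match_sift(img1, img2, ratio=0.4):
    """Detect SIFT keypoints in two frames and keep matches that pass the
    nearest/second-nearest distance-ratio test."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for m, n in matcher.knnMatch(des1, des2, k=2):
        if m.distance < ratio * n.distance:          # ratio test
            good.append((kp1[m.queryIdx].pt, kp2[m.trainIdx].pt))
    return good   # list of (point in frame 1, point in frame 2)
```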
3 Robust Camera Pose Problem Formulation Throughout this paper, we assume a calibrated camera and a perspective projection model. If a point has coordinates $(x, y, z)^t$ in the coordinate frame of the camera, its projection onto the image plane is $(x/z,\ y/z,\ 1)^t$. In this section, we present the constraints for camera pose determination when using point and line features. 3.1 Point Constraint
Let $p_i = (x_i, y_i, z_i)^t$, $i = 1, \ldots, n$, $n \ge 3$, be a set of 3D non-collinear reference points defined in the world reference frame; the corresponding camera-space coordinates $q_i = (x_i', y_i', z_i')^t$ are given by
\[
q_i = R\,p_i + T \qquad (1)
\]
where $R = (r_1^t, r_2^t, r_3^t)^t$ and $T = (t_x, t_y, t_z)^t$ are a rotation matrix and a translation vector, respectively. R and T describe the rigid body transformation from the world coordinate system to the camera coordinate system and are precisely the parameters associated with the camera pose problem.
Let the image point $g_i = (u_i, v_i, 1)^t$ be the projection of $p_i$ on the normalized image plane. Using the camera pinhole model, the relationship between $g_i$ and $p_i$ is given by
\[
g_i = \frac{1}{r_3^t p_i + t_z}\,(R\,p_i + T) \qquad (2)
\]
which is known as the collinearity equation. The point constraint corresponds to the image-space error; it relates the 3D reference points, their corresponding 2D extracted image points and the camera pose parameters as follows:
\[
E_i^p = \left(\hat{u}_i - \frac{r_1^t p_i + t_x}{r_3^t p_i + t_z}\right)^2 + \left(\hat{v}_i - \frac{r_2^t p_i + t_y}{r_3^t p_i + t_z}\right)^2 \qquad (3)
\]
where $\hat{m}_i = (\hat{u}_i, \hat{v}_i, 1)^t$ are the observed image points. 3.2 Line Constraint Given correspondences between 3D lines and 2D lines found in the image, the goal is to find the rotation and translation matrices which map the world coordinate system to the camera coordinate system. Let L be an object line. Several representations for a 3D line have been proposed. In our approach, we represent the 3D line L by its two end-points $p_1$ and $p_2$ (see Figure 3). The point $p_i$ in world coordinates can be expressed in the camera frame by equation (1).
Fig. 3. Perspective projection of 3D line
Let $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ be the camera coordinates of the end-points $p_1$ and $p_2$, which project onto the image plane at $m_1$ and $m_2$ respectively. The projection plane formed by the image line $(m_1 m_2)$ is the plane $(O\,m_1 m_2)$, and the 3D line $L_1$ must lie in this plane. The normal $\vec{N}$ to the projection plane is given by
\[
\vec{N} = \vec{n}_1 \times \vec{n}_2 \qquad (4)
\]
where $\vec{n}_1$ and $\vec{n}_2$ are the optical rays of the image points $m_1$ and $m_2$. Thus the 3D line constraint can be formulated as
\[
E_i^l = \vec{N} \cdot (R\,p_i + T) \qquad (5)
\]
Markerless Vision-Based Tracking of Partially Known 3D Scenes
503
The 3D line constraint expresses the fact that any point on the 3D line, in camera coordinates, must ideally lie in the projection plane. This constraint relates both rotation and translation pose parameters to the 3D model and 2D image lines. 3.3 Robust Camera Pose Estimation Since the number of camera pose parameters is 6 (3 for rotation and 3 for translation), the pose problem can be solved if at least 3 feature (point or line) correspondences are available. In the general case, if N and M are the numbers of point and line correspondences, respectively, the camera pose problem corresponds to minimizing the following error function:
\[
f_1(R, T) = \sum_{i=1}^{N} \bigl(E_i^p\bigr)^2 + \sum_{i=1}^{M} \bigl(E_i^l\bigr)^2 \qquad (6)
\]
where $E_i^p$ and $E_i^l$ correspond to the point and line constraints defined in equations (3) and (5), respectively. Traditionally, this minimization is performed using a least-squares approach [10],[11],[12]. However, when outliers are present in the measurements, a robust estimation is required. In the pose problem, outliers occur either due to incorrect 2D-3D correspondences or because parts of the 3D model are incorrect. Historically, several approaches to robust estimation were proposed, including R-estimators and L-estimators. However, M-estimators now appear to dominate the field as a result of their generality, high breakdown point, and efficiency [13]. M-estimators are a generalization of maximum likelihood estimators; they attempt to define objective functions whose global minimum is not significantly affected by outliers. So, instead of minimizing $\sum_i r_i^2$, where the $r_i$ are residual errors, M-estimators minimize its robust version $\sum_i \rho(r_i)$. The function $\rho(u)$ is designed to be a continuous, symmetric function with its minimum at $u = 0$, and must be monotonically non-decreasing with increasing $|u|$. Many such functions have been proposed. Huber developed the Iteratively Reweighted Least Squares (IRLS) method to minimize this robust error; it converts the M-estimation problem into an equivalent weighted least-squares problem. We have incorporated the IRLS algorithm into our minimization procedure for the camera pose parameters. This is achieved by multiplying the error functions in equation (6) by weights. The objective function to be minimized is then defined by
\[
f_2(R, T) = \sum_{i=1}^{N} w_i^p \bigl(E_i^p\bigr)^2 + \sum_{i=1}^{M} w_i^l \bigl(E_i^l\bigr)^2 \qquad (7)
\]
where the weights $w_i^p$ and $w_i^l$ reflect the confidence in each feature; their computation is described in [14].
Rotations can be represented by several different mathematical entities (matrices, axis-angle, Euler angles, quaternions). However, quaternions have proven very useful for representing rotations due to several advantages over the other representations: they are more compact, less susceptible to round-off errors, and avoid discontinuous jumps. A quaternion representation of the rotation R is written as a normalized four-dimensional vector $q = [q_0\ q_x\ q_y\ q_z]$, where $q_0^2 + q_x^2 + q_y^2 + q_z^2 = 1$. Thus the vector β to estimate consists of six parameters, the three components of the translation vector and the first three components of the unit quaternion q:
\[
\beta = (q_x,\ q_y,\ q_z,\ t_x,\ t_y,\ t_z) \qquad (8)
\]
We have used the Levenberg-Marquardt (LM) algorithm to solve this optimization problem. The parameter update is given by
\[
\beta_{i+1} = \beta_i + \bigl(J' \cdot D \cdot J + \lambda \cdot I\bigr)^{-1} \cdot J' \cdot D \cdot E_i \qquad (9)
\]
where D is a diagonal weighting matrix given by $D = \mathrm{diag}(w_1, w_2, \ldots, w_n)$. The weights $w_i$ are defined by the IRLS algorithm, J is the Jacobian matrix of the objective function $f_2(R,T)$, and $E_i$ the feature errors.
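To show how equations (7)-(9) fit together, a minimal sketch of one weighted, damped update step is given below. The Huber weight function is a common choice used here only as a stand-in (the authors compute their weights following [14]), and residual/Jacobian sign conventions are glossed over:

```python
import numpy as np

def huber_weights(residuals, k=1.345):
    """Illustrative M-estimator weights (Huber); an assumption, since the
    paper's weight computation follows [14]."""
    a = np.abs(residuals)
    w = np.ones_like(a)
    mask = a > k
    w[mask] = k / a[mask]
    return w

def irls_lm_step(beta, residuals, J, lam=1e-3):
    """One IRLS / Levenberg-Marquardt update of the 6-vector beta
    (three quaternion components, three translation components),
    following the form of equation (9)."""
    D = np.diag(huber_weights(residuals))
    A = J.T @ D @ J + lam * np.eye(len(beta))
    return beta + np.linalg.solve(A, J.T @ D @ residuals)
```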
4 Results To evaluate the performance of our tracking system, we considered a complex real scene (Figure 4-a) representing the castle of Saumur. We recorded several image sequences from a moving camera pointing towards the castle (image size is 480×640 pixels).
Fig. 4. Outdoor scene and 3D models used for evaluation: (a) the outdoor scene, showing the modeled part and the area for which no model is available; (b) the incomplete 3D model of the northern tower; (c) its wire-frame model
The whole 3D model of the castle is not available; we have only an incomplete 3D model of the northern tower of the castle (Figure 4-b). Figure 4-c shows the wire-frame model of this tower extracted from its 3D CAD model. In order to initialize the 3D sparse points tracker, we used the poses of the first two images to construct the initial set of 3D key points to be tracked. Figure 5-a shows the SIFT features matched in these initial images; 100% of inliers were obtained. Figure 5-b illustrates the corresponding 3D points computed by triangulation. These 3D points are then combined with the 3D model edges and used by the robust estimator to compute the camera pose.
Fig. 5. 3D key points tracker initialization
The 3D sparse points are dynamically recomputed. Thus, for each new frame, SIFT features are detected and matched with the points detected in the previous frame. The obtained inliers are then triangulated in order to generate new 3D sparse points. In our approach, we need neither to memorize the constructed 3D sparse points nor to create an off-line set of reference key points. The first experiment points out the effectiveness of our hybrid feature tracker in estimating the camera pose under several conditions. Figure 6 shows the castle tracked from different distances and viewpoints. In Figures 6-a and 6-b, both 3D sparse points and visible 3D edges contribute to the camera pose estimation. We can see that the 3D model is correctly reprojected in the images, which demonstrates the tracking success. We have also evaluated the image registration error (in pixels) and noted that the hybrid approach is more accurate than the point-only or line-only tracking approaches. In Figures 6-c and 6-d, the whole 3D model is not visible, but it is still possible to estimate the camera pose using only 3D sparse point tracking. This demonstrates the robustness of our approach to the non-visibility of the 3D model in the scene and to partial occlusion of the model. We also evaluated the accuracy of our algorithm in the presence of point and line outliers. Due to the robustness of the IRLS method towards outliers, the estimated camera pose remains correct even in the presence of a certain quantity of outliers. For example, with 4 outliers (2 points and 2 lines) we obtained a mean error of about 3.11 pixels with a standard deviation of 0.64 pixels.
Fig. 6. Results of our hybrid tracker
Real-time performance of our tracking approach was assessed by carefully measuring the processing time needed to compute the camera pose. We implemented our algorithm on an Intel Pentium IV 2.99 GHz PC. The computation time depends mainly on the number of tracked features (points or lines). An example of typical computation times is given in Table 1. Table 1. Computation times of our markerless tracker
SIFT feature detection and description: 85 ms
Feature matching (< 40 features): 20 ms
Camera pose computation: 15 ms
Frames per second: 8
This table shows that the run time depends largely on the SIFT feature extraction and matching. So, to reach real-time rates (20-25 frames per second), each of the three steps (detection, description, matching) would need to be faster still. In order to speed up the matching step, we fixed the distance ratio at 0.4. This keeps only matches in which the ratio of vector angles from the nearest to the second-nearest neighbour is less than 0.4.
5 Conclusion In this paper we presented an efficient markerless method for camera pose tracking in outdoor environments. The system combines 3D edges and 3D sparse keypoints. A
robust M-estimator is introduced directly into the optimization process by weighting the confidence on each feature. The resulting pose computation algorithm is thus able to deal efficiently with incorrectly tracked features that usually contribute to a compound effect which degrades the system until failure. Experimental results are very encouraging in terms of accuracy. They also prove the efficiency of our approach in handling real scenes with model occlusion, illumination and viewpoint changes. Future research efforts will include optimization of the keypoint tracker in order to improve real-time performance.
Acknowledgements This research is supported by the ANR grant TL06 (Raxenv project: http://raxenv.brgm.fr/).
References 1. Wuest, H., Vial, F., Stricker, D.: Adaptive Line Tracking with Multiple Hypotheses for Augmented Reality. In: Proceedings of ISMAR 2005, Vienna, Austria, pp. 62–69 (2005) 2. Drummond, T., Cipolla, R.: Real-Time Visual Tracking of Complex Structure. IEEE Trans. Pattern Analysis and Machine Intelligence 24(7), 932–946 (2002) 3. Yoon, Y., Kosaka, A., Park, J.B., Kak, A.C.: A New Approach to the Use of Edge Extremities for Model-based Object Tracking. In: Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005), Barcelonna, Spain, pp. 1883–1889 (2005) 4. Vacchetti, L., Lepetit, V., Fua, F.: Combining Edge and Texture Information for RealTime Accurate 3D Camera Tracking. In: Proceedings of ACM/IEEE Int. Symp. on Mixed and Augmented Reality (ISMAR 2004), Arlington, VA, pp. 48–57 (2004) 5. Pressigout, M., Marchand, E.: Real-time 3D Model-Based Tracking: Combining Edge and Texture Information. In: Proceedings of the 2006 IEEE Int. Conf on Robotics and Automation (ICRA 2006), Orlando, Florida, pp. 2726–2731 (2006) 6. Vacchetti, L., Lepetit, V., Fua, F.: Stable Real-Time 3D Tracking Using Online and Offline Information. IEEE Trans. Pattern Anal. Mach. Intell. 26(10), 1385–1391 (2004) 7. Pressigout, M., Marchand, E.: Real-Time Hybrid Tracking using Edge and Texture Information. Int. Journal of Robotics Research, IJRR 26(7), 689–713 (2007) 8. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the 7th International Conference on Computer Vision, pp. 1150–1157 (1999) 9. Bouthemy, P.: A Maximum Likelihood Framework for Determining Moving Edges. IEEE Trans. Pattern Analysis and Machine Intelligence 11(5), 499–511 (1989) 10. Lowe, D.G.: Fitting Parameterized Three-Dimensional Models to Images. IEEE Trans. Pattern Analysis and Machine Intelligence 13, 441–450 (1991) 11. Haralick, R.M.: Pose Estimation From Corresponding Point Data. IEEE Trans. Systems, Man, and Cybernetics 19(6), 1426–1446 (1989) 12. Lu, C.P., Hager, G., Mjolsness, E.: Fast and globally convergent pose estimation from video images. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(6), 610–622 (2000) 13. Huber, P.J.: Robust Statistics. Wiley, New York (1981) 14. Kumar, R., Hanson, A.R.: Robust Methods for Estimating Pose and a Sensitivity Analysis. CVGIP: Image Understanding 60(3), 313–342 (1994)
Multiple Camera, Multiple Person Tracking with Pointing Gesture Recognition in Immersive Environments Anuraag Sridhar and Arcot Sowmya School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia {anuraags,sowmya}@cse.unsw.edu.au
Abstract. In this paper, we present a technique that allows multiple participants within a large-scale immersive, virtual environment to interact with it using pointing gestures. Multiple cameras observing the environment are used along with various computer vision techniques to obtain a 3-D reconstruction of each participant’s finger position, allowing the participants to point at and interact with virtual objects. This use of the pointing gesture provides an intuitive method of interaction, and has many applications in the field of human computer interaction. No markers or special clothing are required to be worn. Furthermore, we show that the system is able to provide robust results in real time.
1 Introduction Within human-computer interaction domains, the area of Immersive Environments has increasingly utilised vision-based tracking. Such systems have proved to be useful in fields ranging from entertainment to education. Installations such as CAVE [1] and AVIE [2] are excellent examples of the current state of the art in immersive technologies. As such systems grow in use and complexity, there is a constant drive to improve the method of interaction available. The large spatial volume and presence of multiple users also necessitate a move away from wired, encumbered technologies. Computer vision based methods are very effective in providing interaction within immersive environments. Utilising vision techniques means that users’ movements and activities can be analyzed from afar, and with current trends in hardware and software, it has become increasingly possible to implement vision-based interaction mechanisms in real time, thereby allowing the replacement of expensive, specialized sensors with cost-efficient cameras. In this paper we focus on the use of pointing gestures for interaction within an immersive environment. The pointing gesture has been shown to be a powerful and intuitive method of interaction with a computer. Current touch screen systems and newly developed multi-touch interaction systems [3], which effectively require the user to point at the screen, demonstrate the usefulness of the pointing gesture, which provides an unencumbered method of augmenting the mouse pointer in mixed reality applications. This paper demonstrates a system which uses multiple cameras to robustly extract pointing gestures of multiple humans within an immersive environment.
Fig. 1. Advanced Visualisation and Interaction Environment (AVIE)
The system extends and integrates several tried and tested techniques from existing research in a unique manner to perform the pointing gesture recognition. The novelty of our research comes from applying several previously researched techniques and algorithms to provide a robust application in a real-world setting. Furthermore, our contribution to the field allows multiple participants to interact with the immersive environment, as opposed to previous systems which only allow one user to interact with the system at any one time. The paper is laid out as follows. Section 2 gives background on our work, including the resources available and current research in pointing gesture recognition systems, Section 3 presents our approach to tackling the problem, Section 4 presents the results, and Section 5 discusses future extensions.
2 Background 2.1 Resources The immersive environment that we work with consists of a cylindrical projection system that is capable of projecting panoramic, stereoscopic images and videos. The cylinder is 10 metres in diameter and 4 metres high, and contains a set of 12 projectors in 6 stereoscopic pairs, located at the centre of the cylinder. For purposes of providing observation support, the system uses 12 near-infrared cameras, with 4 cameras located at the centre pointing directly down, and 8 around the circumference of the cylinder pointing toward the centre. We use Intel’s Open Source Computer Vision Library [4] to do most of the low-level image processing tasks. 2.2 The Pointing Gesture The use of the pointing gesture has been shown to be a simple yet effective technique that uses computer vision to provide human computer interaction. Pavlovic et al. [5] classified pointing gestures as deictic gestures that allow users to communicate with the
computer but, in fact, pointing gestures can also act as manipulative gestures, allowing users to physically move objects around within the artificial world. The pointing gesture has two main benefits over other actions or gestures within immersive environment applications. The first is that pointing gestures provide a clear, intuitive method of interacting with artificial objects that are displayed within the virtual world. It is very easy to provide precise feedback to the user, in the form of a visible cursor, thus allowing greater control over the interaction. The pointing gesture can also be extracted quite easily from human silhouettes using simple ad-hoc computer vision techniques. This is especially beneficial within immersive environments where infrared imagery, rather than visible light imagery, is often used due to the dark nature of the environment during operation. The result is a lack of colour information, which means that silhouette processing algorithms are highly suitable. Furthermore, recognition of the pointing gesture does not require complex gesture recognition techniques such as fitting an articulated body model, which can be computationally expensive. This means that pointing gesture recognition can provide great advantages for real-time applications. 2.3 Related Work The recognition of pointing gestures for human-machine interaction has received considerable interest in recent years. Kahn and Swain [6] present one of the first such systems, which controls a trash-collecting robot by recognizing a human pointing at a piece of trash. Nickel and Stiefelhagen [7] used computer vision techniques to detect pointing gestures in a human-machine interaction system. They used a colour stereo camera pair, combining skin colour detection and a depth disparity map, to determine points representing hands and heads. Moeslund et al. [8] also used colour to determine the positions of hands and the head. Yuan [9] presents a system that recognizes multiple pointing gestures for multiple people, for human computer interaction. Their system is a small-scale augmented reality system utilising a single colour camera as a sensor. As mentioned previously, the use of colour is a luxury not available in larger-scale immersive environment systems, which operate in the dark. Malerczyk et al. [10] present an excellent example of a pointing gesture used to control an immersive museum exhibit. Their system used a very simple sensor mechanism, and only performed pointing gesture recognition in a 2D plane. Kehl and van Gool [11] detected fingertip positions in an immersive environment by constructing a distance function, mapping points on the silhouette contour to their distance from the contour centre of mass. Local maxima in this distance function were identified as finger and head points. Our system extends Kehl and van Gool’s approach by improving the method of fingertip detection in 2-D images. The major drawback of all the systems presented above is that they do not have the capability of dealing with multiple people interacting with large-scale artificial environments. They all require a fully unoccluded view of the human, even when multiple sensors are used. In immersive environment applications, however, it is necessary to allow multiple participants to interact, not only with the display, but also with each other. This is particularly true in projection display systems.
3 System Overview Our research combines multiple tried and tested computer vision techniques in a unique manner to provide pointing gesture recognition for multiple people, which to our knowledge has not been achieved so far. The pointing gestures may then be used for human computer interaction within a real-world immersive environment application. To this aim, the implemented system determines the location of individuals within the immersive environment and reconstructs their head and finger positions in 3-D, so as to identify where they are pointing on the cylindrical screen. The system uses 12 cameras for this purpose: 4 overhead cameras are located at the centre of the area pointing directly down, and 8 side cameras are located on the circumference of the area pointing towards the centre. This separation into overhead and side cameras is a new and unique technique which improves the robustness of both tracking and gesture recognition. The overhead cameras provide an unoccluded view of the entire environment but lack height information, and the side cameras provide accurate 3-D reconstruction but can suffer from occlusion. The cameras are fully calibrated during an offline training phase using a 3-D calibration rig and Tsai’s calibration model [12]. Our system is parallelized such that clusters of four cameras are first processed independently on slave machines, and the 2-D results are sent to a master machine which reconstructs the required 3-D points using multiple view geometry algorithms. Each slave machine first performs segmentation on the incoming images. Due to the passive nature of the cameras and the controlled nature of the environment, we use background subtraction. A simple background model is computed by calculating the mean of a series of images of the background. Segmenting the foreground simply involves calculating the absolute difference between the incoming images and the background model, and thresholding. Once segmentation is complete, the system performs unoccluded tracking in the overhead cameras, and for each tracked blob, localizes fingertip positions. In the side cameras, no tracking is performed, and only fingertip positions are determined based on certain heuristics applied to the human body silhouette.
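A minimal sketch of this segmentation step (mean background model, absolute difference, fixed threshold) is given below; the threshold value and function names are our own illustrative choices, not the authors' settings:

```python
import numpy as np
import cv2

def build_background(frames):
    """Mean of a series of background images (greyscale arrays)."""
    return np.mean(np.stack(frames).astype(np.float32), axis=0)

def segment_foreground(frame, background, thresh=25):
    """Absolute difference against the background model, then threshold."""
    diff = cv2.absdiff(frame.astype(np.float32), background)
    mask = (diff > thresh).astype(np.uint8) * 255   # binary foreground mask
    return mask
```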
4 Tracking Using Overhead Cameras The overhead cameras are first used to perform unoccluded tracking of people within the environment. Tracking is important if the system is to handle multiple people pointing. We found that when two or more people come close enough together, their blobs merge in the overhead cameras. When this occurs, in most cases it is impossible to obtain the pointing gesture, because boundary-based methods fail or give an unacceptable number of false positives. So the main purpose of tracking within our system is to determine when multiple people are close to each other and detected as a single blob, and to turn off fingertip localization in this case. This is a novel technique that ensures that false positives are minimized as much as possible. Although we turn off the pointing gesture recognition in this case, we found that people are only detected as one blob when they are close enough to touch in the real world.
4.1 Dynamic Graph-Based Tracking To perform tracking within our system, we use a merge-split approach [13] whereby, based on the positions and motion of previous blobs, we determine when groups of blobs have come too close together, and are detected as one blob. We maintain a set of tracked blobs from previous frames, and use Kalman filtering [14] to predict the position and velocity of the blobs in subsequent frames. For each tracked blob we maintain two Kalman states, which are predicted and corrected independently. The first state stores the blob’s centre of mass, and the second state stores the width and height of the blob’s bounding box.
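As an illustration of the per-blob state just described, the sketch below keeps two independent constant-velocity Kalman filters per tracked blob (one for the centre of mass, one for the bounding-box extent) using OpenCV's KalmanFilter; the noise covariances are placeholders, not the authors' settings:

```python
import numpy as np
import cv2

def make_cv_kalman(x, y):
    """Constant-velocity Kalman filter over a 2-D quantity (e.g. centroid
    position or bounding-box width/height)."""
    kf = cv2.KalmanFilter(4, 2)                    # state: [x, y, vx, vy]
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
    kf.processNoiseCov = 1e-2 * np.eye(4, dtype=np.float32)      # placeholder
    kf.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)  # placeholder
    kf.statePost = np.array([[x], [y], [0], [0]], np.float32)
    return kf

class TrackedBlob:
    def __init__(self, cx, cy, w, h):
        self.centroid_kf = make_cv_kalman(cx, cy)   # centre of mass
        self.size_kf = make_cv_kalman(w, h)         # bounding-box extent

    def predict(self):
        return self.centroid_kf.predict(), self.size_kf.predict()

    def correct(self, cx, cy, w, h):
        self.centroid_kf.correct(np.array([[cx], [cy]], np.float32))
        self.size_kf.correct(np.array([[w], [h]], np.float32))
```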
Fig. 2. Overhead Tracking Results (Green squares represent tracked blobs from previous frames and yellow squares represent detected blobs in the current frame): (a)-(b) 3 persons moving around are tracked as 3 individual blobs, (c) Persons 2 and 3 merge and are tracked as a group, (d) Person 1 joins the group containing persons 2 and 3, (e) Person 1 leaves the group and the system is able to maintain the original id, (f) Person 1 has left the system altogether and Person 2 leaves the group
Once all the tracked blobs have been predicted, we match each predicted blob to the blobs detected in the current frame, in order to track blobs that have come close together and hence are detected as one blob. To determine this matching we use the dynamical graph matching technique presented by Chen et al. [15]. Specifically, given a set of detected blobs and a set of tracked blobs, we establish a weighting function between each detected and tracked blob, and then use graph-theoretic techniques to determine a minimum-weight matching between detected and tracked blobs. The weighting function we use between a detected blob and a tracked blob is directly proportional to the amount of overlap between the tracked blob’s predicted bounding box and that of the detected blob. This method allows us to effectively determine when two or more people have merged together into one blob. When two or more blobs merge, the system sets the velocity of both Kalman states for each blob to zero, and switches to a Lucas and Kanade optical flow tracker [16] that tracks the centre of mass of each individual blob within the merged blob. When
the merged blob splits again, the system switches back to the Kalman filter and the above dynamic graph-based tracking. The reason for this switching is that when the blobs merge, the bounding boxes of the merged blobs are ambiguous, so the previous graph matching method (which depends on the bounding-box areas) fails. In our system we found that a safe assumption is that when blobs merge, their centres of mass have very little movement, because people cannot move very fast in groups. Thus the optical flow point tracker, which only tracks the centre of mass, suffices during the merge. Figure 2 gives an example of the results achieved from overhead tracking. As can be seen, we are able to maintain a unique identity for each individual person from the time they enter to the time they exit the environment, regardless of whether they are in a group or not. 4.2 Fingertip Detection Once the individual blobs are identified within the overhead cameras, we perform fingertip detection and localization for each blob. To perform fingertip detection in the overhead cameras, we first compute a distance function mapping points on the contour to their distance from the centre of mass of the contour. Because the distance function is not necessarily smooth, we apply a box filter to remove segmentation noise. We search for extremal points in this function and label them as possible fingertips. Furthermore, the passive nature of the cameras allowed us to apply a distance threshold to remove some false positives. Figure 3 presents the result of this extremal point search. We further refine this coarse fingertip location by searching for extremal points on the blob boundary within a fixed-size window surrounding the detected fingertip point. This refinement step considerably improves the standard extremal point search used in [11], resulting in much smoother finger motion. The result of this step is given in Figure 3(b).
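To make the overhead fingertip search concrete, a sketch of the contour-distance function, box-filter smoothing and local-maximum selection follows; the window size and distance threshold are illustrative assumptions, not values from the paper:

```python
import numpy as np

def fingertip_candidates(contour, min_dist=40.0, smooth=5):
    """contour: (N, 2) array of boundary points of one blob.

    Returns indices of contour points that are local maxima of the
    (smoothed) distance to the blob's centre of mass and lie farther
    away than a distance threshold."""
    centre = contour.mean(axis=0)
    dist = np.linalg.norm(contour - centre, axis=1)
    # Box filter to suppress segmentation noise (treat the boundary as circular).
    kernel = np.ones(smooth) / smooth
    padded = np.concatenate([dist[-smooth:], dist, dist[:smooth]])
    dist_s = np.convolve(padded, kernel, mode="same")[smooth:-smooth]
    candidates = []
    for i in range(len(dist_s)):
        prev, nxt = dist_s[i - 1], dist_s[(i + 1) % len(dist_s)]
        if dist_s[i] > prev and dist_s[i] >= nxt and dist_s[i] > min_dist:
            candidates.append(i)
    return candidates
```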
Fig. 3. Results from extremal point search: (a) Resulting finger position in red. (b) Results from refinement step: Red cross is result from extremal point search, and green cross is the final result after refinement.
5 Fingertip Localization Using Side Cameras In the side cameras, a considerably different scenario is encountered. The side cameras are primarily used for fingertip detection. Although the overhead cameras do provide the (x, y) position of the fingertips quite accurately, for robust height information, it is necessary to obtain the finger’s position from a side view.
5.1 Pointing Finger Localization Kehl and van Gool [11] localized finger points in side cameras using the same local extrema search method used in the overhead cameras. However, we discovered that the morphological operators applied during the contour extraction step affect the fingertip points considerably. This results in fairly noisy finger measurements, leading to severely erratic motion of the fingertips after 3-D reconstruction. To resolve this issue, we used a different method of fingertip localization within the side cameras. Rather than trying to detect a single point to represent the fingertip, we found it much more robust to localize the entire hand of a person as a single line, and then label the tip of this line as a fingertip. This results in much cleaner fingertip motion detection than other methods. For each silhouette we first compute its horizontal projection [17]. We then determine regions within the silhouette that may contain hands, by separating the silhouette into three vertical components, where the middle component consists of the largest horizontal segment in the projection with values greater than the mean of the projection. This middle component is then the body component, and the left and right components are said to contain hands. In this step, we can also identify the head position of the person, which we assign to the topmost point of the middle box, lying on the principal axis of the middle box. These steps are shown in figure 4.
Fig. 4. Side camera fingertip localization. Left to right: Horizontal Projection, Hand Extents (Red circle represents detected head position), Distance Transform with Hand Paths marked in red, Final Results (red lines represent hand lines and green circles represent fingertips).
Once the extents of the hands have been detected, we need to determine the actual hand lines within the rectangular hand segments. We use the distance transform of the silhouette image for this purpose. A connected component algorithm is applied to the distance-transformed image to determine connected paths of high intensity. Any path whose length falls below a threshold is eliminated, and a line fitting algorithm is applied to the remaining paths. The tips of the lines are output as fingertip positions. Figure 4 shows the result of the path detection and hand limb localization. This technique of hand limb localization provides very good results in localizing fingertip points. We are able to identify fingertips even when a person is pointing with both hands on the same side. Figure 5 shows some of the results we have achieved. As the figure shows, the system is able to deal with multiple hands on the same side and can also handle a skewed body posture.
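The sketch below illustrates the horizontal-projection split and the distance-transform ridge described above; the thresholds, variable names and the ridge heuristic are our own assumptions rather than the authors' implementation:

```python
import numpy as np
import cv2

def split_body_and_hands(silhouette):
    """Split a binary silhouette (uint8, 0/255) into body and hand regions
    using its horizontal projection; thresholding at the projection mean is
    the heuristic from the text, the bookkeeping is our own."""
    proj = (silhouette > 0).sum(axis=0)            # foreground pixels per column
    above = proj > proj.mean()
    best, best_start, cur_start, cur_len = 0, 0, 0, 0
    for i, flag in enumerate(above):               # largest run above the mean
        if flag:
            cur_start = i if cur_len == 0 else cur_start
            cur_len += 1
            if cur_len > best:
                best, best_start = cur_len, cur_start
        else:
            cur_len = 0
    body = (best_start, best_start + best)          # [left, right) columns
    left_hand = (0, body[0])
    right_hand = (body[1], silhouette.shape[1])
    return body, left_hand, right_hand

def hand_ridge(silhouette, col_range):
    """Distance transform of a hand region; high-valued ridge pixels
    approximate the hand line whose tip is taken as the fingertip."""
    region = (silhouette[:, col_range[0]:col_range[1]] > 0).astype(np.uint8)
    dist = cv2.distanceTransform(region, cv2.DIST_L2, 3)
    return dist > 0.5 * dist.max()                  # illustrative threshold
```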
Fig. 5. Finger and head localization results in side cameras. Top row: Single person. Bottom row: Multiple People.
Although this method does suffer from occlusion when multiple people come close together in one camera, the occlusion issue is resolved when multiple cameras are used. When two or more people are occluded in a side camera and merge into one blob, the system detects all possible hands at the left and right extremes of the blob. The system does not identify hands with persons in the side cameras. This is instead achieved during the 3-D reconstruction phase, where finger points from the overhead cameras are matched to those from the side cameras. Because the overhead cameras have an unoccluded view of the finger points most of the time, this ensures that all pointing fingers are appropriately matched and reconstructed. We have also discovered that, given our setup of 8 side cameras, a pointing finger is extracted in at least one side camera in most cases. These two views from an overhead camera and one side camera are sufficient for a 3-D reconstruction.
6 3-D Reconstruction Once all the images are processed on the slave machines, the results are transferred to the master machine for 3-D reconstruction. We use epipolar geometry to determine point correspondence between 2D finger points. Sparse bundle adjustment is then used to recover the 3-D finger point using the corresponding 2D image points. Finally we use a Kalman Filter to track each reconstructed point over time. The use of the Kalman filter has the added benefit of smoothing out the motion of the points. Due to space limitations, and because it is out of scope for this research, we refer the reader to [18] for a more in-depth discussion on multiple view geometry algorithms. After the points have been reconstructed in 3-D we use the line-of-sight (the line joining the head and
Fig. 6. Final 3-D Reconstruction Results at a given instant in time: Small circle represents head position, big circle represents centre of gravity and crosses represent finger positions
hand positions of the user) and project it to the cylindrical screen to determine where the user is pointing. Figure 6 shows the final results of 3-D reconstruction within the system. As the figure shows, the system can reconstruct both head and finger positions of people even when they are occluded in some cameras.
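As a worked example of this final step (our own sketch; the 5 m radius follows from the 10 m diameter quoted in Section 2.1, and the coordinate conventions are assumptions), the line of sight through the head and fingertip can be intersected with the cylindrical screen:

```python
import numpy as np

def pointing_target_on_cylinder(head, finger, radius=5.0):
    """Intersect the ray from head through finger with a vertical cylinder
    of the given radius centred on the z-axis (z up). Returns the 3-D hit
    point and its azimuth, or None if there is no forward intersection."""
    head, finger = np.asarray(head, float), np.asarray(finger, float)
    d = finger - head
    # Solve |head_xy + t * d_xy|^2 = radius^2 for t > 0.
    a = d[0]**2 + d[1]**2
    b = 2.0 * (head[0]*d[0] + head[1]*d[1])
    c = head[0]**2 + head[1]**2 - radius**2
    disc = b*b - 4*a*c
    if a == 0 or disc < 0:
        return None                       # ray parallel to the axis or no hit
    t = (-b + np.sqrt(disc)) / (2*a)      # forward intersection with the screen
    if t <= 0:
        return None
    hit = head + t * d
    theta = np.arctan2(hit[1], hit[0])    # screen azimuth
    return hit, theta
```

The returned azimuth θ and the height component of the hit point correspond to the cylindrical screen coordinates used later when comparing against the inertial sensor.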
7 Experiments and Results Our system was implemented using 12 cameras and 4 machines (3 slaves and 1 master). The 12 cameras were separated into groups of 4, with each group sending input to a single 2.4GHz dedicated slave machine. These machines perform the 2D tracking, and each machine runs at real-time (approx. 20fps). An additional master computer is used to perform the 3-D reconstruction and tracking. The slaves connect to the master over a gigabit network connection. Because the master runs independently of the slave, and has the ability to predict the positions of tracked entities, the master is able to run at approx 60fps. Prior to this work, the environment already contained an interaction mechanism in the form of an inertial sensor. This sensor is attached to a wireless remote control, and users are able to interact with the display system by pointing the remote at the screen and clicking. This pointing device has proved to provide a satisfactory user experience,
Fig. 7. Results from comparison against inertial sensor. Grey values are from inertial sensor, white values are from pointing system. (a) Height vs. Time for up and down motion (height in meters) (b) θ vs. Time for side-to-side motion (theta in radians).
and has been used in numerous exhibitions and displays, demonstrating its usability and effectiveness. However, the inertial sensor’s drawbacks are that it is wired, it is expensive, only one person can use it at any one time, and only one pointer can be controlled. To demonstrate that our system could be used as a pointing device, we show that it is comparable to the inertial sensor in terms of the motion of the pointer. During the experiments, we asked a user to hold the inertial sensor in one hand and use that hand to point at the screen. We then asked the user to move the hand in a strictly horizontal motion, and then a strictly vertical motion, across the cylindrical screen. We recorded both the motion of the inertial sensor and the tracked finger position. The motions are converted to cylindrical coordinates, and we ignore the r component, since the radius of the cylindrical screen is fixed. We plot the θ component against time for the side-to-side motion, and the z component against time for the up-and-down motion. These plots are given in Figure 7. As can be seen from the figure, the pointing gesture recognition system is able to follow the inertial sensor system quite closely, and only differs by a constant translation, which can be subtracted easily. Figure 8 shows a real-world application that uses our pointing gesture recognition system. The algorithms presented allow a user to control a pointer-like object on the screen using their fingers, and interact with a virtual menu and other virtual objects within the scene. Preliminary tests have shown that the gesture recognition system is able to pick up and reconstruct pointing fingers of up to 5 people accurately within the environment. Full usability tests will be performed on the system to determine its full capabilities. The system has proved to be a usable and intuitive system in its current state. Our contribution to the literature was to combine multiple existing algorithms in a unique manner for a practical, real-world scenario. From purely visual observations, our results appear to be quite promising. Several different users have used the system and found that controlling the cursor with their hands was intuitive and easy. A full usability study and complete evaluation and testing are currently being performed on the system. Due to time and space limitations, we were not able to publish these results in this paper, but plan to do so in a later publication.
Fig. 8. Tracking system being used in a real world application: (a) shows a participant interacting with the display system using both hands to control two projected pointers (red circles) and (b) shows a participant controlling a menu item on a 3-D projected menu
8 Conclusion
We have presented a real-time, robust pointing gesture recognition system that builds upon and improves results from previous systems and extends previous work to allow multiple people to interact with a virtual world. The system is able to track markerless pointing gestures of multiple people in 3-D. Furthermore, the system has proved usable in a real-world immersive environment application. Future work will look into determining complete body postures of individuals and analyzing further gestures to enable additional interaction techniques.
Augmented Reality Using Projective Invariant Patterns
Lucas Teixeira, Manuel Loaiza, Alberto Raposo, and Marcelo Gattass
Tecgraf - Computer Science Department, Pontifical Catholic University of Rio de Janeiro, Brazil
{lucas,manuel,abraposo,mgattass}@tecgraf.puc-rio.br
Abstract. This paper presents an algorithm for using projective invariant patterns in augmented reality applications. It is an adaptation of a previous algorithm for an optical tracking device that works with infrared illumination and filtering. The present algorithm removes the necessity of working in a controlled environment, which would be inadequate for augmented reality applications. In order to compensate for the excess of image noise caused by the absence of the infrared system, the proposed algorithm includes a fast binary decision tree in the process flow. We show that the algorithm achieves real-time rates.
1 Introduction
Augmented Reality (AR) is a research area interested in applications that enrich the visual information of the real environment where the users interact. Computer vision algorithms are commonly used in AR applications to support the detection, extraction and identification of markers in the real scene and to continuously track these markers, allowing users to change their points of view. The definition of what may be considered a marker has changed as tracking algorithms evolve; patterns of points, planar patterns and specific characteristics of the tracked object may all be considered markers. However, the main requirement for markers remains the same: they should have a format that can be easily identified. We propose an algorithm for tracking a pattern based on projective invariants and demonstrate its good performance in the identification and real-time tracking of the pattern. The proposed algorithm is based on the algorithm presented in [1], which was developed to support an optical tracking system that works in illumination-controlled environments, since it is based on infrared (IR) filtering. We propose changes to this algorithm to allow its use in non-controlled environments for the implementation of AR applications. The main adaptation is the use of a binary decision tree classifier to eliminate a significant part of the false markers that appear in the image due to the absence of IR filtering.
2 Related Work
One of the most common solutions for object identification in AR is the use of ARToolKit-like planar patterns [2].
Fig. 1. The same image with different illumination and filtering setups: (a) IR pass filter; (b) no filter + IR light; (c) IR cut filter
In addition to the identification of objects, this kind of pattern also allows the camera to be calibrated in relation to the pattern, enabling the easy insertion of virtual objects over or around the pattern. One of the drawbacks of this kind of pattern is that it is very “invasive”, becoming very apparent in the scene. A different and more restrictive solution is given by systems that use IR illumination together with cameras equipped with IR pass filters (filters that allow only the passage of high wavelengths) to detect retro-reflective patterns in the scene. This combination of IR light and filters was used in [1] to develop a technique for pattern identification and tracking. The problem with doing AR with a system that uses IR pass filters is that the captured image looks like a night-vision goggles image, which is very different from the users’ normal vision – Figure 1(a). An attempt to use the above technique in an AR application was presented in [3]. In that work, the solution to obtain a more “realistic” image was to use a lens without an IR filter, a weak IR illuminator to accentuate the retro-reflective markers, and fluorescent ambient light (which does not emit IR). That work also presents a technique to project virtual objects using patterns of collinear points. This approach, however, presented two problems. The first is that the image is still perceptibly altered due to the absence of the IR cut filter (a filter that does not allow the passage of high wavelengths), present in any regular camera. This difference is shown in Figures 1(b) and 1(c). The second problem is that the absence of an IR pass filter introduced a lot of noise into the image, which increased the time spent finding the patterns in the scene and restricted the speed of camera movement at which real-time rates could be kept. The present work proposes a solution for the above-mentioned problems and explores the use of collinear patterns to support AR applications. Examples of applications where this proposal may be used are those that add or visualize information over previously marked objects, where the main problem is otherwise the need for large patterns in the scene, falling back into the “visual invasion” problem of ARToolKit-like patterns.
3 Original Algorithm for Projective Invariant Patterns
In this section we explain the algorithm originally proposed in [1], upon which the algorithm proposed here is based. The original algorithm will be referred to as APIP (Algorithm for Projective Invariant Patterns).
3.1 Tracking Algorithm
APIP was developed to support the process of creation, identification and tracking of specific patterns used in an optical tracking system. This kind of system works in a restricted interaction environment, defined by the properties of the IR light and retro-reflective materials used to accentuate the markers against the rest of the scene. APIP may be summarized as follows:

  Input image of each captured frame
  Binarize the image
  Extract circular areas based on contours found in the image
  Group circular areas using a quadtree
  For each quadtree branch
      Generate test patterns of 4 markers by combining the circular areas
      For each generated pattern
          Execute the collinearity test
          If collinear
              Compute the value of the projective invariant for this pattern
              Compare it with the value of the wanted pattern
              If equal
                  Save the set of markers that compose the pattern
  If the pattern is recognized
      Create a bounding area around the markers so that in the next frame
      the markers may be extracted inside this area
3.2 Main Techniques Used by APIP
Projective Invariants: The use of projective invariant properties has been discussed especially in the area of pattern recognition [4]. It is based on the invariance of the cross ratio, particularly in the case of perspective projection. The cross ratio property states that if we have 4 collinear points (A, B, C, D), we can define a cross ratio value based on the distances between these points according to the following relationship:

CrossRatio(A, B, C, D) = (|AC| / |BC|) / (|AD| / |BD|).

Collinearity: Collinearity is a characteristic invariant under perspective projection that is exploited in the proposed tracking application. Besides providing a unique identifier extracted from projective invariant theory, it defines a specific format that can be used as a filter to discard the many candidate groups of 4 points that do not fit this format.

Bounding Area: A second technique used to reduce computational cost is the generation of a bounding area around the pattern position in the current frame. Once the pattern is found and validated, the system creates a bounding area, used as a simple way to predict and restrict the region where an already recognized pattern may appear in frame t+1, based on information from frame t.
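For illustration, the following sketch computes the cross ratio of four collinear 2D points and uses it to test whether a candidate group matches a stored pattern value. The tolerance and the point ordering convention are our assumptions rather than values taken from [1].

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross ratio (|AC|/|BC|) / (|AD|/|BD|) of four collinear points."""
    ac = np.linalg.norm(c - a)
    bc = np.linalg.norm(c - b)
    ad = np.linalg.norm(d - a)
    bd = np.linalg.norm(d - b)
    return (ac / bc) / (ad / bd)

def matches_pattern(points, reference_value, tol=0.05):
    """Check a candidate group of four ordered, collinear points."""
    a, b, c, d = [np.asarray(p, dtype=float) for p in points]
    return abs(cross_ratio(a, b, c, d) - reference_value) < tol

# A perspective projection preserves the cross ratio of collinear points.
pts = [np.array([0.0, 0.0]), np.array([1.0, 0.0]),
       np.array([3.0, 0.0]), np.array([7.0, 0.0])]
print(cross_ratio(*pts))  # (3/2) / (7/6) = 1.2857...
```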
3.3 Limitations
APIP was developed to work in an environment where image noise problems are reduced by IR filtering. This avoids the generation of false markers that would hamper its real-time performance. The proposal of the present paper is to adapt the algorithm so that it can be used in environments without the IR restrictions. The main problem with this new approach is that the number of false markers increases due to the absence of the IR system (IR pass filter + IR illumination). We show that with some modifications to APIP we can quickly discard these false markers (or at least the majority of them), keeping the real-time performance necessary in AR applications.
4 Adapted APIP
In order to treat the problem of increased noise, we developed a marker classifier based on machine learning. This classifier is a binary decision tree (BDT) that is used to eliminate the majority of false markers. The BDT is necessary because the ellipse fitting algorithm [5] simply fits the best ellipse to a connected component of any shape. In addition to the BDT, we introduce a restriction on the size of the pattern in order to aid the optimization in the specific case of using the projective invariant technique in AR. Regarding illumination, the system simply assumes that the light sources are static or that only ambient illumination is present. Figure 2 shows an overview of the adapted APIP.
4.1 Binary Decision Tree Classification
A BDT is a fast and broadly used form of classification [6]. The tree is generated by passing a list of examples to a generic decision tree generation algorithm.
Fig. 2. Adapted APIP process flow
Each example is a record of the universe about which we want to learn. It is formed by the attributes that describe the example in that universe and the class to which it belongs. Algorithms that require the correct class to be included in each training example are called supervised algorithms. The intermediate nodes of the tree hold a reference to an attribute and two child nodes. Leaf nodes store only one of the possible classes in the universe. In order to use the tree to classify a sample (i.e., an example without an associated class), we traverse it from the root and, at each intermediate node, we use the attribute associated with this node to decide which child to go to. When we arrive at a leaf node, its associated class is the result of the classification. In this section we describe how to construct a BDT capable of eliminating the majority of false markers in a segmented image. Initially we explain how to choose the attributes that describe an elliptical marker. Then, we show how to generate the training examples (corpus) and suggest a technique to increase the robustness of the corpus. We then define the tree generation algorithm we use and explain how to use the tree, keeping it robust to variations in marker scale.

Attribute Generation: APIP does not restrict the kind of marker, but to create a smaller and more robust BDT we restrict ourselves here to elliptical markers. We can use circular or spherical markers in the scene, which will appear as ellipses in the image due to perspective distortion. In order to describe segmented image areas with discrete attributes, we could use complex methods based on adjacent angles, such as the one proposed in [7]. These methods achieve good results, but attribute generation is expensive, on average 20 ms per frame. However, those methods were proposed to identify complex shapes such as hand signs, whereas we are looking for ellipses, which are much easier to describe. In order to increase performance, we use an aligned patch of the binary image of size Δ × Δ around the centroid of the best ellipse that fits each segmented area. Therefore, a sample has Δ² binary attributes that indicate whether each pixel of the patch is white or black.

Generation of the Training Examples: The set of training examples is generated by recording a movie, with IR cut lenses, over the interest area containing the patterns and the background where the AR application will take place. To reduce the size of the BDT, we use a reduced dataset, since we need the classification to be as fast as possible. However, we accept the possibility of having to retrain for different applications. The movie must be recorded with smooth camera movements, so that the bounding area does not lose the pattern, which would demand reinitialization, a slow process. After that, the invariants' values are measured in some manually chosen frames with different points of view. With the measured values, we start the APIP process of extracting and classifying the points in each frame as marker or non-marker. Although this classification is 100% reliable, some segmented areas classified as non-marker by APIP may have a shape very similar to an ellipse. Since we do not want these cases to confuse the BDT training, we use a stronger elliptical shape detection algorithm [8] to remove
them from the training. We could have changed the class of these points instead of removing them, but then we would be assuming the error of the ellipse algorithm, which may identify ellipses that are not very similar to those we want, potentially increasing the size of the BDT unnecessarily. Finally, we generate a file with the attributes of the remaining points and their corresponding classes. This file is the input to the decision tree generation algorithm.

Oversampling: We showed that from the binarized image and the detected ellipses we generate the samples to be classified by the decision tree. In this process, each ellipse generates a single sample. In order to generate more training samples without having to capture more images, we use a sample extrapolation process. In [9] homographies were used to extrapolate samples of a patch from other perspectives, but this works well only in planar situations, which is insufficient for our system. Therefore, we use a more limited hypothesis, derived from the observation that a patch around a centroid that is not aligned with the x and y axes is equivalent to a camera rotation around the “roll” axis.

The Decision Tree Learning Algorithm: Rosten and Drummond [10] succeeded in using the ID3 algorithm [11] to classify points using knowledge of neighborhood colors. We therefore chose the C4.5 algorithm [12] (an extension of ID3) as the decision tree generator. The learning algorithm generates a tree that can be used directly with the aligned patches of the points detected in real time, but in order to increase the robustness of the process with respect to scale, an additional step was added. The tree, in principle, classifies ellipses of the sizes it was trained on. However, there are cases of ellipses larger than those used in training, which happens when the user gets closer to the interest area. In such cases we use the ellipse's bounding box to calculate a factor that reduces the ellipse to a size slightly smaller than the Δ × Δ patch. This factor guarantees that the ellipse reaches a size compatible with the training and that its contour is well defined.
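A rough sketch of the classification step is given below. It extracts an axis-aligned Δ × Δ binary patch around each fitted ellipse centroid, flattens it into Δ² binary attributes and filters the ellipse list with a previously trained decision tree. We use scikit-learn's CART-style DecisionTreeClassifier as a stand-in for the C4.5 tree described in the text, and the rescaling of over-sized ellipses is only indicated by a comment; both are our simplifications.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

DELTA = 15  # patch side length used in the paper's experiments

def patch_attributes(binary_image, cx, cy, delta=DELTA):
    """Flatten an axis-aligned delta x delta binary patch around (cx, cy)."""
    h, w = binary_image.shape
    half = delta // 2
    x0, y0 = int(round(cx)) - half, int(round(cy)) - half
    patch = np.zeros((delta, delta), dtype=np.uint8)
    xs = slice(max(x0, 0), min(x0 + delta, w))
    ys = slice(max(y0, 0), min(y0 + delta, h))
    patch[ys.start - y0:ys.stop - y0, xs.start - x0:xs.stop - x0] = binary_image[ys, xs]
    return patch.ravel()

def filter_ellipses(binary_image, ellipses, tree):
    """Keep only the ellipses the tree classifies as markers.

    `ellipses` is a list of (cx, cy, bbox_w, bbox_h) tuples; ellipses larger
    than the training size would be shrunk towards delta x delta first
    (omitted here for brevity).
    """
    kept = []
    for (cx, cy, bw, bh) in ellipses:
        attrs = patch_attributes(binary_image, cx, cy)
        if tree.predict(attrs.reshape(1, -1))[0] == 1:
            kept.append((cx, cy, bw, bh))
    return kept

# Training on a hypothetical corpus of labelled patches (1 = marker).
X_train = np.random.randint(0, 2, size=(1000, DELTA * DELTA))
y_train = np.random.randint(0, 2, size=1000)
tree = DecisionTreeClassifier().fit(X_train, y_train)
```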
4.2 Pattern Maximum and Minimum Sizes
The excess of noise may generate false markers that pass the previous tests. To reduce this possibility even further, we define maximum and minimum sizes for the pattern. We assume that in an AR application there is normally a need for space for a virtual object to be inserted into the scene, which allows the definition of a maximum size for the pattern. For the minimum size, we use the observation that when the object is too far away, the possibility of detecting the pattern is small. Since this filter is applied before the ordering of the points, we do not yet have the organization of the pattern, but just a list of points. Therefore, we perform these tests by calculating the minimum and maximum x and y of the list of points, which define the bounding box of this group of points.
5 Results
Real-time performance is an important requirement of AR applications. For this reason, the evaluation of the proposed algorithm focuses on this requirement. We assume that a real-time AR application must be visualized at rates equal to or greater than 30 fps, which means that the processing time cannot exceed 33 ms. The processing of an AR application includes not only the pattern identification and tracking, but also complementary processes, such as the insertion of virtual objects into the scene. Therefore, we assume that the maximum execution time allowed for the proposed algorithm is 15 ms, i.e., half of the maximum processing time allowed. In the following we present an analysis of the individual time costs of the algorithm's key processes. At the end we evaluate the limits on parameters and conditions required to meet the 15 ms restriction.
5.1 Image Loading and Marker Extraction
In this section we present measured times for the processes of loading the image buffer and segmenting and extracting elliptical markers in a synthetic image with the following characteristics: a base circular marker, called type 1, and a marker of type 2 (with double the size of type 1). Some results of these measurements are presented in Table 1.

Table 1. Time (in milliseconds) spent to load and detect circular markers

Number of markers    1      10     20     30     40     50
Markers type 1       1.718  1.891  2.078  2.266  2.437  2.625
Markers type 2       1.719  1.953  2.219  2.468  2.703  2.953
Markers type 1+2     ——     1.922  2.140  2.360  2.593  2.813
Based on the results presented in the table, we may calculate the average relative time to extract a marker as 0.02275 ms. The results also show that the loading and extraction times vary depending on the number of markers and their sizes, but the average time estimate indicates the maximum number of markers that may be treated in each frame.
5.2 Collinearity and Projective Invariant Tests
We also measured the average times for the collinearity test and for the calculation and comparison of the projective invariant value of a collinear pattern: the mean time for the collinearity test was 0.0000703 ms, and for the projective invariant test with ordering of the pattern points it was 0.0001600 ms. These times were calculated as an estimate of the time to evaluate a single collinear pattern composed of 4 markers. As the total number of patterns to be analyzed is given by the number of combinations of 4 out of the number of markers detected in the image, it is possible to define a formula for the total
cost (in time) and the maximum number of markers that can be present in an image: C(n, 4) · (t_collinearity + t_invariant). As may be seen from this formula, the algorithm must give priority to reducing the number of false markers passed on to the pattern generation process. A linear increase in the number n of detected markers causes a quartic increase in the number of combinations of points to be tested as possible patterns. For this reason, the implementation of the BDT test was necessary to remove a large number of false markers.
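As a quick illustration of this budget, the snippet below evaluates C(n, 4) · (t_collinearity + t_invariant) for increasing n and reports the largest marker count that still fits a given time budget; the 5 ms budget used here is an arbitrary example, not a figure from the paper.

```python
from math import comb

T_COLLINEAR = 0.0000703   # ms per candidate pattern (measured above)
T_INVARIANT = 0.0001600   # ms per candidate pattern (measured above)

def pattern_test_cost_ms(n_markers):
    """Worst-case time to test all 4-marker combinations."""
    return comb(n_markers, 4) * (T_COLLINEAR + T_INVARIANT)

budget_ms = 5.0  # example budget for this stage only
n = 4
while pattern_test_cost_ms(n + 1) <= budget_ms:
    n += 1
print(n, pattern_test_cost_ms(n))
```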
5.3 Cross Validation
Results were obtained through 5-fold cross-validation tests. The datasets used were five training films of the same scene, recorded with smooth camera movements throughout the scene. This scene contains the objects shown in Figures 3 and 1. Each video is around 16 seconds long. We could not use randomly chosen groups of equal sizes because of the need to have different perspectives for the training. The Δ used was 15. Classification results are described by four indicators: number of true markers (TM), number of false markers (FM), number of true non-markers (TNM) and number of false non-markers (FNM). The size of the pruned tree is referred to as PTSize, and the quality measures are defined by

Precision_NM = TNM / (TNM + FNM),    Recall_NM = TNM / (TNM + FM).
Table 2. 5-fold cross validation

Fold  PTSize  #frames  TM    FM  TNM    1-Precision_NM  Recall_NM
1     77      611      2424  20  85358  0.000199        0.999766
2     81      527      2099  9   76144  0.000276        0.999882
3     81      403      1601  11  47126  0.000297        0.999767
4     77      438      1715  37  49937  0.000400        0.999260
5     89      534      2122  14  68206  0.000337        0.999795
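The two quality measures follow directly from the four counts, as in the short sketch below; the counts used here are hypothetical and for illustration only, not values taken from Table 2.

```python
def precision_nm(tnm, fnm):
    """Precision_NM = TNM / (TNM + FNM)."""
    return tnm / (tnm + fnm)

def recall_nm(tnm, fm):
    """Recall_NM = TNM / (TNM + FM)."""
    return tnm / (tnm + fm)

# Hypothetical counts for illustration only.
TNM, FNM, FM = 85000, 17, 20
print(precision_nm(TNM, FNM), recall_nm(TNM, FM))
```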
We use the BDT to eliminate non-marker ellipses, which is why we analyze its recall and precision. The main requirement of the BDT training is that precision be very high, because when the BDT produces false non-markers, the projective invariant test cannot find the pattern. To achieve real-time rates, recall needs to be high enough that only a manageable number of false markers remains for the projective invariant test. The results in Table 2 indicate that these requirements are met.
5.4 Case Study
In this section, we test a real case using the BDT training described in the previous section. We recorded a video of 25 seconds with random movements to allow the capture of diverse views of the scene shown in Figure 3.
Fig. 3. Scene objects
Fig. 4. Number of markers detected per frame and valid markers remaining after filtering by the BDT (number of markers vs. frame number); black lines mark frames where the pattern was lost
In the graph presented in Figure 4, we show that all captured views contain more than 60 markers in the ellipse list, and after filtering by the BDT this decreases to an average of 7 markers in the filtered ellipse list. Black lines in the graph indicate views where markers of the pattern were lost, invalidating the detection. For our experiments we used a modest 1.2 GHz tablet PC, and we obtained the following times: for the execution of the BDT we achieved less than 1 ms per analyzed frame for all frame samples. We also made synthetic tests for the worst case of Δ = 15 (225 attributes to test) and defined a sample of 400 detected markers per frame. The time needed to filter this sample was 2.4 ms. We conclude that, for the tested hardware, real-time rates are achieved if the BDT, the collinearity filter and the size filter let no more than 25 markers pass on to evaluation by the projective invariants.
6 Conclusion
We presented a new algorithm for the detection and tracking of projective invariant patterns, which are used as a support tool for developing AR applications. The advantages of the proposed algorithm are that it uses less “invasive” patterns than planar ones and that it allows detection and tracking in non-controlled environments, unlike IR-based systems. We calculated the computational costs of all the processes that compose the algorithm and showed that we can achieve real-time rates. We expect to implement complementary algorithms to enhance the identification and tracking process, such as a camera calibration algorithm that uses the information about the markers that compose the patterns.
References
1. Loaiza, M., Raposo, A., Gattass, M.: A novel optical tracking algorithm for point-based projective invariant marker patterns. In: ISVC (1), pp. 160–169 (2007)
2. Kato, H., Billinghurst, M.: Marker tracking and HMD calibration for a video-based augmented reality conferencing system. In: IWAR, pp. 85–94 (1999)
3. Teixeira, L., Loaiza, M., Raposo, A., Gattass, M.: Hybrid system to tracking based on retro reflex spheres and tracked object features (in Portuguese). In: X Symposium on Virtual and Augmented Reality - SVR, pp. 28–35 (2008)
4. Meer, P., Lenz, R., Ramakrishna, S.: Efficient invariant representations. International Journal of Computer Vision 26, 137–152 (1998)
5. Fitzgibbon, A., Pilu, M., Fisher, R.: Direct least squares fitting of ellipses. IEEE Trans. Pattern Analysis and Machine Intelligence 21, 476–480 (1999)
6. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 4–37 (2000)
7. Cho, K., Dunn, S.: Learning shape classes. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 882–888 (1994)
8. Aguado, A., Nixon, M.: A new Hough transform mapping for ellipse detection. Technical Report, 1995/6 Research Journal (1995)
9. Ozuysal, M., Fua, P., Lepetit, V.: Fast keypoint recognition in ten lines of code. In: Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8 (2007)
10. Rosten, E., Drummond, T.: Machine learning for high-speed corner detection. In: European Conference on Computer Vision, vol. 1, pp. 430–443 (2006)
11. Quinlan, J.R.: Induction of decision trees. Machine Learning 1, 81–106 (1986)
12. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning. Morgan Kaufmann, San Francisco (1993)
Acquisition of High Quality Planar Patch Features Harald Wuest, Folker Wientapper, and Didier Stricker Fraunhofer Institute for Computer Graphics (IGD) Darmstadt, Germany
[email protected]
Abstract. Camera-based tracking systems which reconstruct a feature map with structure-from-motion or SLAM techniques depend highly on the ability to track a single feature at different scales, under different lighting conditions and over a wide range of viewing angles. The acquisition of high quality features is therefore indispensable for the continuous tracking of a feature over the maximum possible range of valid appearances. We present a tracking system in which not only the position of a feature but also its surface normal is reconstructed and used for precise prediction and for tracking recovery of lost features. The appearance of a reference patch is also estimated sequentially and refined during the tracking, which leads to a more stable feature tracking step. Such reconstructed reference templates can be used for tracking a camera pose over a great variety of viewing positions. This feature reconstruction process is combined with a feature management system, in which a statistical analysis of the ability to track a feature is performed, and only the most stable features for a given camera viewing position are used for the 2D feature tracking step. This approach results in a map of high quality features, where real-time capability can be preserved by tracking only the most necessary 2D feature points.
1 Introduction
It has been shown that, with a given textured 3D model, robust and drift-free camera tracking for augmented reality applications is possible. However, tracking systems often need to work in partially known scenarios as well. In this paper we focus on the generation of a model which consists of planar textured patches and is used for intensity-based feature tracking. The goal of our approach is a high quality feature map consisting of planar patch features which can be tracked at different scales and from different camera viewing directions, and which are invariant to illumination changes. Such features are useful for stable and drift-free tracking, but for a precise reconstruction process it is also necessary to track a point feature for as long as possible. In contrast to Klein [1], where thousands of low quality features are tracked to estimate the camera pose, in this paper we focus on the acquisition of high quality features which are used for real-time camera-based tracking.
Our work is based on the tracking system presented in [2]. A template update strategy is presented which results in a more correct photometric representation of a surface patch. Furthermore, we add the estimation of the surface normal of a patch to the reconstruction process. With such a reconstructed surface normal it is possible to make a precise prediction of the transformation of a template patch which has been temporarily lost. A method for reconstructing the surface orientation of a planar region is presented by Molton et al. [3]. They estimate the surface normal by a gradient-based image alignment technique, where the parameters of the normal vector are used as the degrees of freedom during the minimization of intensity differences. Our method is similar to the one presented by Favaro et al. [4], where an extended Kalman filter is used to iteratively refine the estimate of the surface normal orientation. Since iterative alignment is expensive, especially because of the bilinear interpolation of image intensities, the number of features is kept to a minimum by tracking only as many features as necessary for a robust pose estimation. We integrated the feature management approach introduced in [5] into our tracking system, where a probability distribution is created for every feature point, which indicates whether a feature can be tracked successfully from a given camera position.
2 Planar Patch Tracking
Optical-flow-based alignment techniques like the well-known KLT tracker [6] have been widely used to track objects, faces [7] or single feature points [8]. They are all based on the assumption of intensity conservation over time. If I(x, t) is the image intensity at the pixel position x at time t and I(x, 0) = T(x) is a template image, which is extracted from an image when it is first observed, then the goal of the image alignment is to minimize

Σ_x [ I(g(x; p)) − T(x) ]²,    (1)
where g(x; p) is a warping function of an image point x and p is the parameter vector of the warp. Changes of illumination can be modeled by adding a contrast compensation factor λ and a brightness correction value δ. The error to be minimized can then be written as

Σ_x [ λ I(g(x; p)) + δ − T(x) ]².    (2)
Jin et al. [8] use this model with an affine warping function to track feature points robustly in image sequences. For a correct modeling of a projective camera, the warp function g(x; p) can also be represented by a homography [7]. The more precise alignment using a homography warp comes with higher computational costs due to the larger number of warp parameters.
With an increasing number of parameters, the convergence of the minimization of intensity differences becomes less stable. To prevent the minimization from diverging, the alignment can first be carried out with a reduced set of parameters and then, after a successful minimization, with the full set of parameters. We use an approach similar to that presented by Zinßer [9], where only the translation of a feature is estimated first and then the full affine warp with brightness compensation is applied. In order to track a feature point at different scales, a patch template is acquired at several resolution levels of the image pyramid. For the template feature alignment, the resolution level whose scale is most similar to that of the predicted warp function g(x; p) is always selected. If the warp consists of an affine transformation with affine matrix A, the resolution level l is selected such that the following inequality holds:
(3)
where for the threshold t we choose t = 0.8 which means that the scale of the affine transformation is always in the range between 0.8 and 1.6 The idea of rather choosing a lower resolution level is motivated by the fact that sampling artifacts are smaller at larger scales, which leads to a smaller error of intensity differences. The desired resolution level of the template does not always exist right from the beginning. This can happen if the camera moves towards an object and the warp function of the template patch increases in scale. To be able to continue the tracking in such cases, the stack of template patches is extended with a resolution level which does not exist yet. After a successful tracking step patches are therefore extracted out of the current image with the same warping and illumination parameters as the patches of the other resolution levels. With this extension of the resolution levels of the initial reference patch it is possible to track a single feature in a sequences with strong changes in scale.
3 Updating the Template
Updating the template is often avoided because small alignment errors can accumulate and the template might drift away from the initially extracted patch. However, the initial patch is not always the best visual representation of a surface, since reflections, shadows or strong camera noise can result in a poor representation of an extracted patch. Matthews et al.[10] propose a strategy, where the template is updated, if the parameter difference of the warps from the initial template and the current template is smaller than a given threshold. Thereby the reference template is replaced by the current region of the image, if the parameters of the alignment to the initial patch do not differ significantly from the warp parameters to the current template. In our system we propose a method which does not replace the whole reference template but updates the template image by calculating an incremental intensity
mean for every pixel. To avoid drift, an update is only performed if the alignment was successful, i.e. if the sum of squared differences between the reference patch and the current patch is small enough. If C(x) is the number of measurements which have contributed to the mean intensity of a pixel, then the incremental mean can be computed with

T(x) = [ I(g(x; p)) + C(x) T(x) ] / (C(x) + 1).    (4)
After updating a pixel of the template, the value C(x) is incremented by 1. Every pixel of the reference patch needs its own contribution counter C, because it is not guaranteed that the whole patch can always be updated. This can happen if the patch does not lie completely inside the image, which is often the case especially at higher image pyramid levels. With this update method it is not only possible to refine the intensity values of the initially extracted patch, but also to extend areas of the patch which could not be initialized because parts lay outside the image when the feature was first observed. Since the contribution of the current image intensities becomes lower with every update, the influence of the first images never gets lost and drift is avoided. If the camera does not move, applying an update of the template in every frame may not lead to the desired results, since the feature is always observed from the same viewing direction and reflections might become incorporated into the reference template of the patch. We only perform an update if a significant amount of camera translation has occurred since the last template update. Thereby it is guaranteed that the appearance input for an update step consists of observations from varying camera viewing positions.
4 Template Mask Generation
Sometimes an extracted patch does not lie entirely on a planar surface. Pixels which are not part of the surface do not contribute usefully to the alignment step. For the acquisition of a high quality template it is desirable to generate a mask for a patch which selects only those pixels that really lie on the planar surface of an object. Our approach for the generation of a template mask relies on an analysis of the intensity variance of each pixel. Similarly to the incremental mean computation, the update of the intensity variance S²(x) of a pixel can be approximated by

S²(x) = [ (I(g(x; p)) − T(x))² + C(x) S²(x) ] / (C(x) + 1).    (5)
To decide whether a pixel is used for the tracking step, a mask M(x) is created by

M(x) = 1 if S²(x) < c, and 0 otherwise.    (6)

A good value for the threshold c depends on the camera noise and must be chosen experimentally.
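The per-pixel update of the template mean (eq. 4), variance (eq. 5) and mask (eq. 6) can be written compactly as in the sketch below. The threshold value is a placeholder, since the paper states that c must be chosen experimentally, and computing the squared difference against the previous mean is one possible reading of eq. (5).

```python
import numpy as np

def update_template(T, S2, C, warped_patch, valid, c=25.0):
    """Incrementally update template mean T, variance S2 and counter C.

    `valid` marks pixels of the warped patch that lie inside the image;
    c is a placeholder threshold for the mask of eq. (6).
    """
    upd = valid.astype(bool)
    diff = warped_patch - T                                       # uses the previous mean
    T = np.where(upd, (warped_patch + C * T) / (C + 1.0), T)      # eq. (4)
    S2 = np.where(upd, (diff * diff + C * S2) / (C + 1.0), S2)    # eq. (5)
    C = C + upd                                                   # per-pixel contribution counter
    mask = (S2 < c).astype(np.uint8)                              # eq. (6)
    return T, S2, C, mask
```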
With the mask M(x), the term to minimize for the robust alignment can be written as

Σ_x M(x) [ λ I(g(x; p)) + δ − T(x) ]².    (7)
A disadvantage of updating the template or changing the mask is that the intensity gradients and the inverse of the Hessian matrix, which are needed for the alignment process, have to be recomputed after every update.
5 Reconstruction of Surface Normals
5.1 Relation between Pose Difference and Image Homography
We consider two camera frames, where (R_0, t_0) and (R_1, t_1) are the pose parameters of the first and the second camera, respectively. The world coordinates of a 3D point M_w are transformed into the two camera coordinate systems by

M_c0 = R_0 M_w + t_0,    (8)
M_c1 = R_1 M_w + t_1.    (9)

The rotation difference ΔR and the translation difference Δt can be computed by

ΔR = R_1 R_0^T,    (10)
Δt = t_1 − R_1 R_0^T t_0.    (11)

If n_w is a unit normal vector of a plane P in the world coordinate system, this normal vector can be transformed into the coordinate system of the first camera by

n_c0 = R_0 n_w.    (12)

The linear transformation of a 3D point on the plane P from the first camera coordinate system into the second camera coordinate system can be described by

M_c1 = H M_c0,    (13)

where

H = ΔR + (Δt n_c0^T) / d    (14)

denotes a homography mapping from M_c0 ∈ R³ to M_c1 ∈ R³. The distance d is computed by d = n_c0^T X_p for any 3D point X_p lying in the plane P. For further processing, the image homography H_I has to be transformed with the given intrinsic camera matrix K into the camera coordinate system with

H_L = K^{-1} H_I K.    (15)

The homography H_L, the motion parameters of the camera (ΔR, Δt) and the structure parameters (n, d) are related by

H_L = λ H = λ (ΔR + (Δt n^T) / d).    (16)
The scale factor λ between H_L and H can be computed with

|λ| = σ_2(H_L),    (17)

where σ_2(H_L) is the second largest singular value of H_L. The proof can be found in [11], p. 135. By assuming that the normal vector ñ_c0 = d n_c0 is non-unit, the following equation must be satisfied:

(1/λ) H_L − ΔR = Δt ñ_c0^T.    (18)

Transposing this equation and multiplying it with Δt results in the least squares solution for ñ_c0:

ñ_c0 = (1 / (Δt^T Δt)) ((1/λ) H_L − ΔR)^T Δt.    (19)

The unit normal n_c0 can be determined by normalizing ñ_c0. Since this equation computes the normal vector n_c0 in the coordinate system of the first camera, it must be transformed into the world coordinate system by

n_w = R_0^{-1} n_c0.    (20)

Now the vector n_w can be regarded as a measurement of the surface normal, which will be used for a robust estimation of the surface orientation with a linear Kalman filter.
The image homography H_I used in equation (15) has to be related to the tracked homography H_tracked. A feature is always initialized as a square in the image. With the 2D image position p = (p_x, p_y)^T, the initial homography is set to

H_init = [ 1 0 p_x ; 0 1 p_y ; 0 0 1 ].    (21)

If H_tracked is the homography which results from the template alignment, the homography transforming the initial patch to the current patch can be computed by

H_I = H_tracked H_init^{-1}.    (22)

If the camera has a small field of view, the measured H_tracked can also be approximated by an affine transformation.
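The decomposition in equations (15)–(20) can be prototyped with a few lines of linear algebra, as sketched below; the function signature and variable names are ours, the sign ambiguity of λ and the subsequent Kalman filtering of successive measurements are ignored.

```python
import numpy as np

def normal_from_homography(H_img, K, R0, t0, R1, t1):
    """One measurement of the patch normal in world coordinates (eqs. 15-20)."""
    dR = R1 @ R0.T                                   # eq. (10)
    dt = t1 - R1 @ R0.T @ t0                         # eq. (11)
    H_L = np.linalg.inv(K) @ H_img @ K               # eq. (15)
    lam = np.linalg.svd(H_L, compute_uv=False)[1]    # second largest singular value, eq. (17)
    M = H_L / lam - dR                               # left-hand side of eq. (18)
    n_tilde = (M.T @ dt) / float(dt @ dt)            # eq. (19), scaled normal in camera 0
    n_c0 = n_tilde / np.linalg.norm(n_tilde)
    return R0.T @ n_c0                               # eq. (20); R0^-1 = R0^T for a rotation
```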
5.2 Quality of the Normal Estimation
In equation (14) it can be seen that if the translation vector Δt is zero, the normal vector n_c0 can be arbitrary and still satisfy the equation; the homography then simply consists of the camera rotation ΔR. Therefore it is essential for the estimation of the surface normal that a significant translation occurs between the different camera frames.
Since the translation vector Δt depends on the scale of the scene, it cannot be used directly as a quality measure. To avoid scaling considerations, we use the norm

‖ (1/λ) H_L − ΔR ‖    (23)

as a measure of the camera translation which is necessary to generate the homography H_L. This norm is used as a measurement quality for a linear Kalman filter which estimates the normal orientation of a patch over time.
6 Prediction of Lost Features
If a feature is lost because it was occluded or moved out of the image, all the tracking parameters of the feature need to be predicted from the current camera pose so that the tracking of the template patch can succeed. With the reconstructed surface normal n_w, the 3D position M_w and the parameters of the current camera, the homography of an image patch can be calculated. Since the equation

n_c0^T M_c0 = n_c0^T (R_0 X_p + t_0)    (24)

must be satisfied for any point X_p on the plane P, the predicted homography H̃ can be calculated by

H̃ = ΔR + (Δt n_c0^T) / (n_c0^T (R_0 M_w + t_0)).    (25)

For the homography which maps a point in the image of the first camera to the second camera, we get

H̃_I ∼ K H̃ K^{-1}.    (26)

To obtain the homography in the second image, i.e. the image of the current camera pose, H̃_I has to be transformed by the initial homography H_init. Finally, the prediction of the homography in the current camera image can be computed by H* = H̃_I H_init.
Since the illumination can change significantly if the feature has not been observed for a long time, the illumination parameters also need to be predicted. If μ_t and μ_0 are the mean intensity values of the current and the initial patch, and σ_t and σ_0 are their standard deviations respectively, then the contrast λ and brightness δ can be predicted by

λ = σ_t / σ_0,    (27)
δ = μ_t − (σ_t / σ_0) μ_0.    (28)

For a robust alignment and a stable convergence behavior with the predicted warp, only the translation is estimated first, and then the full parameter set is used for the alignment of the template patch.
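A sketch of the prediction step of equations (25)–(28) follows; it assumes the pose convention of Section 5 and flat intensity arrays for the patches, and the warp prediction H* = H̃_I · H_init is returned without the subsequent translation-only refinement. Names and signatures are ours.

```python
import numpy as np

def predict_patch_warp(n_w, M_w, K, R0, t0, R1, t1, H_init):
    """Predict the homography of a lost patch in the current image (eqs. 25, 26)."""
    dR = R1 @ R0.T
    dt = t1 - R1 @ R0.T @ t0
    n_c0 = R0 @ n_w
    denom = float(n_c0 @ (R0 @ M_w + t0))
    H_tilde = dR + np.outer(dt, n_c0) / denom        # eq. (25)
    H_tilde_I = K @ H_tilde @ np.linalg.inv(K)       # eq. (26), defined up to scale
    return H_tilde_I @ H_init                        # H* in the current camera image

def predict_illumination(current_patch, initial_patch):
    """Predict contrast lambda and brightness delta (eqs. 27, 28)."""
    lam = np.std(current_patch) / np.std(initial_patch)
    delta = np.mean(current_patch) - lam * np.mean(initial_patch)
    return lam, delta
```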
7 Feature Management
If many features are located inside the current frame, we only want to track a subset of all available features in order to reduce the computational cost. To know which features are worth tracking from a given camera position, a tracking probability is maintained for every feature, which gives information about the ability to track that feature at a certain camera position. This tracking probability is updated after every tracking step with the information whether the tracking was successful or not. The probability distribution is approximated with a mixture of Gaussians of constant size. Only a certain number of features with the highest tracking probability at the current camera position is selected. For features which could not be tracked at a certain position in previous frames, e.g. because the feature was occluded, the tracking probability is rather low. Therefore, tracking feature points which are unlikely to be tracked successfully can be avoided. More details can be found in [5]. With such a feature management, after a certain amount of training every feature in the map knows the probability of being tracked successfully at a certain camera position.
8 Experimental Evaluation
The template mask generation is evaluated by analyzing a single feature which does not lie entirely on a planar surface. In Figure 1(a), such a feature, located on the corner of a well-textured box, can be seen. The current intensities of the extracted patch, its incremental mean, the variance and the generated mask are shown in the other images of Figure 1. After moving the camera around for a while, the variance of areas which are not part of the patch feature's plane increases significantly, and the patch mask clearly represents only the planar part of the template.
Fig. 1. Illustration of the mask generation. In (a) the scenario can be seen, (b) shows the currently extracted patch feature, (c) the incremental mean, (d) the variance and (e) the mask of the patch.
Fig. 2. In (a) a frame of the image sequence is shown together with the tracked template patches. In (b) the reconstructed 3D patch and their surface normal direction are drawn. The blue rectangle illustrates the corresponding plane of the feature.
For the evaluation, the presented camera tracking system is tested with an image sequence of a desktop scene. The camera tracking is initialized with a reference image and a randomized trees keypoint classification [12]. The tracking is tested both with updating the template and with keeping the template as it was when first captured. With both methods the camera can be tracked throughout the whole scene. To evaluate the benefits of the template update, the mean sum of squared differences per pixel after a successful alignment is analyzed. When no update is performed, the average SSD of the residual is 73.43 per pixel; with template updating we measured a smaller average residual of 49.01 per pixel. The updated template is therefore a better representation of the planar textured region. With very strong camera noise the update strategy might be especially beneficial. When updating the template, the tracking success rate is raised from 79.84% to 86.04%. To analyze the quality of the surface normal reconstruction, the 3D patches together with their normals are rendered. A frame of the image sequence and the reconstructed planar patches together with their surface normal directions are shown in Figure 2. When a feature is first observed, its normal vector points towards the camera position, but as the camera is moved around the scene, the estimate of the normal vector converges towards the true orientation of the surface normal of the feature. At some points where only a poor alignment of the template patch is possible, i.e. on edges or object borders, the normal cannot be estimated correctly.
9 Conclusion
We have presented a camera tracking system with the main focus on the acquisition of planar template features. By updating the patch after a successful alignment, the average error of image intensities between the template and the current image could be reduced, while the average rate of tracking success
could be raised. The reconstruction of the surface orientation of a patch feature helped to make a more precise prediction of the template in an image. If a lost feature reappears in an image under a different camera viewing direction, it is possible to continue tracking that feature. Future work will consist of separating the acquisition and the tracking of features into different threads. Thereby the tracking could be performed at a higher frame rate, and the computationally more costly reconstruction at a lower frame rate.
References
1. Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007), Nara, Japan (2007)
2. Bleser, G., Wuest, H., Stricker, D.: Online camera pose estimation in partially known and dynamic scenes. In: ISMAR, pp. 56–65 (2006)
3. Molton, N.D., Davison, A.J., Reid, I.D.: Locally planar patch features for real-time structure from motion. In: Proc. British Machine Vision Conference, BMVC (2004)
4. Favaro, P., Jin, H., Soatto, S.: A semi-direct approach to structure from motion. The Visual Computer 19(2), 1–18 (2003)
5. Wuest, H., Pagani, A., Stricker, D.: Feature management for efficient camera tracking. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part I. LNCS, vol. 4843, pp. 769–778. Springer, Heidelberg (2007)
6. Shi, J., Tomasi, C.: Good features to track. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 1994), pp. 593–600 (1994)
7. Buenaposada, J.M., Baumela, L.: Real-time tracking and estimation of plane pose. In: ICPR, vol. 2, pp. 697–700 (2002)
8. Jin, H., Favaro, P., Soatto, S.: Real-time feature tracking and outlier rejection with changes in illumination. In: IEEE Intl. Conf. on Computer Vision, pp. 684–689 (2001)
9. Zinßer, T., Gräßl, C., Niemann, H.: Efficient feature tracking for long video sequences. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175, pp. 326–333. Springer, Heidelberg (2004)
10. Matthews, I., Ishikawa, T., Baker, S.: The template update problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 810–815 (2004)
11. Ma, Y., Soatto, S., Kosecka, J., Sastry, S.S.: An Invitation to 3-D Vision: From Images to Models. Springer, Heidelberg (2003)
12. Lepetit, V., Lagger, P., Fua, P.: Randomized trees for real-time keypoint recognition. In: Conference on Computer Vision and Pattern Recognition, San Diego, CA (2005)
Level Set Segmentation of Cellular Images Based on Topological Dependence Weimiao Yu1 , Hwee Kuan Lee1 , Srivats Hariharan2, Wenyu Bu2 , and Sohail Ahmed2 1
Bioinformatics Institute, #07-01, Matrix, 30 Biopolis Street, Singapore 138671 2 Institute of Medical Biology, #06-06, Immunos, 8A Biomedical Grove, Singapore 138648
Abstract. Segmentation of cellular images presents a challenging task for computer vision, especially when cells of irregular shapes clump together. Level set methods can segment cells with irregular shapes when the signal-to-noise ratio is low; however, they cannot effectively segment cells that clump together. We perform topological analysis on the zero level sets to enable effective segmentation of clumped cells. Geometrical shapes and intensities are important information for the segmentation of cells. We assimilate both in our approach and hence are able to gain from the advantages of level sets while circumventing their shortcoming. Validation on a data set of 4916 neural cells shows that our method is 93.3 ± 0.6% accurate.
1 Introduction and Background
Biological science is in the midst of remarkable growth. Accompanying this growth is the transformation of biology from qualitative observations into a quantitative science. This transformation is driving the development of bio-imaging informatics. Computer vision techniques in bio-imaging informatics have already made significant impacts in many studies [1,2]. Cellular microscopy is an important aspect of bio-imaging informatics. It has its unique traits and brings new challenges to the field of computer vision. Advances in digital microscopy and robotic techniques in cell culture have enabled thousands of cellular images to be captured through High Throughput Screening and High Content Screening. Manual measurement and analysis of those images is subjective, labor intensive and inaccurate. In this paper, we develop an efficient algorithm for the segmentation of cells in a highly cluttered environment, which is a ubiquitous problem in the analysis of cellular images. Accurate segmentation of the cellular images is vital to obtain quantitative information on a cell-by-cell basis. Cellular images are usually captured by multichannel fluorescent microscopes, in which one channel detects the nuclei. Since nuclei contain important information, they generally serve as references for cellular image segmentation. During the past 15 years, many efforts have been made towards automatic segmentation of nuclei from fluorescent cellular images, such as
simple thresholding [3], the watershed algorithm [4,5], boundary-based segmentation [6], and a flexible contour model for the segmentation of overlapping and closely packed nuclei [7]. Other related work on the automatic analysis of cellular images can be found in [8,9]. Deformable models, also known as active contours, are popular and powerful tools for cell segmentation tasks. Among the active contour models, the level set formalism has superior properties, such as ease of implementation, region-based formulation, robustness to noise and absence of self-intersection. Two concepts of the level set approach were discussed in Osher's original paper [10]. First, a level set function in a higher dimensional space is defined to represent the regions, which provides a non-parameterized model for segmentation. Second, the curves are evolved according to their mean curvature. Thereafter, D. Mumford and J. Shah proposed their variational formulation to optimize the segmentation of piecewise smooth images [12]. Chan and Vese then enhanced the level set approach for region-based image segmentation [13]. Comprehensive reviews of the level set approach for image processing are available in [14] and [15]. One long-claimed merit of level set methods is their ability to automatically handle topological changes. However, this merit becomes a liability in many cellular image segmentation tasks, because a non-dividing cell can contain only one nucleus. In a highly cluttered image, as shown in Fig. 1, level set segmentation of the cells (green channel) will result in many segments with multiple nuclei. The contours of the snake model and of geodesic active contour models can in principle generate one cell segment per nucleus, but they need to be parameterized and the node points may not be uniformly distributed along the length of the contours; thus they cannot capture the subtle details of irregular cell outlines. In this work, we prefer the level set formulation since it is non-parameterized. We develop a method to enforce the condition that one cell segment contains only one nucleus. The watershed approach was first proposed in [16] and has been widely applied to cell segmentation. The watershed approach was combined with the level set formulation to segment cellular images and preserve the known topology based on detected seeds in [17]. A similar seed-based segmentation approach in [18] uses one level set function for each individual cell to prevent the merging of different cells. The simple point concept is applied in [19] to prevent the merging of cell segments during the evolution of the level set function. However, a well-known problem of the watershed approach is over-segmentation. Other methods have been proposed to overcome this problem, such as rule-based merging [21] and marker-controlled watershed correction based on Voronoi diagrams [20]. Generally, cellular images are acquired by multi-channel fluorescent microscopes and the nuclei are captured by one of the channels. Fig. 1 shows a few examples of cellular images captured by two-channel fluorescent microscopy. As shown in Fig. 1, each cell contains one nucleus, and cells of irregular shapes are crowded and touch each other. In this work, we first segment the nuclei, since they lie inside the cell membrane and are generally well separated. The found nuclei serve as seeds for cell segmentation. We present a novel cell segmentation approach based on the concept of topological dependence, which will be introduced shortly.
Fig. 1. Captured cellular images with detected nuclei. Detected seeds are outlined in blue and the geometric centers are marked by red dots.
be introduced shortly. In our approach, the level set curves propagate faster in regions of brighter image intensity, so the dynamics of the level set curves incorporate essential information for cell segmentation. The utilization of such dynamics in our approach is, to our knowledge, presented for the first time in the literature. The watershed lines are evolved dynamically based on the topological dependence at each time step in order to segment crowded cells with irregular morphology. The remainder of the paper is structured as follows. Section 2 provides the definition of topological dependence. The level set formulation for two-phase segmentation is presented in Section 3. The dynamic watershed transformation and the preservation of topological dependence are discussed in Section 4. Section 5 presents our experimental results, and Section 6 concludes the paper.
2 Topological Dependence
In this paper, we use the images of two channels to illustrate our approach. Generalization of our approach to the images of more than two channels is trivial as long as we use one channel as reference. The nucleus is stained in blue and cell cytoplasm is stained in green. We define the images on a finite subset in the two dimensional Euclidean space Ω ⊂ R2 . f n (x, y) : Ω → R and f c (x, y) : Ω → R represent the intensities of nucleus and cytoplasm at (x, y) respectively. We call f n (x, y) and f c (x, y) Nucleus Image and Cell Image. The superscripts ‘n’ and ‘c’ represent ‘nucleus’ and ‘cell’. Both of the functions are normalized to [0, 1].
The segments of the nucleus and cell images form connected regions in Ω. Due to space limitations, we give only a brief definition of a connected region.

Connected region: A set of points π ⊆ Ω forms a connected region if for any two different points (x₁, y₁) ∈ π and (x₂, y₂) ∈ π there exists a path Γ connecting (x₁, y₁) and (x₂, y₂) such that Γ ⊆ π.

The segmentation of the nuclei is relatively easy, since they are better separated. After segmenting the nuclei we obtain a set of connected regions, i.e. the nucleus segments, denoted by ω_i^n, where i = 1, 2, ..., L. The topology of the nuclei is then determined. Each cell segment should contain exactly one nucleus segment. To describe this constraint in a rigorous mathematical framework, we introduce the concept of topological dependence.

Topological dependence: a set of connected regions π_i, i = 1, 2, ..., L is said to be topologically dependent with another set of connected regions θ_i, i = 1, 2, ..., L if

    θ_i ⊆ π_i,   i = 1, 2, ..., L.   (1)
Note that our definition of topological dependence is different from homeomorphism [11]; topological dependence is a more relaxed notion. Due to space limitations, we do not discuss the details here.
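As a concrete illustration of the definition above, the following Python sketch (not part of the original paper; function and variable names are ours) checks whether one set of labeled connected regions is topologically dependent with another, using connected-component labeling from SciPy.

```python
# Illustrative sketch (not the authors' code): check the topological-dependence
# condition theta_i ⊆ pi_i between nucleus segments and candidate cell segments.
import numpy as np
from scipy import ndimage


def is_topologically_dependent(nucleus_labels, cell_labels):
    """Return True if every nucleus segment i lies entirely inside the
    cell segment that carries the same label i (Eq. 1)."""
    for i in np.unique(nucleus_labels):
        if i == 0:                 # 0 is background
            continue
        covered = cell_labels[nucleus_labels == i]
        if not np.all(covered == i):
            return False
    return True


# toy example: label two nucleus blobs, then grow them into "cells"
nuclei = np.zeros((64, 64), dtype=int)
nuclei[10:15, 10:15] = 1
nuclei[40:45, 40:45] = 1
nucleus_labels, n = ndimage.label(nuclei)            # connected regions omega_i^n
cell_labels = ndimage.grey_dilation(nucleus_labels, size=(7, 7))
print(is_topologically_dependent(nucleus_labels, cell_labels))   # True
```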
3 Level Set Segmentation
The Mumford–Shah model in its level set formulation is applied to segment the nucleus images and the cell images. It is given by [12]:

E(φ, c₁, c₂) = μ·length{φ = 0} + ν·area{φ ≥ 0} + λ₁ ∫_{φ≥0} |u(x, y) − c₁(φ)|² dxdy + λ₂ ∫_{φ<0} |u(x, y) − c₂(φ)|² dxdy   (2)

where u(x, y) is the image intensity; μ, ν, λ₁ and λ₂ are parameters that weight the contour length, the area, the foreground term and the background term, respectively; and c₁ and c₂ are constants determined during the optimization as

c₁ = ∫_{φ≥0} u(x, y) dxdy / ∫_{φ≥0} dxdy   and   c₂ = ∫_{φ<0} u(x, y) dxdy / ∫_{φ<0} dxdy.

In our work, the length and area
parameters μ and ν are set to zero in order to allow irregular contours and varying sizes of nuclei and cells. In general, these parameters may be set accordingly when a priori knowledge about length and area is available. The optimal solution of the Mumford–Shah model is obtained from its Euler–Lagrange equation through the iterative update

φ^{t+Δt} = φ^t + Δt · δ_ε · [−λ₁(u(x, y) − c₁(φ^t))² + λ₂(u(x, y) − c₂(φ^t))²]   (3)
Here t is the artificial time used for the evolution of the level set function and Δt is the time step; δ_ε is the regularized delta function defined in [13]. The selection of the parameters is important for achieving a good segmentation; we discuss the selection of λ₁, λ₂ and Δt in more detail in Section 5. In order to segment the nuclei, we initialize the level set function for the nucleus image as

φ^{n,t=0}(x, y) = f^n(x, y) − ∫_Ω f^n(x, y) dxdy / ∫_Ω dxdy.   (4)

Substituting u(x, y) with the Nucleus Image f^n(x, y), we evolve the level set function φ^{n,t} using Eq. (3). After the iterations have converged, the set of points {(x, y) ∈ Ω | φ^{n,t}(x, y) ≥ 0} forms L connected regions that define the nucleus segments ω_i^n, where i = 1, 2, ..., L and L is the number of detected nuclei. The pixels belonging to ω_i^n are labeled with the integer i; all remaining pixels are labeled 0 to represent the background. The detected nuclei will serve as seeds for the cell segmentation.

After the nuclei are segmented, the information about the nuclei has to be included in the cell segmentation. The level set function for the cell image segmentation is defined based on ω_i^n:

φ^{c,t=0} = f̂^c(x, y) − 1   (5)

where

f̂^c(x, y) = 1 if (x, y) ∈ ∪_{i=1}^{L} ω_i^n, and f̂^c(x, y) = f^c(x, y) otherwise.   (6)
Since f^c ∈ [0, 1], we also have f̂^c(x, y) ∈ [0, 1]. Substituting u(x, y) with f̂^c(x, y), the level set function for the cell segmentation φ^{c,t} is evolved according to Eq. (3). Unlike the work in [18], where each individual cell has its own level set function, we use only one level set function to segment all cells in order to achieve better computational efficiency. In order to exploit the image intensity variation for cell segmentation, we initialize the level set function according to Eq. (5) instead of the traditional signed distance function. Such an initialization also ensures that the zero level sets start from the nuclei and evolve outwards at a speed related to the image intensity; brighter regions are therefore segmented as foreground earlier.
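The following sketch illustrates, under our own simplifying assumptions, the two-phase update of Eq. (3) with μ = ν = 0 and the seeded initializations of Eqs. (4)–(6). It is not the authors' Matlab implementation; in particular, the regularized delta function below is one common choice and the iteration count is arbitrary.

```python
# Minimal numpy sketch of the update in Eq. (3) and the initializations of
# Eqs. (4)-(6). Parameter defaults follow the values reported in Section 5.
import numpy as np


def delta_eps(phi, eps=1.0):
    # regularised Dirac delta (Chan-Vese style); the exact form is an assumption
    return eps / (np.pi * (eps**2 + phi**2))


def evolve(u, phi, lam1=1.0, lam2=50.0, dt=10.0, iters=200):
    for _ in range(iters):
        fg, bg = phi >= 0, phi < 0
        c1 = u[fg].mean() if fg.any() else 0.0
        c2 = u[bg].mean() if bg.any() else 0.0
        phi = phi + dt * delta_eps(phi) * (-lam1 * (u - c1) ** 2
                                           + lam2 * (u - c2) ** 2)
    return phi


def init_nucleus_phi(f_n):
    # Eq. (4): subtract the mean intensity of the nucleus image
    return f_n - f_n.mean()


def init_cell_phi(f_c, nucleus_mask):
    # Eqs. (5)-(6): clamp nucleus pixels to 1 so the zero level sets start there
    f_hat = np.where(nucleus_mask, 1.0, f_c)
    return f_hat - 1.0
```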
4 Preservation of Topological Dependence
The evolution of the level set function alone cannot ensure topological dependence between the cell segments and the nucleus segments. Dynamic watershed lines are therefore applied to preserve this topological dependence. Let us denote the cell segments at time t by ω_i^{c,t}. At t = 0, ω_i^{c,t=0} forms L connected regions, and by the definition of φ^{c,t=0} we know that ω_i^{c,t=0} = ω_i^n, i = 1, 2, ..., L.
This indicates that ω_i^{c,t=0} is topologically dependent with ω_i^n. Under the condition that ω_i^{c,t} is topologically dependent with ω_i^n at some time t, we may calculate the watershed lines W^t by

W^t = {(x, y) ∈ Ω : d_min[(x, y) | ω_i^{c,t}] = d_min[(x, y) | ω_j^{c,t}] for some i ≠ j, i, j = 1, 2, ..., L}   (7)

where

d_min[(x, y) | ω_i^{c,t}] = min_{(x', y') ∈ ω_i^{c,t}} √((x − x')² + (y − y')²).   (8)
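A possible discrete implementation of Eqs. (7)–(8) uses a Euclidean distance transform that also returns nearest-neighbor indices; the sketch below (our own illustration, not the paper's code) marks a background pixel as part of W^t when adjacent pixels are closest to different cell segments.

```python
# Sketch of Eqs. (7)-(8): the watershed line W^t separates pixels whose
# nearest cell segments differ.
import numpy as np
from scipy import ndimage


def watershed_lines(cell_labels):
    """cell_labels: integer image, 0 = background, i = segment omega_i^{c,t}."""
    # for every pixel, find the location of the nearest labelled pixel
    _, (iy, ix) = ndimage.distance_transform_edt(cell_labels == 0,
                                                 return_indices=True)
    nearest = cell_labels[iy, ix]          # label of the nearest segment
    # a pixel lies on W^t if a 4-neighbour is closest to a different segment
    w = np.zeros_like(cell_labels, dtype=bool)
    w[:-1, :] |= nearest[:-1, :] != nearest[1:, :]
    w[:, :-1] |= nearest[:, :-1] != nearest[:, 1:]
    return w & (cell_labels == 0)
```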
The watershed line obtained at time t is used to preserve the topological dependence between ω_i^{c,t} and ω_i^n at time t + Δt. Preserving topological dependence and recovering the correct segmentation consist of a series of re-labeling steps. First, the connected regions that do not contain any nucleus segment are removed at each iteration, as shown by the gray region in Fig. 2(b). If the remaining connected regions are topologically dependent with ω_i^n, these regions take the labels of ω_i^n and we denote them as the cell segments ω_i^{c,t+Δt}. If the topological dependence is violated, we relabel the connected regions as "unknown" and the background as "0". Then, we obtain the intersection of the
Fig. 2. Illustration of the dynamic watershed lines and the preservation of topological dependence. Nucleus segments are black. Labels of different regions at different times are indicated by random colors. The dotted line represents the watershed line W^t. The level set function evolves from t in (a) to t + Δt in (b). Re-labeling is then carried out in (c) and (d) to eliminate the residual regions and preserve the topological dependence.
connected regions and W^t, which forms a set of common boundaries {s_1^{t+Δt}, s_2^{t+Δt}, ...}. Two such common boundaries are illustrated by s_1^{t+Δt} and s_2^{t+Δt} in Fig. 2(b). Treating the regions separated by the common boundaries as different connected regions, we obtain a new set of connected regions β_k^{t+Δt}, k = 1, 2, ..., K (note that K ≥ L). If ω_p^n ⊆ β_q^{t+Δt} for some p and q, we label β_q^{t+Δt} with the label of ω_p^n. Not all β_k^{t+Δt} can be re-labeled according to this condition; the unlabeled regions are called residual regions. An iterative procedure finds the correct labels of the residual regions, as sketched in the code below.

Re-labeling of the residual regions: Any residual region must be created by some common boundaries. One side of each such boundary is adjacent to the given residual region, and the other side is adjacent to some other region that may or may not have been successfully re-labeled previously. An unlabeled residual region takes the label of the adjacent region with which it shares the longest common boundary, denoted s_{l,max}^{t+Δt}. If all regions adjacent to the given residual region are unknown, the residual region cannot be re-labeled in the current iteration. This procedure is iterated until all unknown residual regions are re-labeled.

We illustrate the preservation of topological dependence in Fig. 2, in which the nucleus segments are indicated in black and the cell segments in red, blue and green. The dotted line represents the watershed line. Fig. 2(a) shows three cell segments that are topologically dependent with the nucleus segments at time t. Fig. 2(b) shows that when the level set function evolves to time t + Δt, the connected regions of the zero level set are no longer topologically dependent with the nucleus segments. The gray region that does not contain any nucleus is removed. In Fig. 2(c), the remaining connected regions are separated using the watershed line W^t calculated at the previous time step t. This produces seven connected regions β_1^{t+Δt}, β_2^{t+Δt}, ..., β_7^{t+Δt}. β_1^{t+Δt}, β_2^{t+Δt} and β_3^{t+Δt} contain nucleus segments and are re-labeled according to the corresponding nucleus segments ω_i^n. β_4^{t+Δt}, β_5^{t+Δt}, β_6^{t+Δt} and β_7^{t+Δt} are re-labeled using the procedure described above. Note that β_4^{t+Δt} is re-labeled with the same integer as β_1^{t+Δt} because the common boundary s_1^{t+Δt} is longer than s_2^{t+Δt}. After the topological dependence is preserved, the watershed lines are updated according to the new cell segments ω_1^{c,t+Δt}, ω_2^{c,t+Δt} and ω_3^{c,t+Δt} based on Eq. (7).
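The sketch below gives a simplified version of this re-labeling step, for illustration only; it omits some of the bookkeeping described above (for example, pieces containing more than one nucleus are left unlabeled), and all names are ours.

```python
# Simplified re-labelling: drop regions without a nucleus, cut the rest along
# the previous watershed line W^t, then let residual pieces borrow the label
# of the neighbour with which they share the longest boundary.
import numpy as np
from scipy import ndimage


def relabel(foreground, nucleus_labels, w_lines):
    # 1. drop connected regions that contain no nucleus
    comp, n = ndimage.label(foreground)
    keep = np.zeros_like(foreground, dtype=bool)
    for r in range(1, n + 1):
        if nucleus_labels[comp == r].max() > 0:
            keep |= comp == r

    # 2. cut the remaining foreground along W^t and label the pieces beta_k
    pieces, k = ndimage.label(keep & ~w_lines)
    out = np.zeros_like(nucleus_labels)
    for b in range(1, k + 1):
        inside = nucleus_labels[pieces == b]
        labels = np.unique(inside[inside > 0])
        if labels.size == 1:               # piece contains exactly one nucleus
            out[pieces == b] = labels[0]

    # 3. residual pieces: take the dominant label along their boundary ring
    for b in range(1, k + 1):
        if out[pieces == b].max() > 0:
            continue
        ring = ndimage.binary_dilation(pieces == b) & (pieces != b)
        votes = out[ring]
        votes = votes[votes > 0]
        if votes.size:
            out[pieces == b] = np.bincount(votes).argmax()
    return out
```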
5 Experimental Results
We applied our segmentation approach in a neural cell study whose aim is to automatically and quantitatively measure neurite length. Accurate segmentation of the neural cells is a prerequisite for measuring the length of the neurites and extracting quantitative information on a cell-by-cell basis. As shown in Fig. 1, neurites are thin, long structures that grow radially outwards from the cells. More than 6000 images were acquired from fixed neural cells with DAPI
stain for the nuclei and FITC stain for the cell cytoplasm. A two-channel Zeiss Axiovert 200M wide-field fluorescent microscope with a motorized XY stage was used to capture the cellular images. The original images were captured at 20X magnification with 1366×1020 pixels at 12-bit accuracy, using a CoolSnap CCD camera; the resolution is 0.31 μm/pixel.

It is important to select proper parameters for cell segmentation, such as Δt in Eq. (3) and λ₁, λ₂ in Eq. (2). We choose a large time step Δt = 10 as a compromise between accuracy and computation time. To verify that Δt = 10 does not introduce significant numerical errors, we performed segmentations on eight randomly selected images with three different time steps, Δt = 1, 5 and 10, and used the Adjusted Rand Index [24] to compare the segmentations of Δt = 1 and Δt = 5 with the segmentation of Δt = 10. The Adjusted Rand Index is 0.9944 ± 0.0025 for Δt = 1 vs. Δt = 10 and 0.9952 ± 0.0024 for Δt = 5 vs. Δt = 10, showing that the large time step does not introduce significant numerical errors. Regarding the regularization parameters, we set λ₁ = 1 and λ₂ = 50 for the cell segmentation so as to preserve the continuity of weakly connected neurites.

We choose the image in Fig. 1(a) and show ω_i^{c,t} at different times t in Fig. 3. The nucleus segments ω_i^n are illustrated by the black regions and their geometric centers are indicated by red dots; different cell segments are shown in random colors, and the watershed lines are drawn as solid black lines. The cell segments ω_i^{c,t} start from the nucleus segments ω_i^n at t = 0 and evolve outwards at a speed related to the variation of image intensity. The watershed lines W^t also evolve dynamically with time t based on the constraint of topological dependence. The final segmentation results for the cellular images in Fig. 1 are shown in Fig. 4. Although the cells are irregular and clumpy, our approach separates them successfully. 89% of the 6000 cellular images can be segmented by our approach within one minute on a desktop with a 2.0 GHz CPU and 1 GB of RAM.

To validate our approach, we compared it with CellProfiler [25] and MetaMorph. CellProfiler is a popular free cellular image analysis package developed by the Broad Institute of MIT and Harvard; MetaMorph is commercial software for cellular image analysis developed by MDS Inc. The parameters for CellProfiler were suggested by the software developers and the parameters for MetaMorph were tuned by a service engineer from MDS. 100 images containing a total of 4916 cells were randomly selected from our database and segmented by CellProfiler, MetaMorph and our algorithm, yielding 300 segmented images. These segmentations were divided into 15 sets of 20 images each and randomly shuffled. Two reviewers graded the segmented images without knowing which algorithm had been applied and marked how many cells were segmented incorrectly. After this blind evaluation, we counted the number of incorrectly segmented cells for each approach; the results are shown in Table 1. Our approach achieved the best performance, about 2.5% better than CellProfiler and much better than MetaMorph. CellProfiler tends to over-segment the cells when the shapes of the nuclei and cells are irregular, while MetaMorph seems unable to detect the fine structures of the neurites.
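The time-step check described above can be reproduced with scikit-learn's Adjusted Rand Index, as in this short sketch (ours, for illustration; the toy arrays stand in for real segmentation label images).

```python
# Compare two segmentations (integer label images) with the Adjusted Rand Index.
import numpy as np
from sklearn.metrics import adjusted_rand_score


def compare_segmentations(labels_a, labels_b):
    """labels_a, labels_b: integer label images of the same shape."""
    return adjusted_rand_score(labels_a.ravel(), labels_b.ravel())


# toy example with two nearly identical labelings
a = np.zeros((32, 32), dtype=int); a[8:20, 8:20] = 1
b = np.zeros((32, 32), dtype=int); b[8:21, 8:20] = 1
print(compare_segmentations(a, b))   # close to 1 for similar segmentations
```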
(a) t = 0   (b) t = 120   (c) t = 180   (d) t = 200
Fig. 3. Dynamic evolution of the segments and watershed lines at different times t. Nucleus segments are shown in black and outlined in blue, and their geometric centers are marked by red dots. Watershed lines are shown as black lines. Different cell segments are indicated by random colors.
Fig. 4. Segmentation results of our approach. The morphology of the cells in the captured images is complicated: the cells are clumpy and touch each other. Our approach segments them successfully.
Table 1. Comparison of Segmentation Results
Approach        Accuracy
MetaMorph       74.16% ± 1.02%
CellProfiler    90.85% ± 0.56%
Our Approach    93.25% ± 0.57%

6 Conclusion
Cell segmentation is non-trivial and remains a challenging problem in many bio-imaging informatics applications. Many segmentation algorithms cannot properly segment cells that are clumpy and touch each other, especially when the intensity contrast at the boundaries is low and the geometric shapes are irregular. We have proposed a novel segmentation approach for cellular images captured by a two-channel microscope. The proposed approach combines the advantages of the level set and watershed methods in a novel way based on the concept of topological dependence. The utilization of the dynamics of the level set curves in our method is, to our knowledge, presented for the first time in the literature. Another novelty of our method is that the watershed lines evolve dynamically at each time step t based on the topological dependence, which is essential to prevent the merging of cell segments; this constraint also resolves the over-segmentation problem of the watershed approach. We applied our approach to more than 6000 cellular images of neural cells. According to the validation on 100 randomly selected images containing 4916 cells, our segmentation method achieved better performance than CellProfiler and MetaMorph. We use only one level set function to segment all the cells in an image, so our algorithm is more efficient than the work in [18], where each cell is associated with an individual level set function.¹

Segmentation of cells from images captured by multi-channel microscopes is a common and important problem in many bio-imaging applications. Our approach was developed under the assumption that the images are captured by a two-channel microscope, but it can easily be generalized to multi-channel cellular images as long as the nuclei are captured in one of the channels. Although our approach is not suitable for segmenting overlapping cells, many biological assays seed and resuspend cells into a monolayer so that the cells do not overlap. Overlapping cells may nevertheless occur in other applications, and the problem is interesting and worth further investigation.
Acknowledgement. The authors would like to thank Dr. Anne Carpenter for providing the parameters for CellProfiler.

¹ Video demonstrations and Matlab source code are available at http://web.bii.astar.edu.sg/∼yuwm/ISVC2008/.
References

1. Choi, W.W.L., Lewis, M.M., Lawson, D., et al.: Angiogenic and lymphangiogenic microvessel density in breast carcinoma: correlation with clinicopathologic parameters and VEGF-family gene expression. Modern Pathology 18, 143–152 (2005)
2. Bakal, C., Aach, J., Church, C., Perrimon, N.: Quantitative morphological signatures define local signaling networks regulating cell morphology. Science 316, 1753–1756 (2007)
3. Lerner, B., Clocksin, W., Dhanjal, S., Hulten, M., Christopher, M.B.: Automatic signal classification in fluorescence in-situ hybridisation images. Bioimaging 43, 87–93 (2001)
4. Lockett, S., Herman, B.: Automatic detection of clustered fluorescence-stained nuclei by digital image-based cytometry. Cytometry Part A 17, 1–12 (1994)
5. Malpica, N., De Solorzano, C.O., Vaquero, J.J., Santos, A., Vallcorba, I., Garcia-Sagredo, J.M., Del Pozo, F.: Applying watershed algorithms to the segmentation of clustered nuclei. Cytometry Part A 28, 289–297 (1997)
6. De Solorzano, C., Malladi, R., Lelievre, S., Lockett, S.: Segmentation of nuclei and cells using membrane related protein markers. Journal of Microscopy 201, 404–415 (2001)
7. Clocksin, W.: Automatic segmentation of overlapping nuclei with high background variation using robust estimation and flexible contour model. In: Proceedings of the 12th International Conference on Image Analysis and Processing, pp. 682–687 (2003)
8. De Solorzano, C., Santos, A., Vallcorba, I., Garcia-Sagredo, J., Del Pozo, F.: Automated FISH spot counting in interphase nuclei: statistical validation and data correction. Cytometry Part A 31, 93–99 (1998)
9. Clocksin, W., Lerner, B.: Automatic analysis of fluorescence in-situ hybridisation images. In: Electronic Proceedings of the Eleventh British Machine Vision Conference (2000)
10. Osher, S., Sethian, J.A.: Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics 79 (1988)
11. Engelking, R.: General Topology. Heldermann (1988)
12. Mumford, D., Shah, J.: Optimal approximations by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics 42, 577–685 (1989)
13. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Transactions on Image Processing 10, 266–277 (2001)
14. Tai, X.C., Lie, K.A., Chan, T.F., Osher, S.: Image Processing Based on Partial Differential Equations. Springer, Heidelberg (2006)
15. Sethian, J.: Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision and Materials Science. Cambridge University Press, Cambridge (1999)
16. Blum, H.: A transformation for extracting new descriptors of shape. In: Wathen-Dunn, W. (ed.) Models for the Perception of Speech and Visual Form. MIT Press, Cambridge (1967)
17. Tai, X.C., Hodneland, E., Weickert, J., Bukoreshtliev, N.V., Lundervold, A., Ger, H.: Level set methods for watershed image segmentation. In: Sgallari, F., Murli, A., Paragios, N. (eds.) SSVM 2007. LNCS, vol. 4485, pp. 178–190. Springer, Heidelberg (2007)
18. Yan, P., Zhou, X., Shah, M., Wong, S.T.C.: Automatic segmentation of high-throughput RNAi fluorescent cellular images. IEEE Transactions on Information Technology in Biomedicine 12(1), 109–117 (2008)
19. Han, X., Xu, C., Prince, J.L.: A topology preserving deformable model using level sets. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 765–770 (2001)
20. Zhou, X., Liu, K.Y., Bradblad, P., Perrimon, N., Wong, S.T.C.: Towards automated cellular image segmentation for RNAi genome-wide screening. In: Duncan, J., Gerig, G. (eds.) Proc. Medical Image Computing and Computer-Assisted Intervention, pp. 885–892. Palm Springs, CA (2005)
21. Wählby, C., Lindblad, J., Vondrus, M., Bengtsson, E., Bjorkesten, L.: Algorithms for cytoplasm segmentation of fluorescent labelled cells. Analytical Cellular Pathology 24(2-3), 101–111 (2002)
22. Xiong, G., Zhou, X., Degterev, A., Ji, L., Wong, S.T.C.: Automated neurite labeling and analysis in fluorescence microscopy images. Cytometry Part A 69A, 495–505 (2006)
23. Bixby, J.L., Pratt, R.S., Lilien, J., Reichardt, L.F.: Neurite outgrowth on muscle cell surfaces involves extracellular matrix receptors as well as Ca2+-dependent and -independent cell adhesion molecules. Proceedings of the National Academy of Sciences 84, 2555–2559 (1987)
24. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218 (1985)
25. Carpenter, A.E., Jones, T.R., Lamprecht, M.R., Clarke, C., Kang, I.H., Friman, O., et al.: CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biology 7 (2006)
A Novel Algorithm for Automatic Brain Structure Segmentation from MRI

Qing He, Kevin Karsch, and Ye Duan

Department of Computer Science, University of Missouri-Columbia, Columbia, MO, USA, 65211
{qhgb2,krkq35}@mizzou.edu, [email protected]
Abstract. This paper proposes an automatic segmentation algorithm that combines clustering and deformable models. First, a k-means clustering is performed based on the image intensity. A hierarchical recognition scheme is then used to recognize the structure to be segmented, and an initial seed is constructed from the recognized region. The seed is then evolved under certain deformable model mechanism. The automatic recognition is based on fuzzy logic techniques. We apply our algorithm for the segmentation of the corpus callosum and the thalamus from brain MRI images. Depending on the specific features of the segmented structures, the most suitable recognition schemes and deformable models are employed. The whole procedure is automatic and the results show that this framework is fast and robust. Keywords: Segmentation, deformable models, clustering, corpus callosum, thalamus.
1 Introduction

Medical image segmentation has become crucial to the practice of medicine, but accurate, fully automatic segmentation of medical images remains an open problem. Two types of segmentation algorithms have appeared in previous research. One is the region-based approach, such as clustering, which assigns membership to pixels according to homogeneity statistics. Since there is no easy way to distinguish boundary pixels from interior pixels of the object, this approach can lead to noisy boundaries and holes in the interior. The other class of methods comprises boundary-based techniques such as snakes [1], which attempt to align an initial deformable boundary with the object boundary by minimizing an energy functional that quantifies the gradient features near the boundary. The main drawback of these methods is their sensitivity to the initial conditions. To avoid being trapped in local minima, most of these algorithms require the model to be initialized near the solution or supervised by high-level guidance, which makes their automation difficult.

Many model-based segmentation methods have attempted to make use of substantial prior knowledge of the anatomical structures, such as shape, orientation, symmetry, and relationship to neighboring structures; several examples can be found in [2-4]. Most of these methods, however, require significant user interaction to incorporate the prior knowledge into the deformable models. Recently McInerney et al. [5]
proposed an impressive deformable-organism framework for medical image analysis that combines deformable models with concepts from artificial life. These organisms are endowed with intelligent sensors and are aware of their behaviors during the segmentation process. A subgroup of boundary-based methods is based on active shape models [6] and active appearance models [7], which incorporate prior information to constrain shape and image appearance by learning their statistical variations. These methods are automatic, but as pointed out in [5], the effectiveness of the model constraints is still dependent on appropriate initialization, and the models may latch onto nearby spurious features. Some examples can be found in [8, 9]. Several approaches [10-15] have integrated region-based and boundary-based techniques into one framework, which can offer greater robustness than either technique alone. Among them, Jones and Metaxas [14] proposed a hybrid approach based on fuzzy affinity clustering: object boundaries are estimated from a fuzzy affinity metric of multiple image features, and these estimates are recursively used to guide the deformable models and are updated by the new model fit. They later extended the work in [14] with an automatic initialization for multiple objects [15].

In this paper, we propose an automatic algorithm for segmenting brain structures such as the corpus callosum (CC) and the thalamus from MRI, based on the integration of deformable models with region-based clustering. Our algorithm is inspired by [14, 15] and follows the same spirit, but the detailed implementation is significantly different. For example, our clustering and deformable models are two sequential procedures, whereas they perform region-based estimation and deformable modeling recursively. Although both works use concepts from fuzzy logic, we use them for recognition while they use them for affinity clustering. Our seed initialization and deformable models are also different from [14, 15]. Our framework can be summarized as follows:
1. A k-means clustering is performed based on the image intensity.
2. A hierarchical recognition scheme is used to recognize the structure to be segmented, and an initial seed is constructed from the recognized region.
3. The seed is evolved under a certain deformable model mechanism.
2 Integration of Deformable Models and Clustering

We perform a k-means clustering using the following distance metric:
D = ||I(x₁, y₁) − I(x₂, y₂)||   (1)
where I is the image intensity function. Pixel locations can also be used in the distance metric together with intensity, but intensity alone seems good enough for our purpose. Other clustering methods can also be used and we choose k-means because it is simple. We use eight clusters for all the CC images and two clusters for thalamus images because it fits our recognition algorithms as well as the object’s grayscale. A series of recognition schemes are then performed to recognize the interested objects from the clustering results. This results in an initial seed region whose boundary is very close to the object to be segmented. Fig. 1 and Fig. 2 show the segmentation procedures of the CC and the thalamus. More details will be discussed later.
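An intensity-only k-means step of this kind could look as follows in Python with scikit-learn (a sketch with our own function name, not the authors' implementation).

```python
# Cluster a 2D grayscale image on intensity alone (Eq. 1 as the distance).
import numpy as np
from sklearn.cluster import KMeans


def intensity_kmeans(image, k):
    """Returns a label image with k intensity clusters."""
    samples = image.reshape(-1, 1).astype(float)   # single feature: intensity
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(samples)
    return km.labels_.reshape(image.shape)


# k = 8 is used for corpus callosum images, k = 2 for thalamus images
labels = intensity_kmeans(np.random.rand(64, 64), k=8)
```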
Fig. 1. (a) part of an original image (b) recognition result (c) initial seed (d) final contour
Fig. 2. (a) ~ (d) are defined the same as in Fig. 1
Different deformable models are used for CC and thalamus. The boundary of the CC on the sagittal MR image is well defined by the gradient, but the existence of the fornix often makes a single active contour fail because it is almost the same brightness as the CC. We develop a fornix detection scheme in the next section which requires explicit representation of the boundary contour, thus the snake is used for deformation. The explicit contour C(p,t) evolves according to the following partial differential equation [24]:
∂C(p, t)/∂t = F·n,   C(p, 0) = C₀(p)   (2)
where n is the unit normal vector of C(p,t), F is the speed function, and t can be considered as the time parameter. The speed function is defined as:
F = (v + εk)·g − γ(∇g · n)   (3)
where k is the curvature, v is a constant, ε , γ are coefficients, and g is a decreasing function of the image gradient. Since the seed contour is very close to the boundary and free of fornix, a simple snake can guide it to the boundary very well. The parameters in (3) need to be carefully selected in order to obtain the best performance of snakes. However, we find that one set of parameters can work for all MRI data acquired under the same condition. In our experiment, v =2, ε =0.2 and γ =0.1. On the contrary, the thalamus on the axial view usually has a simple oval shape, but the boundary is often blurred. We employ a simplified version of the level set equation in [23], which is a variation of the Mumford-Shah functional.
∂φ/∂t = μ∇·(∇φ/|∇φ|) − λ₁(u₀ − c₁)² + λ₂(u₀ − c₂)²   (4)

where φ is the level set function, ∇·(∇φ/|∇φ|) is the mean curvature, u₀ is the original image, c₁ and c₂ are the mean intensities of the two regions, and μ, λ₁, λ₂ are
parameters. This formulation allows the level set to converge using the pixel intensities inside and outside the level set rather than the gradient, which is well suited to the weak thalamus boundary. As with the snake, local minima of the level set can be avoided thanks to the close-to-boundary initialization, and one set of parameter values serves well for data of the same type. We set μ = 0.5, λ₁ = 1, λ₂ = 1 in our experiments.
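For illustration, the sketch below performs one explicit update of a discrete contour under Eqs. (2)–(3) of this section. The edge-stopping function g is assumed here to be 1/(1 + |∇I|²), since the paper only requires a decreasing function of the image gradient; the sign conventions for the normal and curvature, and the small time step, are our own choices.

```python
# One explicit snake update per Eqs. (2)-(3); discretisation details are ours.
import numpy as np
from scipy import ndimage


def snake_step(contour, image, v=2.0, eps=0.2, gamma=0.1, dt=0.1):
    """contour: (N, 2) array of (row, col) points on a closed curve."""
    gy, gx = np.gradient(ndimage.gaussian_filter(image.astype(float), 1.0))
    g = 1.0 / (1.0 + gx**2 + gy**2)            # assumed edge-stopping function
    g_y, g_x = np.gradient(g)

    # tangent, normal and a discrete curvature estimate (up to constants)
    d = np.roll(contour, -1, axis=0) - np.roll(contour, 1, axis=0)
    dd = np.roll(contour, -1, axis=0) - 2 * contour + np.roll(contour, 1, axis=0)
    ds = np.linalg.norm(d, axis=1) + 1e-8
    normal = np.stack([d[:, 1], -d[:, 0]], axis=1) / ds[:, None]
    curv = (d[:, 0] * dd[:, 1] - d[:, 1] * dd[:, 0]) / ds**3

    r = np.clip(np.round(contour[:, 0]).astype(int), 0, image.shape[0] - 1)
    c = np.clip(np.round(contour[:, 1]).astype(int), 0, image.shape[1] - 1)
    grad_g = np.stack([g_y[r, c], g_x[r, c]], axis=1)
    speed = (v + eps * curv) * g[r, c] - gamma * np.sum(grad_g * normal, axis=1)
    return contour + dt * speed[:, None] * normal
```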
3 Automatic Model Initialization

In order to find the initial region for the deformable models from the clustering results, we design a hierarchical recognition scheme that searches for the object of interest from a coarse level to a fine level; an initial seed for the deformable model is then generated from the recognition results. Since the CC and the thalamus differ considerably in intensity and shape, direct and indirect schemes are designed to recognize the CC and the thalamus respectively, but both follow the same pipeline.

3.1 Coarse-Level Recognition

In the coarse-level recognition, the goal is to find the cluster of interest among all clusters generated by k-means. Since the intensity of the CC is higher than most other parts of the image, we can leave out the clusters with relatively low intensity. To account for image noise, which can have higher intensity than the CC, we first select the three clusters with the highest intensities and then select the one with the largest area (number of pixels) as the CC cluster, because we observe that the noise regions are usually very small compared with the CC. Fig. 3(a) shows the selected cluster containing the CC. Since the intensity of the thalamus is not distinguishable from other parts of the image, it is difficult to directly recognize the cluster that contains it. However, we notice that a ventricle (Fig. 2(a)), which has the lowest intensity in the image, always lies on top of each thalamus. We therefore use an indirect recognition scheme that infers the position of each thalamus from the corresponding ventricle. Only two clusters are needed in the k-means clustering, since the intensity of the ventricles is very low and no other dark regions are connected to them; the cluster recognition is thus reduced to finding the cluster with the lower intensity.

3.2 Fine-Level Recognition

The cluster in Fig. 3(a) also includes other parts of the image, which need to be removed. We first split the cluster into sub-clusters with a connected components algorithm [16], so that each sub-cluster is a connected component. The goal at this level is to find the sub-cluster that contains the structure to be segmented. Fuzzy information granulation [17], a powerful tool for multi-criteria decision making (see, e.g., [18]), is applied here. Granulation partitions an object into a collection of granules; the granules are associated with fuzzy attributes, and the attributes have fuzzy values. For each attribute, we define a membership function [19-21] to calculate its fuzzy values. The membership function μ_A(x) lies in the range [0, 1] and indicates the degree to which x belongs to the fuzzy set A. In [17], the intersection
(∧) of two fuzzy sets is defined as the minimum of the two sets. The goal is to find the granule with the highest intersection of the fuzzy values of all attributes.

CC Recognition: In the recognition of the CC sub-cluster, the granules are the sub-clusters, and we measure three fuzzy attributes for each granule: area, length, and width. The length and width are measured on the oriented bounding rectangle of each sub-cluster. These three attributes characterize the distinctive shape of the CC well. Fig. 3(a) shows the bounding rectangles of two sub-clusters; one contains the CC and the other is a noise region. A trapezoidal membership function is constructed for each attribute in the training phase, with its parameters derived from the mean and standard deviation of the attribute. The recognized CC sub-cluster is shown in Fig. 1(b).

Ventricle Recognition: The two ventricles are almost symmetric, so we recognize them together based on this symmetry. The granules are all pairs of sub-clusters, and we construct five attributes for each granule based on five measurements of each single region. Besides the three measurements used above, the orientation and the center of the bounding rectangle are also measured (Fig. 3(b)), because the two ventricles make almost the same angle with the vertical line and their centers lie on roughly the same horizontal line. Except for the center of the bounding rectangle, each of the other four attributes is defined as the ratio of the corresponding measurements of the two regions; ratios greater than 1 are inverted so that they lie in [0, 1]. The difference of the two centers in the vertical direction is normalized by the average length of the two regions, and subtracting this normalized difference from 1 gives the last attribute; any negative values are clipped to 0 so that it also lies in [0, 1]. A linear membership function through the origin with slope 1 is used for all attributes, since they are already in [0, 1]. The intersection of the five fuzzy sets is defined as a weighted mean instead of the minimum value, because the similarities between the two ventricles are not equal across the five measurements: their shape and size can differ more than their orientation and vertical position, so we set higher weights for the angle and the center and lower weights for the other three. The recognized ventricle sub-cluster is shown in Fig. 2(b).
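The sub-cluster scoring for the CC can be sketched as follows. The trapezoid corner placement from the training mean and standard deviation, and the numeric values in the toy example, are our own assumptions for illustration; they are not the paper's trained parameters.

```python
# Fuzzy scoring of granules (sub-clusters) with trapezoidal memberships and
# min-intersection, as in the CC recognition step.
import numpy as np


def trapezoid(x, a, b, c, d):
    """Membership rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)


def cc_score(attrs, stats):
    """attrs: attribute -> value; stats: attribute -> (mean, std) from training."""
    memberships = []
    for name, x in attrs.items():
        m, s = stats[name]
        # assumed corner placement: full membership within +/- 1 std
        memberships.append(trapezoid(x, m - 3 * s, m - s, m + s, m + 3 * s))
    return min(memberships)       # fuzzy intersection over all attributes


stats = {"area": (1500.0, 300.0), "length": (120.0, 20.0), "width": (35.0, 8.0)}
candidates = [{"area": 1480.0, "length": 118.0, "width": 33.0},
              {"area": 90.0, "length": 20.0, "width": 12.0}]
best = max(candidates, key=lambda a: cc_score(a, stats))
```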
Fig. 3. (a) the CC cluster (black) and the oriented bounding rectangles of the CC and a noise region (b) the ventricle cluster (black) and the additional measurements of one ventricle (c) three distinct points on the fornix (A,B,C)
3.3 Seed Initialization

This section describes how to construct an initial seed for the deformable model from the recognized sub-cluster.

CC Initialization: Since an explicit active contour is used for CC segmentation, a boundary contour is extracted from the recognized CC sub-cluster using 2D marching cubes [25], and Laplacian smoothing is applied to the contour. Before contour extraction, we first dilate the region a certain number of steps, chosen according to the image size, and then erode it the same number of steps using standard mathematical morphology in order to fill the holes inside the region. The extracted CC contour may contain the fornix, because in most cases the fornix is grouped into the same connected component as the CC. Inspired by [4], we detect the two points connecting the fornix and the CC (A and B in Fig. 3(c)) and then connect them to remove the fornix, but our detection scheme differs from [4]. Fuzzy information granulation is again used for this task, with the granules being the points on the contour. Similar to the boundary landmark detection in [22], we measure the x and y coordinates and the curvature of each point, which constitute three fuzzy attributes. The coordinates are translated into the coordinate system of the bounding rectangle (Fig. 3(c)) and normalized by the length and width of the rectangle, respectively. There may be many candidates with full membership for A, because the curvature at A is not as distinctive as at B. We therefore design an inference scheme that uses point correlations to infer the next point from the points already detected. B and C are first detected separately based on the three fuzzy attributes, and point A is then inferred once B and C are found, using two additional attributes: one is the arc length between A and C, and the other is the fact that A is always to the left of B, which is represented by a crisp set. The intersection of the five fuzzy sets is used as the score to select point A. If there is no fornix connected to the CC, the score of every point will be very low; we set a minimum threshold (0.2 in our experiment) for the score, and if all scores are below the threshold we conclude that there is no fornix. The final seed contour (Fig. 1(c)) is constructed by connecting A and B. We record the positions of A and B before contour deformation, and any new points generated during the deformation that lie between these two positions are set inactive, so that the contour is not attracted to the fornix again.

Thalamus Initialization: A point inside the thalamus is recognized indirectly from the ventricle. We trace down a certain distance from the center of each ventricle to locate a seed point inside each thalamus. The distance can be equal to either the length or the height of the ventricle; either choice usually places the point inside the thalamus. We then perform k-means again with eight clusters, and the cluster containing the seed point is taken as the thalamus cluster. Because the intensity of the thalamus is not distinct from the intensities of its neighboring structures, the resulting cluster may leak out of the thalamus boundary. To overcome this, we shrink the cluster using a sequence of morphology iterations interlaced with connected components, with the seed point as the root node for the connected components algorithm at each iteration. Thus, during erosion, the cluster converges to the seed point while parts of the cluster that are not part of the thalamus are removed.
In our implementation, the number of morphology iterations is based on the size of the image. To complete the seed, we dilate the cluster
the same number of steps to grow the seed back to a larger percentage of its original size (Fig. 2(c)).
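A minimal sketch of this erode-then-dilate seed construction, assuming a 2D boolean cluster mask and a seed point given as (row, column), is shown below; the number of iterations is a placeholder and the function names are ours.

```python
# Shrink the thalamus cluster towards the seed point, then grow it back.
import numpy as np
from scipy import ndimage


def seed_component(mask, seed):
    labels, _ = ndimage.label(mask)
    return labels == labels[seed] if labels[seed] > 0 else np.zeros_like(mask)


def build_thalamus_seed(cluster_mask, seed, n_iter=5):
    region = seed_component(cluster_mask, seed)
    for _ in range(n_iter):                 # erosion interlaced with components
        region = ndimage.binary_erosion(region)
        region = seed_component(region, seed)
    for _ in range(n_iter):                 # dilate back towards original size
        region = ndimage.binary_dilation(region)
    return region
```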
4 Results and Discussion

We apply our algorithms to different sagittal MR images for CC segmentation and to axial images for thalamus segmentation. Fig. 4 shows some final results. The image intensity varies considerably among these images, but this does not affect the results, since our algorithm is independent of image brightness. We use the measurements in [4] to evaluate our segmentation results quantitatively. We denote the correct segmentation by C_true, our segmentation by C_seg, and by |·| the area enclosed by a result. The following measurements are calculated (a small sketch for computing them from binary masks is given below):

1) False negative fraction (FNF), the fraction of the structure included in the true segmentation but missed by our method:
   FNF = |C_true − C_seg| / |C_true|
2) False positive fraction (FPF), the amount of structure falsely identified by our method as a fraction of the true segmentation:
   FPF = |C_seg − C_true| / |C_true|
3) True positive fraction (TPF), the fraction of the structure in the true segmentation that is overlapped by our method:
   TPF = |C_seg ∩ C_true| / |C_true|
4) Dice similarity:
   Dice = 2 |C_seg ∩ C_true| / (|C_true| + |C_seg|)
5) Overlap coefficient:
   overlap = |C_seg ∩ C_true| / |C_true ∪ C_seg|

The last two measurements range from 0 to 1, with 1 indicating perfect agreement between C_true and C_seg. We use manual segmentations by a trained expert as the ground truth (C_true) and compare our results (C_seg) with this ground truth. The experiment is performed on 16 CC images and 5 thalamus images (10 thalami), and the mean and standard deviation of each measurement for the CC and the thalamus are listed in Table 1. The results on the CC are comparable with those in [4], but our algorithm does not require the careful user initialization needed in [4]. Overall, the algorithm performs better on the CC than on the thalamus because the CC boundary is more distinguishable, but both results show the high accuracy of the algorithm. The algorithm may fail in some rare CC images where the fornix appears very unusual. Fig. 5 shows one example in which the fornix is unusually connected to the posterior-most part of the CC. Since the location of the fornix falls outside the normal range, our algorithm can hardly find the correct fornix; in this case the user has to click the points A and B (Fig. 3(c)) to locate it.
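The five measures can be computed from boolean masks as in the following sketch (ours, for illustration).

```python
# Overlap measures between a segmentation mask and a ground-truth mask.
import numpy as np


def overlap_metrics(seg, true):
    seg, true = seg.astype(bool), true.astype(bool)
    inter = np.logical_and(seg, true).sum()
    union = np.logical_or(seg, true).sum()
    return {
        "FNF": np.logical_and(true, ~seg).sum() / true.sum(),
        "FPF": np.logical_and(seg, ~true).sum() / true.sum(),
        "TPF": inter / true.sum(),
        "Dice": 2 * inter / (seg.sum() + true.sum()),
        "overlap": inter / union,
    }
```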
Table 1. Quantitative validation results

             CC              Thalamus
         Mean    Std      Mean    Std
FNF      0.08    0.06     0.10    0.09
FPF      0.01    0.09     0.05    0.11
Dice     0.92    0.05     0.90    0.07
TPF      0.90    0.06     0.89    0.09
overlap  0.84    0.08     0.77    0.10
Fig. 4. Final results on different images (top: CC; bottom: thalamus)
Fig. 5. The case when the algorithm fails to detect the fornix
5 Conclusion

We have described an automatic segmentation algorithm that combines deformable models and k-means clustering. By using a hierarchical recognition scheme, we are able to produce a seed that is well adapted to the target shape. This close-to-target seed greatly improves the speed and robustness of the deformable models, while the clustering and recognition procedures add only 2-3 seconds on modern CPUs. Results are shown for the corpus callosum and the thalamus, and the quantitative validation demonstrates the accuracy of our algorithm. The only failure case is an unusual appearance of the fornix, which is very rare. In the future, we plan to incorporate more object features into our recognition scheme in order to make it more robust. Furthermore, we will extend this framework to 3D segmentation.

Acknowledgments. This work is supported in part by an NIH pre-doctoral training grant for Clinical Biodetectives, a Thompson Center Research Scholar fund, a Department of Defense Autism Concept Award, and a NARSAD Foundation Young Investigator Award.
References

1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
2. Liang, J., McInerney, T., Terzopoulos, D.: United snakes. Medical Image Analysis 10, 215–233 (2006)
3. McInerney, T., Sharif, M.R.: Sketch initialized snakes for rapid, accurate and repeatable interactive medical image segmentation. In: 3rd IEEE International Symposium on Biomedical Imaging: Nano to Macro, pp. 398–401 (2006)
4. He, Q., Duan, Y., Miles, J., Takahashi, N.: A context-sensitive active contour for 2D corpus callosum segmentation. International Journal of Biomedical Imaging 2007, Article ID 24826, 8 pages (2007)
5. McInerney, T., Hamarneh, G., Shenton, M., Terzopoulos, D.: Deformable organisms for automatic medical image analysis. Medical Image Analysis 6, 251–266 (2002)
6. Cootes, T., Cooper, D., Taylor, C., Graham, J.: Active shape models—their training and application. Computer Vision and Image Understanding 61(1), 38–59 (1995)
7. Cootes, T., Beeston, C., Edwards, G., Taylor, C.: A unified framework for atlas matching using active appearance models. In: Proc. Information Processing in Medical Imaging, Visegrad, Hungary, pp. 322–333 (1999)
8. Taron, M., Paragios, N., Jolly, M.P.: Uncertainty-driven non-parametric knowledge-based segmentation: the corpus callosum case. In: VLSM, pp. 198–207 (2005)
9. Stegmann, M.B., Davies, R.H., Ryberg, C.: Corpus callosum analysis using MDL-based sequential models of shape and appearance. In: Proceedings of SPIE, vol. 5370, pp. 612–619 (2004)
10. Chakraborty, A., Duncan, J.S.: Integration of boundary finding and region-based segmentation using game theory. In: Bizais, Y., et al. (eds.) Information Processing in Medical Imaging, pp. 189–201. Kluwer, Dordrecht (1995)
11. Chakraborty, A., Worring, M., Duncan, J.S.: On multifeature integration for deformable boundary finding. In: Proc. Intl. Conf. on Computer Vision, pp. 846–851 (1995)
12. Ronfard, R.: Region-based strategies for active contour models. Intl. J. of Computer Vision 13(2), 229–251 (1994)
13. Zhu, S.C., Lee, T.S., Yuille, A.L.: Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. In: Proc. Intl. Conf. on Computer Vision, pp. 416–423 (1995)
14. Jones, T.N., Metaxas, D.N.: Segmentation using deformable models with affinity-based localization. In: CVRMed (1997)
15. Jones, T.N., Metaxas, D.N.: Image segmentation based on the integration of pixel affinity and deformable models. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, p. 330 (1998)
16. Ronse, C., Devijver, P.A.: Connected Components in Binary Images: The Detection Problem. Research Studies Press (1984)
17. Zadeh, L.A.: Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic. Fuzzy Sets and Systems 90, 111–127 (1997)
18. Hata, Y., Kobashi, S., Hirano, S., Kitagaki, H., Mori, E.: Automated segmentation of human brain MR images aided by fuzzy information granulation and fuzzy inference. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 30(3) (2000)
19. Kandel, A.: Fuzzy Expert Systems. CRC, Boca Raton (1992)
20. Pedrycz, W.: Control and Fuzzy Systems. Research Studies Press Ltd., London (1993)
21. Kruse, R., Gebhardt, J., Klawonn, F.: Foundations of Fuzzy Systems. Wiley, New York (1994)
22. Bello, M., Ju, T., Carson, J., Warren, J., Chiu, W., Kakadiaris, I.A.: Learning-based segmentation framework for tissue images containing gene expression data. IEEE Transactions on Medical Imaging 26(5) (2007)
23. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Transactions on Image Processing 10(2) (2001)
24. Belyaev, A.G., Anoshkina, E.V., Yoshizawa, S., Yano, M.: Polygonal curve evolutions for planar shape modeling and analysis. International Journal of Shape Modeling 5(2), 195–217 (1999)
25. Lorensen, W.E., Cline, H.E.: Marching Cubes: A high resolution 3D surface construction algorithm. In: Proceedings of SIGGRAPH 1987, vol. 21(4), pp. 163–169 (1987)
Brain Lesion Segmentation through Physical Model Estimation

Marcel Prastawa and Guido Gerig

Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT 84112, USA
{prastawa,gerig}@sci.utah.edu
Abstract. Segmentation of brain lesions from Magnetic Resonance (MR) images is crucial for the quantitative analysis of lesion populations in neuroimaging of neurological disorders. We propose a new method for segmenting lesions in brain MRI by inferring the underlying physical model of the pathology. We use the reaction-diffusion model as our physical model, where the diffusion process is guided by real diffusion tensor fields obtained from Diffusion Tensor Imaging (DTI). The method performs segmentation by solving the inverse problem: it determines the optimal parameters of the physical model that generates the observed image. We show that the proposed method can infer reasonable models for multiple sclerosis (MS) lesions and for healthy MRI data. The method has potential for further extension with different physical models, or even non-physical models based on existing segmentation schemes.
1 Introduction
Brain lesions are loosely defined as abnormal tissue structures that are somehow damaged. Their appearance is highly correlated with many degenerative disorders, such as multiple sclerosis (MS), lupus, and stroke. Quantitative analysis of brain lesions, as observed through MRI, is becoming an important part of clinical diagnosis, treatment planning, and the evaluation of drug efficacy. Fully automatic lesion segmentation methods that are objective and reproducible are necessary to facilitate large population studies involving brain lesions. Automatic segmentation of lesions from MRI data is a challenging problem. Lesions typically have weakly defined borders against the surrounding structures (which can be white matter alone, or both white and gray matter). The shapes of individual lesions generally follow the fiber tracts in white matter. The FLAIR (Fluid-Attenuated Inversion Recovery) channel, the typical MRI modality used for lesion detection, is almost always corrupted with an
This work is part of the National Alliance for Medical Image Computing (NA-MIC), funded by the National Institutes of Health through Grant U54 EB005149. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics
artifact in which the ventricle boundaries appear bright. This artifact is caused by the combination of the pulsation of the ventricles and the long echo time (TE) of the FLAIR protocol [1]. It is difficult to account for all these problems in lesion segmentation schemes without a good model of the underlying physical process or some empirical rules.

Many methods have been proposed for automatic lesion segmentation, particularly for MS lesions. Zijdenbos et al. proposed a method based on a neural network classifier [2]. Thirion et al. [3] and Rey et al. [4] segmented lesions based on the local deformation between scans at different time points. Van Leemput et al. [5] proposed a method that detects lesions as outliers from the intensity distributions of healthy tissue. However, most lesion segmentation schemes do not use physical models of pathology. Physical modeling has been successfully applied to the registration problem in the presence of pathology: Mohamed et al. [6] used a tumor growth model to estimate the deformation map between a healthy subject and a tumor subject, with the registration driven by manually selected landmarks within the brain.

We propose a method for segmenting lesions from brain MRI by estimating the underlying physical model. The physical model used in this paper is the classic reaction-diffusion process, where the diffusion is guided by a diffusion tensor field [7]; each diffusion tensor describes the likely diffusion directions of water within a local region. Segmentation is performed by determining the parameters of the physical model that best matches the observed images. We demonstrate the results of applying the method to a multiple sclerosis (MS) lesion dataset and to a healthy dataset. For the MS lesion data, we show that the inferred lesion model generates images that are similar to the observed MR images; for the healthy data, we show that the method correctly infers that the lesion model is inappropriate. Our method is not only useful for segmenting the observed lesions in the MR images, but also estimates the underlying lesion formation process, which may have potential for further analysis.
2 Method
The method computes a model for the observed multimodal MR images I_O = {I_O^c}, where c indexes the different modalities (in this paper, T1w and T2w). The model M is composed of two components: a physical model for pathology and an image generation model. Both models use the information contained in the template data, shown in Figure 1. The template data contain the expected MR intensities, the anatomical structure (in the form of a set of spatial priors), and the expected diffusion tensor field. We use the MR images and anatomical information provided by the Montreal Neurological Institute (MNI) through the BrainWeb interface [8]. The BrainWeb data do not contain diffusion tensor image information, so we align the mean diffusion tensor of a separate population to the BrainWeb data. The diffusion tensor field for the population is created using the method described by Goodlett et al. [9].
Fig. 1. The template data used for inferring the physical model and the image generation model. The model is based on subject 04 from the twenty normals BrainWeb database. Top row, from left to right: axial view of the T1w image, T2w image, and the white matter prior. Bottom row, from left to right: the fractional anisotropy (FA) values that show the structure of the fibers as observed through the diffusion tensors, the mean diffusivity (MD) values that show the magnitude of the diffusion tensor field, and a 3D view of a portion of the tensor field where each tensor is represented by an ellipsoid.
All image alignment and registration are performed using affine transformations and deformable transformations parametrized by B-splines, with the mutual information image match metric [10].

2.1 Physical and Image Models
We assume that the lesion formation process is modulated by the underlying fiber structure, so the pathological process of lesions is modeled using the following reaction-diffusion equation:

∂φ/∂t = −∇ · (αD∇φ) + βφ   (1)
where φ(x, t) denotes the lesion probability at location x and time t, D denotes the diffusion tensor field, α denotes the diffusion modifier, and β denotes the reaction coefficient. In this model, α is a scalar that represents the rate of spread of the lesions and β is a scalar that represents their local growth rate. The solution for φ is obtained by discretization and interpolation using the finite element method [11]. As part of the physical model, we also specify the points where lesions begin to appear. The set of n lesion seed points is denoted by X_seeds = {x_k | 1 ≤ k ≤ n}. Each individual lesion seed point x_i is used to define an initial local probability
φ^(i) that is evolved independently with different parameters α_i and β_i. The initial lesion probability φ^(i)(x, t = 0) is defined as

φ^(i)(x, t = 0) = 1 if |x − x_i| < ε, and 0 otherwise.   (2)

The final lesion probability is the accumulation of the probabilities associated with each seed point after applying the reaction-diffusion process:

p_lesion(x) = Σ_i φ^(i)(x, t = t_s)   (3)
where ts is the simulation time. In addition to the physical model for pathology, we also model the appearance of MRI presenting lesions. We synthesize MR images IM from the physical model M that is constrained by Equation 1. More precisely, given the lesion probability field φ, the synthesized MR images are computed using the following equation. c IM (x) = (1 − plesion (x)) × ITc (x) + plesion (x) × Lc
(4)
where IT is the set of MRI images from the template, and Lc is the expected lesion appearance for channel c. We assume that the simulation time ts is known or can be reasonably approximated. The lesion appearance in each channel is computed as follows: (J(x) − J wm ) Lc = ITwm + range(IT ) × Ex∈lesion (5) range(J) where ITwm is the white matter intensity for the template, E[ ] denotes the expectation function, J is a training image with marked lesion regions, and J wm is the white matter intensity for the training image J. We formulate the segmentation problem as an inverse problem. Given the observed image IO , segmentation is equivalent to determining the optimal parameters for the reaction-diffusion equation (Equation 1) that generates the synthetic image IM that best matches the observation. The image match function in our algorithm is the linear combination of the mutual information image match metric, which is chosen as MR images are not normalized or standardized across different scans: c c match(IO , IM ) = wc MI(IO , IM ) (6) c
where each $w_c$ is a predefined weight for channel $c$ that determines the relative influence of each channel in the estimation. The mutual information for images with modality $c$ is the information-theoretic measure

$$\mathrm{MI}(I_O^c, I_M^c) = H(I_O^c) + H(I_M^c) - H(I_O^c, I_M^c) \qquad (7)$$
where H(·) denotes entropy. The mutual information image match metric has been successfully used for aligning 3D medical images [10].
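For illustration, a minimal NumPy sketch of the forward synthesis in Eq. (4) and the mutual-information match of Eqs. (6)-(7) is given below; the histogram-based MI estimator, the number of bins, and the function names are choices made for this example and are not taken from the paper.

```python
import numpy as np

def synthesize_channel(template_c, lesion_prob, lesion_appearance_c):
    """Eq. (4): blend the template intensities with the expected lesion appearance."""
    return (1.0 - lesion_prob) * template_c + lesion_prob * lesion_appearance_c

def mutual_information(a, b, bins=64):
    """Eq. (7): MI(a, b) = H(a) + H(b) - H(a, b), estimated from a joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    return entropy(px) + entropy(py) - entropy(pxy)

def image_match(observed, synthesized, weights):
    """Eq. (6): weighted sum of per-channel mutual information.
    observed/synthesized are dicts keyed by channel, e.g. {'T1w': ..., 'T2w': ...}."""
    return sum(w * mutual_information(observed[c], synthesized[c])
               for c, w in weights.items())
```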
2.2 Optimization Scheme
The optimization scheme for $\mathrm{match}(I_O, I_M)$ starts with an initial definition of a set of seed points. The points can be drawn at random or chosen using some heuristics to obtain the most likely candidates. In this paper, the initial selection of lesion seed points $X_{seeds}$ is performed by first partitioning the image into relevant regions by applying the watershed transform to the magnitude of the multivariate gradient [12]. We then select the region centers that have high white matter probability and T2w intensity larger than the expected gray matter T2w intensity. More precisely, after partitioning $I_O$ into a set of regions $R_i$ where $R_i \cap R_j = \emptyset,\ \forall i \ne j$, we pick locations $\bar{x}_i = \frac{1}{|R_i|} \sum_{x \in R_i} x$ as lesion seed points where

$$\frac{1}{|R_i|} \sum_{x \in R_i} p(x \mid wm) > 0.9 \quad \text{and} \quad \frac{1}{|R_i|} \sum_{x \in R_i} I_O^{(T2)}(x) > \mathrm{median}(I_O^{(T2)} \mid gm). \qquad (8)$$
This condition ensures that the method chooses the centers of regions that have high white matter probability and are brighter than the median of the T2w image samples obtained by sampling the prior $p(x \mid gm)$. The condition associated with the T2w image channel is used to exclude samples that are unlikely to be lesions, for efficiency. However, the method can proceed without enforcing the T2w intensity constraint. Due to the combinatorial cost of computing the model parameters $M$ with the optimal image match measure, we propose the following greedy algorithm (a code sketch is given after this list):

1. Initialization: Set $X_{seeds} = \{\bar{x}_i\}$, following the condition in Equation 8. Initialize $I_M$ using template image intensities, and initialize the lesion probability $p_{lesion}(x) \leftarrow 0$.
2. Seed insertions: For each seed point $x_i \in X_{seeds}$, the algorithm attempts to insert a single lesion generated by evolving $\phi^{(i)}(x_i, 0) = 1$ following Equation 1. Each point $x_i$ is associated with different reaction-diffusion parameters $\alpha_i$ and $\beta_i$, which are initialized by setting $\alpha_i = 1$ and $\beta_i = 0.5$.
   (a) Position update: The point $x_i$ is perturbed using Brownian motion to determine the optimal insertion location in the local neighborhood.
   (b) Model parameter update: Optimize $\alpha_i$ and $\beta_i$ so that they maximize the image match function $\mathrm{match}(I_O, I_M)$. $I_M$ is generated using the global lesion probability combined with the probability $\phi^{(i)}(x, t)$ generated by evolving the probability defined around the lesion seed point $x_i$ using the reaction-diffusion process governed by $\alpha_i$ and $\beta_i$.
   (c) Global update: If the insertion of point $x_i$ with optimal parameters $\alpha_i$ and $\beta_i$ results in an increase of $\mathrm{match}(I_O, I_M)$, the algorithm updates the global lesion probability: $p_{lesion}(x) \leftarrow p_{lesion}(x) + \phi^{(i)}(x, t)$. The algorithm also updates the synthetic images $I_M$ by applying Equation 4. Otherwise, the point $x_i$ is rejected and the global lesion probability and $I_M$ are unchanged.
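The following Python skeleton sketches the greedy insertion loop of step 2; `optimize_seed` stands in for the Brownian position update, the reaction-diffusion evolution and the simplex optimization of ($\alpha_i$, $\beta_i$), and `image_match` for Eq. (6). These helper names and the data layout are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def greedy_seed_insertion(observed, template, lesion_appearance, seeds,
                          optimize_seed, image_match):
    """Greedy seed insertion: a seed is kept only if it increases the image match
    between the synthesized images I_M and the observed images I_O (step 2c)."""
    p_lesion = np.zeros_like(template['T1w'])          # global lesion probability
    synth = dict(template)                             # synthesized images I_M
    best_match = image_match(observed, synth)
    accepted = []

    for x_i in seeds:
        # locally optimize the per-seed parameters, starting from alpha=1, beta=0.5
        alpha_i, beta_i, phi_i = optimize_seed(observed, synth, p_lesion, x_i,
                                               alpha0=1.0, beta0=0.5)
        trial_p = p_lesion + phi_i                     # Eq. (3): accumulate probabilities
        trial_synth = {c: (1 - trial_p) * template[c] + trial_p * lesion_appearance[c]
                       for c in template}              # Eq. (4)
        trial_match = image_match(observed, trial_synth)
        if trial_match > best_match:                   # accept: global update
            p_lesion, synth, best_match = trial_p, trial_synth, trial_match
            accepted.append((x_i, alpha_i, beta_i))
        # otherwise the seed is rejected and the global state is unchanged
    return p_lesion, accepted
```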
[Figure 2 panel columns, for each block: Initial, Final, Truth]
Fig. 2. Behavior of the algorithm for different initial seed points. Left block: a proper seed point that is inserted into the final model, the lesion model parameters have been optimized to fit the observed image. Right block: an incorrect seed point that is rejected by the algorithm. Top row: the T2w image intensities. Bottom row: the lesion probabilities.
Figure 2 shows the result of the seed insertions and the optimization of the local reaction-diffusion parameters for a single lesion seed point, where an incorrect seed point is not inserted and the underlying lesion probability from a correct seed point becomes more sharply defined. For an improper seed point, the seed is rejected due to the reduced image match and the lesion probabilities in the region are unchanged. For a correct seed point, the lesion appearance is optimized to fit the observed data.
3 Results
We have applied the method to two datasets. The first dataset is the MS lesion BrainWeb MRI, and the second dataset is the original healthy BrainWeb MRI. Figure 3 shows the input MR images, the synthesized MR images, and the lesion probability generated by the method. The ground truth segmentations are compared against the computed lesion segmentations using a volumetric measure and a spatial overlap measure. The volumetric measure is the total lesion load (TLL), which is computed as the sum of the lesion probabilities: $\sum_x p(lesion \mid x)$. The spatial overlap measure is the Dice similarity coefficient (DSC), which is defined for two binary segmentations $A$ and $B$ as the ratio between the volume of the overlap region and the average volume:

$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}. \qquad (9)$$

We compute the lesion binary mask by thresholding the lesion probabilities, where probability values $> 0.5$ are considered part of a lesion object. The DSC overlap metric tends to assign a higher penalty to errors in the segmentation of small structures. A good segmentation of a large brain structure such as the white matter typically yields a DSC value above 90%.
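For reference, a short NumPy sketch of the two measures, assuming the lesion probability map and a binary ground-truth mask are available as arrays of the same shape; the function names are ours.

```python
import numpy as np

def total_lesion_load(p_lesion):
    """TLL: sum of the lesion probabilities over all voxels."""
    return float(p_lesion.sum())

def dice_coefficient(p_lesion, truth, threshold=0.5):
    """Eq. (9): DSC between the thresholded lesion mask and the ground truth."""
    a = p_lesion > threshold
    b = truth.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0
```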
For the BrainWeb MS lesion data, the ground truth TLL is 6697.5 and the TLL for the probability generated by our method is 5395.3. This shows a difference of approximately 20% compared to the ground truth TLL. The amount of spatial overlap between the automatic segmentation results and the ground truth is measured to be 64%. The DSC value for our lesion segmentation is lower than the ideal value of 90% due to the fact that lesions typically appear as multiple small, spatially disconnected structures. Given this spatial configuration, the
Fig. 3. Results of the method proposed in this paper for the BrainWeb MS lesion data (top) and the original healthy BrainWeb data (bottom). The individual images for each row are: axial views of (a) the input T1w image, (b) the input T2w image, (c) the synthesized T1w image, (d) the synthesized T2w image, (e) the lesion ground truth, and (f) the computed lesion probability.
DSC measure tends to accumulate penalties for the segmentation mismatches of each small lesion object. For the healthy BrainWeb data, the method correctly rejected all seed points and generated zero lesion probabilities. Our optimization scheme was able to determine that including anatomical changes due to lesions would result in a dataset that does not match the healthy subject dataset.
4 Discussion
This paper presents a new approach for segmenting lesions from multimodal brain MR images, where segmentation is performed by inferring the physical model for the underlying pathology. Compared to other lesion segmentation methods, our approach provides an estimate of the underlying lesion formation process in addition to the segmentation of lesions in MRI. Preliminary results show that the method is promising for automatic lesion detection. Our method generates 3D segmentation results, as shown in Figure 4. Quantification and localization of brain lesions in 3D might help clinicians to assess status and progress of the disease, and eventual drug efficacy as observed through white matter changes in the brain. Our approach can be generalized to more sophisticated physical and image models. One advantage of our method is that it computes the growth parameters αi and βi for each lesion seed point in addition to the segmentation. This may have potential applications in longitudinal analysis of lesion image data. The results demonstrate that our method has the potential of being applied to mixed populations of subjects with and without lesions. The accuracy of the lesion segmentation is limited since we have not incorporated information from
Fig. 4. 3-D views of the segmented structures from the BrainWeb MS images. Lesions are shown as the bright gray structures, and the lateral ventricles are shown in dark gray for anatomical reference.
the subject-specific diffusion tensor fields. The mismatch of the diffusion tensor information between the DTI atlas (i.e., the population model) and the actual diffusion properties of the subject may result in suboptimal estimation of the lesion probabilities. We also detect false positives and false negatives, as shown in Figure 3. We expect that future improvements of the greedy optimization scheme, combined with a better strategy for selecting the initial seed points, will reduce the number of false positives and false negatives. In the future, we intend to extend the method in several ways. For example, we plan to use extended physical models with a deformation model. Another extension to the model would be to determine the optimal reaction and diffusion coefficients $\alpha$ and $\beta$ based on spatial location (e.g., as defined by an anatomical parcellation). We also plan to incorporate the FLAIR image channel by modeling the pulsation of the ventricular system. The current model can also be extended by incorporating longitudinal information. In particular, the segmentation of a previous time point can be used to drive the segmentation of a different time point [3,4]. We currently use high-resolution diffusion tensor information obtained by averaging over a population. To improve the accuracy of our segmentations, we will need to incorporate subject-specific diffusion tensor information when it is available. The models used in this segmentation framework need not be restricted to physical models. We plan to explore the integration of statistical segmentation schemes and machine learning models into the framework.
References

1. Bakshi, R., Caruthersa, S.D., Janardhana, V., Wasay, M.: Intraventricular csf pulsation artifact on fast fluid-attenuated inversion-recovery MR images: Analysis of 100 consecutive normal studies. AJNR 21, 503–508 (2000)
2. Zijdenbos, A., Forghani, R., Evans, A.: Automatic quantification of MS lesions in 3D MRI brain data sets: Validation of INSECT. In: Wells, W.M., Colchester, A.C.F., Delp, S.L. (eds.) MICCAI 1998. LNCS, vol. 1496, pp. 439–448. Springer, Heidelberg (1998)
3. Thirion, J.P., Calmon, G.: Deformation analysis to detect and quantify active lesions in three-dimensional medical image sequences. IEEE TMI 18, 429–441 (1999)
4. Rey, D., Subsol, G., Delingette, H., Ayache, N.: Automatic detection and segmentation of evolving processes in 3D medical images: Application to multiple sclerosis. Medical Image Analysis 6, 163–179 (2002)
5. van Leemput, K., Maes, F., Vandermeulen, D., Colchester, A., Suetens, P.: Automated segmentation of multiple sclerosis lesions by model outlier detection. IEEE TMI 20, 677–688 (2001)
6. Mohamed, A., Zacharaki, E.I., Shen, D., Davatzikos, C.: Deformable registration of brain tumor images via a statistical model of tumor-induced deformation. Medical Image Analysis 10 (2006)
7. Clatz, O., Sermesant, M., Bondiau, P.Y., Delingette, H., Warfield, S.K., Malandain, G., Ayache, N.: Realistic simulation of the 3D growth of brain tumors in MR images including diffusion and mass effect. IEEE Transactions on Medical Imaging 24, 1344–1346 (2005)
8. Cocosco, C.A., Kollokian, V., Kwan, R.S., Evans, A.C.: BrainWeb: Online interface to a 3D MRI simulated brain database. NeuroImage 5 (1997)
9. Goodlett, C., Davis, B., Jean, R., Gilmore, J., Gerig, G.: Improved correspondence for DTI population studies via unbiased atlas building. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4191, pp. 260–267. Springer, Heidelberg (2006)
10. Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 16, 187–198 (1997)
11. Hughes, T.J.R.: The finite element method: linear static and dynamic finite element analysis. Dover (2000)
12. Lee, H.C., Cok, D.R.: Detecting boundaries in a vector field. IEEE Trans. Signal Processing 39, 1181–1194 (1991)
Calibration of Bi-planar Radiography with a Rangefinder and a Small Calibration Object

Daniel C. Moura 1,2, Jorge G. Barbosa 2, João Manuel R.S. Tavares 3,4, and Ana M. Reis 5
1 Instituto de Eng. Biomédica, Lab. de Sinal e Imagem, Campus da FEUP, Portugal
2 U. do Porto, Faculdade de Engenharia, Dep. Eng. Informática, Portugal
3 Instituto de Engenharia Mecânica e Gestão Industrial, Campus da FEUP, Portugal
4 U. do Porto, Faculdade de Engenharia, Dep. de Eng. Mecânica e Gestão Industrial, Rua Dr. Roberto Frias, 4200-465 Porto, Portugal
{daniel.moura,jbarbosa,tavares}@fe.up.pt
5 SMIC, R. Pedro Hispano, 881, 4250-367 Porto, Portugal
Abstract. In this paper we propose a method for geometrical calibration of bi-planar radiography that aims at minimising the impact of calibration objects on the content of radiographs, while remaining affordable. To accomplish this goal, we propose a small extension to conventional imaging systems: a low-cost rangefinder that enables estimation of some of the geometrical parameters. Even with this extension, a small calibration object is needed for correcting scale. The proposed method was tested on 17 pairs of radiographs of a phantom object of known dimensions. For calculating scale, only a reference distance of 40mm was used. Results show an RMS error of 0.36mm with 99% of the errors inferior to 0.85mm. We conclude that the method presented here performs robustly and achieves sub-millimetric accuracy, while remaining affordable and requiring only two radiopaque marks visible in radiographs.
1 Introduction
Nowadays, Computed Tomography (CT) is the gold standard for 3D reconstructions of bone structures. However, CT scans may not be used for accurate reconstructions of large bone structures, such as the spine, because of the high doses of radiation that are necessary. Additionally, when compared to radiography, CT scans are more expensive, more invasive, less portable, and require patients to be lying down. Therefore, using radiography for obtaining 3D reconstructions and accurate measurements remains an interesting alternative to CT. Currently, it is possible to do 3D reconstructions of the spine [1,2], pelvis [3], distal femur [4] and proximal femur [5] with minimal radiation by subjecting the patient to just two radiographs (bi-planar radiography). For achieving this, all these methods require a calibration procedure that must be executed for every examination in order to capture the geometry of the x-ray imaging system. Usually, this calibration is performed using very large calibration apparatus that
surround the patient and introduce undesirable objects in radiographs. These apparatus are neither practical nor affordable. Not surprisingly, efforts have been made to use smaller calibration objects [6,7,8], or even to eliminate them altogether [9,10]. Currently, and to our knowledge, no method is capable of accurate reconstructions without using calibration objects. Kadoury et al. were able to calculate angular measures from spine radiographs without using any calibration object, but absolute measures scored very poor results and were considered to be unreliable [10]. As for methods that use small calibration objects, reconstruction errors remain considerably higher than when using large apparatus, and a significant number of undesirable objects is still visible in radiographs. The method proposed in this paper intends to show that it is possible to obtain accurate calibrations of bi-planar radiography using small calibration objects that produce minimal changes to radiographs. To help accomplish this goal, we propose a small extension to conventional x-ray imaging systems: a rangefinder that provides good estimates of some of the geometrical parameters of the system. For accurately assessing this method, experiments were conducted on a phantom object of known geometry. The remainder of this paper is structured as follows: Section 2 describes the proposed method, Section 3 presents the validation experiments and their results, and Section 4 discusses these results and presents the conclusions of this work.
2 Methods
This section describes the proposed method. It starts with a general introduction to the problem of radiography calibration and presents a well-known but insufficient solution for solving it without calibration objects. Then, an extension to this solution is proposed, which is based on using a rangefinder to determine some of the calibration parameters. Unfortunately, such an extension only enables reconstructions up to scale, and therefore we also propose using a small calibration object to correct scale.

2.1 Radiography Calibration
In bi-planar x-ray systems, the projection of a 3D point in each of the two radiographs may be calculated as:

$$\begin{bmatrix} w_i \cdot u_i \\ w_i \cdot v_i \\ w_i \end{bmatrix} = M_i \cdot \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \quad \text{for } i = 1, 2, \qquad (1)$$

where for each acquisition $i$, $M$ is the calibration matrix that describes the projection of the 3D point $(X, Y, Z)$ into image coordinates $(u, v)$ subjected to a scaling factor $w$. For flat x-ray detectors, $M$ may be modelled as:

$$M_i = \begin{bmatrix} f_i/s & 0 & u_{p_i} & 0 \\ 0 & f_i/s & v_{p_i} & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} R_i & t_i \\ 0 & 1 \end{bmatrix}, \qquad (2)$$
where $f$ is the focal distance (the distance between the x-ray source and the detector), $s$ is the known sampling pitch of the detector, $(u_p, v_p)$ is the principal point (the 2D projection of the x-ray source in the image), and $R$ and $t$ define the geometrical transformation (represented by a $4 \times 4$ matrix) that aligns the object coordinate system with the source coordinate system. More precisely, $t$ is a translation vector that may be decomposed into $(t_x, t_y, t_z)$ and $R$ is a $3 \times 3$ rotation matrix that depends on three angles: an $\alpha$ rotation around the $X$ axis, a $\beta$ rotation around the $Y$ axis, and a $\gamma$ rotation around the $Z$ axis. The goal of the calibration procedure is to find the optimum values of the calibration parameters:

$$\xi_i = (f_i, u_{p_i}, v_{p_i}, t_{x_i}, t_{y_i}, t_{z_i}, \alpha_i, \beta_i, \gamma_i) \quad \text{for } i = 1, 2. \qquad (3)$$
When not using calibration objects, this is usually done by minimising the retro-projection error of a set of point matches marked in the two images [9,10]. This problem may be formulated as a least-squares minimisation:

$$\min_{\xi_1^*, \xi_2^*} \left( \sum_{i=1}^{2} \sum_{j=1}^{n} \left\| p_{ij} - prj\big(\xi_i, tri(\xi_1, \xi_2, p_{1j}, p_{2j})\big) \right\|^2 \right), \qquad (4)$$
where $n$ is the number of marked point matches, $p_{ij}$ is the $j$th point marked in image $i$, $prj$ is the 2D projection of a 3D point as defined in Equations 1 and 2, $tri$ is a triangulation operation that calculates the 3D coordinates for a given point match, and $\xi_1^*, \xi_2^*$ are the optimised parameters for images 1 and 2 respectively. Calibration may be accomplished using a standard nonlinear least-squares minimisation algorithm. This class of algorithms needs an initial solution for the calibration parameters, which is then iteratively updated towards reducing the sum of squared distances between the marked and retro-projected points. Unfortunately, the search space of solutions is very large and this procedure gets easily trapped in local minima.
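To make the projection model concrete, the NumPy sketch below builds $M_i$ from a parameter vector $\xi_i$ (Eq. 2), projects a 3D point (Eq. 1) and accumulates the retro-projection error of Eq. (4). The Euler-angle composition order and the `triangulate` helper are assumptions made for this example; the paper does not specify them.

```python
import numpy as np

def calibration_matrix(f, s, up, vp, tx, ty, tz, alpha, beta, gamma):
    """Eq. (2): intrinsic matrix of a flat detector times the rigid transformation."""
    K = np.array([[f / s, 0.0,   up, 0.0],
                  [0.0,   f / s, vp, 0.0],
                  [0.0,   0.0,  1.0, 0.0]])
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx          # one possible composition of the three rotations
    T[:3, 3] = [tx, ty, tz]
    return K @ T                      # 3x4 calibration matrix M_i

def project(M, P):
    """Eq. (1): project a 3D point P = (X, Y, Z) to image coordinates (u, v)."""
    w_uv = M @ np.append(P, 1.0)
    return w_uv[:2] / w_uv[2]

def retro_projection_error(xi1, xi2, s, matches, triangulate):
    """Objective of Eq. (4): sum of squared distances between marked and
    retro-projected points; `triangulate` returns a 3D point for a 2D match."""
    M1 = calibration_matrix(xi1[0], s, *xi1[1:])
    M2 = calibration_matrix(xi2[0], s, *xi2[1:])
    err = 0.0
    for p1, p2 in matches:
        P = triangulate(M1, M2, p1, p2)
        err += np.sum((np.asarray(p1) - project(M1, P)) ** 2)
        err += np.sum((np.asarray(p2) - project(M2, P)) ** 2)
    return err
```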
2.2 Narrowing the Search Space of Solutions
In order to reduce the search space of solutions, a laser rangefinder was attached to the x-ray machine, which allows estimating the focal length ($f$) and obtaining an initial guess of the distance between the object and the x-ray source ($t_z$). Figure 1 illustrates how the device is attached to the x-ray machine. The device is only capable of measuring the distance between the x-ray machine and the table ($d_m$). In order to calculate $f$, two more parameters need to be determined: $d_s$, the distance from the x-ray source to the plane of the x-ray device where the x-rays come out, and $d_d$, the distance from the table to the x-ray detector. These parameters are fixed for a given imaging system but may be difficult to measure directly. Therefore, we propose finding them indirectly using a radiopaque planar grid (calibration phantom) of known dimensions (Figure 3). The procedure described below only needs to be executed once for a given imaging system.
Fig. 1. Illustration of a conventional radiographic imaging system with a laser rangefinder attached (left), and the coordinate system of reference (right)
First, the grid should be placed on the table of the imaging system, roughly centred ($t_x \approx 0$mm, $t_y \approx 0$mm), and roughly parallel to the detector with no tilt ($\alpha \approx 0°$, $\beta \approx 0°$, $\gamma \approx 0°$). This positioning is easy to accomplish and provides a good initial guess for most of the calibration parameters. Parameters $(u_p, v_p)$ may be defined as the centre of the image, and parameters $f$ and $t_z$ may be redefined in the following way:

$$f = d_s + d_m + d_d, \qquad (5)$$
$$t_z = d_s + d_m - o_t/2, \qquad (6)$$
where $o_t$ is the phantom thickness. However, because the calibration object is planar and because of its positioning, there is an infinite range of solutions for $d_s$ and $d_d$ that achieve optimal projections of the grid. This happens because increasing one of these parameters may be compensated by increasing the other. Therefore, several radiographs at different distances $d_m$ should be acquired, and, for each of them, the relation between $d_s$ and $d_d$ should be determined. Then, by crossing results, one may find the point where $d_s$ and $d_d$ are the same for all setups (Figure 2). For finding this relation for a given setup, we ran several optimisations with different values of $d_s$ (within an admissible range determined using direct measurements). By fixing $d_s$ in each optimisation it is possible to determine the corresponding $d_d$ for that setup (Figure 2). Therefore, for each setup, and for each tested $d_s$, the set of parameters to optimise is:

$$\zeta = (d_d, u_p, v_p, t_x, t_y, \alpha, \beta, \gamma). \qquad (7)$$
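As a small illustration of Eqs. (5)-(6), the helper below computes $f$ and $t_z$ from a rangefinder reading $d_m$ once the system constants $d_s$, $d_d$ and the phantom thickness $o_t$ are known; the function name is ours, not the authors'.

```python
def focal_and_source_distance(dm, ds, dd, ot):
    f = ds + dm + dd           # Eq. (5): focal distance (x-ray source to detector)
    tz = ds + dm - ot / 2.0    # Eq. (6): source to the middle of the phantom
    return f, tz
```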
Once again, this problem may be formulated as a least-squares minimisation, but, this time, we minimise the projection error of the known 3D positions of the corners of the phantom squares:

$$\min_{\zeta^*} \sum_{i=1}^{n} \left\| p_i - prj(\zeta, d_m, d_s, o_t, P_i) \right\|^2, \qquad (8)$$
Fig. 2. Variation of $d_s$ as a function of $d_d$ for each setup ($d_m$ ranging from 409mm to 909mm)
where $n$ is the number of 2D points (corners of the phantom) visible on the radiograph, $p_i$ is the $i$th 2D point, $P_i$ is the corresponding 3D coordinate on the phantom model, $prj$ is the 2D projection of a 3D point as defined in Equations 1 and 2 (with the change that $f$ and $t_z$ defined in Equation 2 should be calculated using Equations 5 and 6 respectively), and $\zeta^*$ is the optimised set of parameters. Once $d_s$ and $d_d$ are calculated for a given system, it is straightforward to calculate the focal length and the distance between the x-ray source and the table where the target is positioned just by using a rangefinder. This last value may be used for calculating an initial guess of the target position, since it is usually located near the table.
2.3 Correcting Scale
In our experiments, using the rangefinder only enabled determining up-to-scale solutions for the 3D coordinates. For correcting scale, a reference measure is needed. Such a measure may be obtained using a small calibration object composed of only two radiopaque parts placed at a known distance, which should be attached to the patient and be visible in both radiographs. The scaling factor may be calculated as the ratio between the real distance between the two radiopaque objects and the distance between the reconstructed 3D coordinates of the same objects.
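A minimal sketch of the scale correction, assuming the two radiopaque marks have already been triangulated to 3D points and their true separation is known; the names are illustrative.

```python
import numpy as np

def scale_factor(mark1_3d, mark2_3d, real_distance):
    """Ratio between the known mark separation and the reconstructed one."""
    return real_distance / np.linalg.norm(np.asarray(mark1_3d) - np.asarray(mark2_3d))

def apply_scale(points_3d, s):
    """Bring up-to-scale reconstructed points into real-world units (e.g. mm)."""
    return s * np.asarray(points_3d)
```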
3 Results

3.1 Determining $d_s$ and $d_d$
For determining parameters $d_s$ and $d_d$ we built a phantom made of stainless steel (AISI 304) with dimensions of 380×380×1mm and laser-cut squares of 20.0 ± 0.2mm (Figure 3). Eight radiographs of the phantom were acquired at different distances ($d_m$ ranging from 409 to 909mm) while keeping all other parameters constant. For each radiograph the distance between the x-ray device and the table ($d_m$)
was measured using a laser rangefinder (typical error of ±1.5mm, maximum error of ±3.0mm, and range of operation of 0.05–50m). For the optimisation process, we used the coordinates of the corners of the phantom. For each image the 2D coordinates of the corners were extracted semi-automatically with the Camera Calibration Toolbox for Matlab¹ and then optimised with OpenCV². Figure 2 shows the relation between $d_s$ and $d_d$ for each setup. We chose the value of $d_s$ where the standard deviation of $d_d$ was minimum (which was 0.03mm). Parameter $d_d$ was calculated by averaging the values of $d_d$ of all setups for the selected $d_s$.
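A small sketch of that selection step; `dd_table[i, j]` is assumed to hold the optimised $d_d$ obtained for setup $i$ when $d_s$ is fixed to `candidate_ds[j]`, and the names are ours.

```python
import numpy as np

def select_ds(candidate_ds, dd_table):
    """Pick the d_s for which the d_d estimates agree best across setups,
    then average d_d over all setups at that d_s."""
    spread = dd_table.std(axis=0)      # std of d_d across setups, per candidate d_s
    j = int(np.argmin(spread))
    return candidate_ds[j], dd_table[:, j].mean()
```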
3.2 Assessment of the Calibration Method
For evaluating the proposed calibration method, we used the same phantom from the previous procedure. This time, eight radiographs were taken with the phantom at different positions and with different orientations. Distance $d_m$ was the same for all setups ($d_m$ = 909mm), resulting in a focal length of $f$ = 1257.7mm and in a distance between the x-ray source and the table of 1183.0mm. Film size was 14 × 17 in. (355.6 × 431.8mm), scanned with a sampling pitch of 175.0 μm/pixel, resulting in images with a resolution of 2010 × 2446 pixels. The eight radiographs were combined in a total of 17 pairs (out of 28 possible combinations). Only pairs of radiographs with considerably different poses were considered. Pairs of radiographs with near poses were discarded because they are less tolerant to triangulation errors (when the poses are similar, triangulation lines tend to intersect at infinity because they are close to parallel). As stated previously, the minimisation algorithm requires an initial estimate of the calibration parameters. Parameters $(u_p, v_p)$ were initialised with the 2D coordinates of the image centre, and $t$ and $R$ were roughly estimated in the following way:
– $t_x$ and $t_y$ were always initialised with zero (we assumed that the object was roughly centred in the radiograph, even if it was not);
– $t_z$ was initialised with $d_s + d_m$ when its centre was near the table; when the centre was farther away from the table due to its pose, half of the object width was subtracted;
– $\alpha$, $\beta$, and $\gamma$ were roughly provided on a 10° resolution scale.
Figure 3 shows three examples of the initial guesses for three radiographs, and Table 1 summarises the errors for all of them. These errors were calculated by comparing the initial guess for each radiograph with the parameters that achieved optimal solutions (minimum projection error) when projecting a 3D model of the phantom onto the corresponding radiograph. As stated in the previous section, the calibration process tries to minimise the retro-projection errors of a set of point matches marked in a pair of images.
¹ http://www.vision.caltech.edu/bouguetj/calib_doc/ (accessed Dec-2007)
² http://www.intel.com/technology/computing/opencv/ (accessed Dec-2007)
Fig. 3. Example of three radiographs of the phantom and the initial guesses for $(t_x, t_y, t_z)$ in mm and $(\alpha, \beta, \gamma)$ in degrees: (0, 0, 1183) with (0°, 0°, 0°); (0, 0, 993) with (60°, 0°, 0°); and (0, 0, 993) with (0°, −40°, 10°)

Table 1. Absolute errors for position $(t_x, t_y, t_z)$ and orientation $(\alpha, \beta, \gamma)$ of the initial guess for each radiograph

       tx (mm)   ty (mm)   tz (mm)   α (deg)   β (deg)   γ (deg)
Mean   17.7      21.0      21.9      0.9       1.7       2.4
SD     7.2       13.4      21.6      1.6       1.6       2.5
Max    30.5      43.2      58.6      3.8       3.9       6.9
We used the corners of the squares of the phantom that were visible in both images, which on average yielded 199 point matches. These points were semi-automatically extracted using the previously described procedure. The full set of point matches was used in the first two experiments for evaluating the method with maximal input. In addition, a third experiment was conducted to evaluate the effect of using smaller sets of points. The optimisation algorithm used was the Matlab large-scale algorithm for nonlinear least-squares problems, which showed superior performance compared to the conventional approach of using the Levenberg-Marquardt algorithm (e.g., some of the works that use this latter approach: [7,8,9,10]). For simulating the calibration object that would be utilised for determining the scaling factor, reference distances of 40mm were used, which correspond to two consecutive squares of the phantom. For the first experiment we tested scaling with 50 reference distances uniformly distributed over the part of the phantom that was visible in both radiographs. First, the 3D coordinates of every point match were calculated using the optimised parameters. Then, the 3D points were scaled (using one reference distance) and aligned with a 3D model of the phantom. Finally, 3D errors were computed as the Euclidean distances between the calculated points and the corresponding points in the model. Figure 4(a) shows a histogram of the errors for the complete experiment. The RMS error was 0.36mm and 99% of the errors were below 0.85mm. A second experiment was carried out to test the proposed procedure under less optimal conditions of point-match identification. This was done by adding uniformly distributed noise to the 2D coordinates of the phantom's corners that
Fig. 4. Histogram of the 3D errors of the phantom reconstruction (a), and 3D errors vs. noise in landmarks location (b). [Panel (a): error count vs. 3D absolute error (mm); mean = 0.31mm, RMS = 0.36mm, mean + SD = 0.49mm, mean + 2SD = 0.66mm. Panel (b): 3D error (mm), mean ± 2SD, vs. 2D coordinates noise (pixels).]

Table 2. 3D reconstruction error vs. number of point matches and noise in point matches location

Number of   3D RMS reconstruction errors (mm)
matches     No noise   ±5 pixels noise   ±10 pixels noise
199         0.36       0.93              1.76
67          0.36       0.98              1.80
33          0.38       0.98              1.86
23          0.39       1.21              2.10
were previously extracted using the Camera Calibration Toolbox for Matlab and OpenCV. Then, the previous experiment was repeated (17 pairs of images × 50 reference distances), starting with no noise and adding up to ±15 pixels to each coordinate of every point. We decided not to add noise to the two points that defined the reference distance because in real cases these points would be represented by two radiopaque objects that would be easy to identify accurately. Results are presented in Figure 4(b), showing the relation between pixel noise and mean 3D error. A final experiment was carried out to determine whether the method was capable of achieving good results with smaller sets of points. We tested sets of 67, 33 and 23 average point matches, which are in the range of values typically used in this kind of procedure. For each set, the first experiment was repeated with no noise added to the point matches, and then with uniformly distributed noise of ±5 and ±10 pixels. Table 2 shows the results for this experiment.
4 Conclusions
This paper presents a method for bi-planar radiography calibration that uses a low-cost rangefinder to narrow the search space of solutions. This enhancement makes it possible to improve the calibration performance of conventional x-ray imaging systems without affecting radiographs and with minor inconvenience for patients and technicians. Results show that this method is robust and offers sub-millimetric accuracy even when the initial guess of the calibration parameters is rather rough.

Results also show that the quality of the calibration depends on the quality of the point-match identification in radiographs. However, this dependence is close to linear for the tested range of noise, and when uniformly distributed noise of ±5 pixels is added to each coordinate of every point, the RMS error remains below 1mm. This demonstrates the robustness of the proposed method, since it achieves acceptable errors even when using a pessimistic distribution of noise where all values are equiprobable. Additionally, when decreasing the average number of point matches to 33, the method shows a very small increase of the error, enabling almost the same results to be achieved with much less input. This number of point matches is inside the range of typical values used in other calibration procedures of this kind.

For achieving these results a small calibration object is needed (we used a 40mm reference distance in our experiments). Its only role is determining the scaling factor that should be applied for obtaining real-world units. Thus, this object may be discarded if the goal is only to obtain shape, or angular and relative measurements. When compared to other works [6,7,8], the proposed calibration object has less impact on the content of radiographs (only two radiopaque marks are needed). Results may not yet be compared because the other studies use anatomical structures, either in vivo or in vitro, and therefore score higher errors.

A possible disadvantage of this technique is that it requires an estimation of the rotation and translation parameters. This may not be a problem for some kinds of examinations, such as spine radiography, where frontal and lateral radiographs are usually acquired. In these cases an initial estimation is very simple to obtain because the patient is typically at 0° rotation for all axes in the first radiograph and then, for the second radiograph, he/she only experiences a 90° rotation around one of the axes. Either way, for best results the two acquisitions should not be taken with near poses, to prevent triangulation errors, which is the case for spine radiographs. Ongoing experiments with simulation of spine radiography using in vivo data show that the length of the calibration object should be higher (10–12cm), but reconstruction errors remain lower than those published for in vivo and in vitro studies.
Acknowledgements. The first author thanks Fundação para a Ciência e a Tecnologia for his PhD scholarship (SFRH/BD/31449/2006). The authors would also like to express their gratitude to Instituto de Neurociências and to SMIC.
References

1. Mitulescu, A., Skalli, W., Mitton, D., De Guise, J.: Three-dimensional surface rendering reconstruction of scoliotic vertebrae using a non stereo-corresponding points technique. European Spine Journal 11, 344–352 (2002)
2. Pomero, V., Mitton, D., Laporte, S., de Guise, J.A., Skalli, W.: Fast accurate stereoradiographic 3d-reconstruction of the spine using a combined geometric and statistic model. Clinical Biomechanics 19, 240–247 (2004)
3. Mitton, D., Deschênes, S., Laporte, S., Godbout, B., Bertrand, S., de Guise, J.A., Skalli, W.: 3d reconstruction of the pelvis from bi-planar radiography. Computer Methods in Biomechanics & Biomedical Engineering 9, 1–5 (2006)
4. Laporte, S., Skalli, W., de Guise, J., Lavaste, F., Mitton, D.: A biplanar reconstruction method based on 2d and 3d contours: Application to the distal femur. Computer Methods in Biomechanics & Biomedical Engineering 6, 1–6 (2003)
5. Baudoin, A., Skalli, W., Mitton, D.: Parametric subject-specific model for in vivo 3d reconstruction using bi-planar x-rays: Application to the upper femoral extremity. Comput.-Assisted Radiol. Surg. 2, S112–S114 (2007)
6. Cheriet, F., Delorme, S., Dansereau, J., Aubin, C., De Guise, J., Labelle, H.: Intraoperative 3d reconstruction of the scoliotic spine from radiographs. Annales de Chirurgie 53, 808–815 (1999)
7. Kadoury, S., Cheriet, F., Laporte, C., Labelle, H.: A versatile 3d reconstruction system of the spine and pelvis for clinical assessment of spinal deformities. Medical & Biological Engineering & Computing 45, 591–602 (2007)
8. Cheriet, F., Laporte, C., Kadoury, S., Labelle, H., Dansereau, J.: A novel system for the 3-d reconstruction of the human spine and rib cage from biplanar x-ray images. IEEE Transactions on Biomedical Engineering 54, 1356–1358 (2007)
9. Cheriet, F., Dansereau, J., Petit, Y., Aubin, C., Labelle, H., De Guise, J.A.: Towards the self-calibration of a multiview radiographic imaging system for the 3d reconstruction of the human spine and rib cage. International Journal of Pattern Recognition & Artificial Intelligence 13, 761–779 (1999)
10. Kadoury, S., Cheriet, F., Dansereau, J., Labelle, H.: Three-dimensional reconstruction of the scoliotic spine and pelvis from uncalibrated biplanar x-ray images. Journal of Spinal Disorders & Techniques 20, 160–167 (2007)
Identification of Cell Nucleus Using a Mumford-Shah Ellipse Detector

Choon Kong Yap and Hwee Kuan Lee

Bioinformatics Institute, #07-01, Matrix, 30 Biopolis Street, Singapore 138671
Abstract. Detection of the cell nucleus is critical in microscopy image analysis, and ellipse detection plays an important role because most nuclei are elliptical in shape. We developed an ellipse detection algorithm based on the Mumford-Shah model that inherits its superior properties. In our ellipse detector, the active contours in the Mumford-Shah model are constrained to be non-overlapping ellipses. A quantitative comparison with the randomized Hough transform shows that the Mumford-Shah based approach detects nuclei significantly better on our data sets.
1 Introduction
The emphasis of biological science is shifting from qualitative descriptions to accurate quantitative measurements, either directly from chemical assays or from microscopy images. In particular, cell culture, cell transfection techniques and robotic stages have enabled high-throughput image screens for drugs and gene candidates, where thousands of images are acquired and analyzed. Ellipse detection is a particularly important step in image analysis of cells because the nuclei of cells are usually elliptical in shape. Detecting the nucleus accurately is extremely important as the nucleus provides a reference point with respect to which different components and protein expressions in cells can be identified. Much work has been done on nucleus detection; segmentation using active contours [1,2,3] and detection using edge maps [4,5,6,7] are two common approaches. In this paper, we propose to use the a-priori knowledge that nuclei are elliptical in shape and to detect ellipses instead of the cell nucleus directly. Hence we map the problem of nucleus detection into an ellipse detection problem. Accurate nucleus outlines can then be elucidated with snake models using the detected ellipse outlines as starting conditions. Our ellipse detector for cell nuclei is designed to work with high noise, low contrast and cluttered environments. Our method capitalizes on the strength of the Mumford-Shah model [8], which is a highly successful model for segmentation of noisy images and illusive objects. We modify the model to segment only elliptical objects. In the proposed method, no edge map and no preprocessing are required. Noise is averaged out through the Mumford-Shah energy functional.
2 Related Work
Several nucleus detectors have been developed based on image edges [4,5,6,7]. The drawback of these methods is the need for good elliptical edges. Once a good edge map is obtained, many generic ellipse detection methods would be able to detect the cell nuclei. Another popular approach to nucleus detection is the use of active contours [1,2,3]. However, these methods require good estimates of the nucleus as initial conditions. Our proposed approach overcomes the drawbacks of both edge-based and active-contour nucleus detectors. Firstly, being region based, our method does not need an edge map. When the nuclei are occluded or clustered together, the edge map does not produce the elliptical shapes of the nuclei. When the noise level is high, preprocessing such as denoising and removal of spurious edges is needed. Secondly, our proposed method uses a stochastic optimization routine that requires no initial conditions. Finally, active contours can be used as a second step to refine the nucleus segmentation once our proposed method detects the nuclei to a reasonable accuracy. Among the most popular ellipse detectors are those based on the Hough transform [9,10,11,12,13]. There are also several investigations using symmetry properties to obtain more globally optimal detection of ellipses [14,15,16]. The use of symmetry reduces the effects of noise and the dimension of the search space. Another popular approach maps the ellipse detection problem into an optimization problem and uses a genetic algorithm to search for the optimal ellipse parameters [5,17,18]. Lastly, detecting ellipses using statistical analysis [19,20,21] has also received a considerable amount of attention. However, all these ellipse detectors depend on a good edge map, whereas our approach does not require any edge detection. Another related area of research is segmentation with shape priors [22,23,24] on the Mumford-Shah model. Such work could potentially be used for nucleus detection. Although our proposed work also uses the Mumford-Shah model, it is distinctively different from studies involving shape priors on the Mumford-Shah model for the following reasons.

1. Multiple objects: The number of ellipses is automatically detected by our algorithm. In [22], only one object can be detected at any one time, and in [24] the number of objects needs to be known a-priori.
2. Initial conditions: Our ellipse detector does not require any initial conditions. Starting with an image with no ellipse, ellipses are subsequently detected and inserted. Shape prior approaches require a reasonable initial guess for the level set functions.
3. Multi-phase: Our method is truly multi-phased; the number of phases is equal to the number of ellipses detected plus one (for the background). The methods described in [22,23] are restricted to two-phase segmentation.
4. Location and arrangements: Our method detects ellipses that are scattered in arbitrary ways. The method described in [23] requires that the shape prior be placed exactly at the locations of the desired objects.
3 Model
We make use of the Mumford-Shah model [8] to detect ellipses, where the active contours of the original model are constrained to be non-overlapping ellipses. Detected ellipses are treated as the foreground and the rest of the image is considered to be the background. The ellipses and their edge regions are defined by a convolution of characteristic functions $\chi_i$ with a square window $\omega$ of size $3 \times 3$, $\varphi_i(x, y) = \chi_i(x, y) * \omega(x, y)$, $i \in \{1, \ldots, N\}$, where $N$ is the total number of detected ellipses and the characteristic function $\chi_i(x, y) = 1$ if

$$\frac{[(x - h_i)\cos\theta_i + (y - k_i)\sin\theta_i]^2}{a_i^2} + \frac{[-(x - h_i)\sin\theta_i + (y - k_i)\cos\theta_i]^2}{b_i^2} \le 1 \qquad (1)$$

and $\chi_i(x, y) = 0$ otherwise. The variables $e_i = (h_i, k_i, a_i, b_i, \theta_i)$ specify the location $(h_i, k_i)$, size $(a_i, b_i)$ and orientation $\theta_i$ of the $i$th ellipse. The region of the $i$th ellipse is identified by $\varphi_i(x, y) = 1$ and its edge region is identified by $0 < \varphi_i(x, y) < 1$. Hence the background region is identified by $\sum_i \varphi_i(x, y) = 0$. The number of ellipses detected is to be automatically deduced by our algorithm. With the regions defined, the Mumford-Shah objective function is

$$E_{\nu,\lambda} = \sum_{i=1}^{N} \int_{\varphi_i = 1} [u(x, y) - c_i]^2 \, dx\,dy \; + \; \nu \sum_{i=1}^{N} \int_{0 < \varphi_i < 1} [u(x, y) - c_i \varphi_i]^2 \, dx\,dy \; + \; \lambda \int_{\sum_i \varphi_i = 0} [u(x, y) - c_0]^2 \, dx\,dy \qquad (2)$$

where $N$ is the total number of ellipses detected, $u(x, y)$ is the input image, $\nu$ and $\lambda$ are adjustable parameters, and $c_i$ and $c_0$ are constants defined as

$$c_0 = \frac{\int_{\sum_i \varphi_i = 0} u(x, y) \, dx\,dy}{\int_{\sum_i \varphi_i = 0} dx\,dy}, \qquad c_i = \begin{cases} \dfrac{\int_{\varphi_i > 0} u(x, y)\,\varphi_i(x, y) \, dx\,dy}{\int_{\varphi_i > 0} \varphi_i(x, y) \, dx\,dy} & \text{if } c_i > c_0 \\ \infty & \text{otherwise.} \end{cases}$$
The first term in Eq. (2) corresponds to the ellipse foreground. The second term specifies the "fitness" of a region to an ellipse; this term is small if a bright region in the image has edges that are well fitted by an ellipse. The third term consists of the background contributions. The $c_i$ are constants defined as the mean intensity of pixels within the ellipse if it is brighter than the background, and infinitely large otherwise. This effectively restricts detected ellipses to be brighter than the background. The edge weight ($\nu > 0$) and background weight ($\lambda > 0$) serve as regularizing factors. A smaller background weight results in only the brightest ellipses being detected, while the edge weight controls how well a bright region must be fitted by an ellipse. Ellipses are not allowed to overlap.
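A NumPy/SciPy sketch of the constrained energy is shown below, assuming non-overlap is enforced by the caller; the $3 \times 3$ window $\omega$ is taken to be a normalized box filter, and the function names are ours rather than the authors'.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ellipse_indicator(shape, h, k, a, b, theta):
    """Characteristic function chi_i of Eq. (1) sampled on the pixel grid."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    xr = (x - h) * np.cos(theta) + (y - k) * np.sin(theta)
    yr = -(x - h) * np.sin(theta) + (y - k) * np.cos(theta)
    return ((xr / a) ** 2 + (yr / b) ** 2 <= 1.0).astype(float)

def mumford_shah_energy(u, ellipses, nu, lam):
    """Eq. (2) for a list of (h, k, a, b, theta) ellipse parameters."""
    phis = [uniform_filter(ellipse_indicator(u.shape, *e), size=3) for e in ellipses]
    if phis:
        background = np.sum(phis, axis=0) == 0
    else:
        background = np.ones(u.shape, dtype=bool)
    c0 = u[background].mean()
    energy = lam * np.sum((u[background] - c0) ** 2)
    for phi in phis:
        support = phi > 0
        ci = np.sum(u[support] * phi[support]) / np.sum(phi[support])
        if ci <= c0:                   # ellipses must be brighter than the background
            return np.inf
        inside, edge = phi == 1, (phi > 0) & (phi < 1)
        energy += np.sum((u[inside] - ci) ** 2)
        energy += nu * np.sum((u[edge] - ci * phi[edge]) ** 2)
    return energy
```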
Fig. 1. Inserting Ellipse Step (a) Original image (b) First ISODATA (c) After removing small connected components and ISODATA on large connected components again (d) Perimeter edge (e) Total pool of ellipses estimated through Hough transform (Ellipses in bold are of highest Hough score) (f) Ellipses found by simplex and Monte Carlo insertion step
The problem has now become finding the most appropriate $5N$ parameters, $e_i = (h_i, k_i, a_i, b_i, \theta_i)$ for $i = 1, \cdots, N$, such that the Mumford-Shah objective function is minimized, subject to the constraint of non-overlapping ellipses. The number of ellipses $N$ for a given image is unknown and is found automatically by the algorithm. At this point, we would like to highlight that the number of phases in the Mumford-Shah model is equal to $N + 1$, which is also determined dynamically by the algorithm.
4 Optimization
We employ the basin hopping Monte Carlo [25,26] method to search for the globally optimized solution of the Mumford-Shah model. Starting from an image with no ellipse, the algorithm iterates through three hopping steps, insert, move and split, until a predetermined number of iterations is reached. In these Monte Carlo iterations, any step that results in decreasing the Mumford-Shah energy (Eq. (2)) is accepted; otherwise it is rejected.

Inserting ellipse: The background of the image first undergoes ISODATA [27] global thresholding to isolate bright regions in the background. The output of ISODATA is a set of connected components of various sizes. Tiny connected components smaller than a predetermined minimum ellipse size are mostly due to noise and are removed. Connected components larger than a predetermined maximum ellipse size undergo ISODATA again, to detect the areas that are "brighter than their surroundings". ISODATA stops when all connected components are smaller than the maximum ellipse size. Connected components are identified using 4-connected neighbors. Approximate locations of ellipses are then estimated
Fig. 2. Monte Carlo steps for moving (left) and splitting (right) ellipses. In the left figure, the dotted line represents the original ellipse position, the dashed line represents the ellipse after a random move, and the solid line is the final ellipse after simplex minimization.
based on the perimeters of the connected components using the Hough transform [9]. This approximation step acts as a rough estimation of the ellipses for the subsequent energy optimization process. Three ellipses are chosen for each perimeter. One of the ellipses carries the highest accumulated Hough score. The other ellipses are selected with probability given by $P_k = H_k / \sum_i H_i$, where $H_k$ is the Hough score for the $k$th ellipse. The total pool of selected ellipses is lined up in a random order. We propose to insert ellipses in this random order and apply simplex [28] minimization to each inserted ellipse using Eq. (2). An ellipse is inserted if the energy decreases due to the insertion; otherwise the ellipse is rejected. The sequence of events involved in the 'inserting ellipse' step is illustrated in Fig. 1.

Moving ellipse: We propose to move each ellipse by randomly changing its parameters $(\Delta h, \Delta k, \Delta a, \Delta b, \Delta\theta)$. $\Delta h$, $\Delta k$, $\Delta a$ and $\Delta b$ are drawn uniformly from the interval $[-(\bar{a} + \bar{b})/2, (\bar{a} + \bar{b})/2]$, where $\bar{a}$ and $\bar{b}$ are the averages of the minor and major axes of all ellipses. $\Delta\theta$ is drawn uniformly from the interval $[-1, 1]$ radians. We then perform simplex minimization starting with these new parameters using Eq. (2). The proposed move is accepted if it results in a decrease in the Mumford-Shah energy; otherwise the move is rejected and the ellipse maintains its current state. A schematic diagram of how this works is shown in Fig. 2. Each ellipse is proposed to move randomly 10 times in one Monte Carlo iteration.

Splitting ellipse: We propose to split each ellipse into two along its major axis. The point of splitting is randomly selected within a 10% offset from both ends (Fig. 2), and the two parts bear the same minor axis length. The ellipse thus forms two new ellipses due to splitting. The new ellipses undergo energy optimization simultaneously using simplex on all ten ellipse parameters $h_1, k_1, a_1, b_1, \theta_1, h_2, k_2, a_2, b_2, \theta_2$ of the two newly generated ellipses. The split is accepted if it results in a decrease in the Mumford-Shah energy; otherwise it is rejected and the ellipse maintains its unsplit state. Each ellipse is proposed to split 10 times in one Monte Carlo iteration.
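A sketch of the random perturbation used in the moving step is given below (the simplex refinement that follows is not shown); the RNG handling and the function name are implementation choices of the example.

```python
import numpy as np

def propose_move(ellipse, all_ellipses, rng=None):
    """Perturb (h, k, a, b, theta): dh, dk, da, db ~ U[-(abar+bbar)/2, (abar+bbar)/2],
    dtheta ~ U[-1, 1] radians, where abar, bbar are the average axes of all ellipses."""
    rng = rng or np.random.default_rng()
    h, k, a, b, theta = ellipse
    abar = np.mean([e[2] for e in all_ellipses])
    bbar = np.mean([e[3] for e in all_ellipses])
    half_range = (abar + bbar) / 2.0
    dh, dk, da, db = rng.uniform(-half_range, half_range, size=4)
    dtheta = rng.uniform(-1.0, 1.0)
    return (h + dh, k + dk, a + da, b + db, theta + dtheta)
```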
4.1 Algorithm
Our algorithm can be summarized as follows (a code skeleton is given after this list):

1. Start with an image with no ellipse and assign all pixels as belonging to the background. Calculate the Mumford-Shah energy $E^{curr}_{\nu,\lambda}$.
2. Inserting ellipses:
   (a) Apply ISODATA to the image background and generate a pool of estimated ellipses that are lined up in a random order.
   (b) For each ellipse, perform simplex optimization on the ellipse parameters and calculate the optimized energy $E_{\nu,\lambda}$. Insert this ellipse if $E_{\nu,\lambda} < E^{curr}_{\nu,\lambda}$ and set $E^{curr}_{\nu,\lambda} \leftarrow E_{\nu,\lambda}$.
3. Moving ellipses:
   (a) Propose to change the ellipse parameters randomly and perform simplex minimization to calculate the optimized energy $E_{\nu,\lambda}$. Accept the proposed move if $E_{\nu,\lambda} < E^{curr}_{\nu,\lambda}$ and set $E^{curr}_{\nu,\lambda} \leftarrow E_{\nu,\lambda}$.
   (b) Repeat step (3a) 10 times for each ellipse.
4. Splitting ellipses:
   (a) For each ellipse, propose to split it along the major axis to generate two new ellipses and perform simplex optimization to obtain the optimized energy $E_{\nu,\lambda}$. Accept the split step if $E_{\nu,\lambda} < E^{curr}_{\nu,\lambda}$ and set $E^{curr}_{\nu,\lambda} \leftarrow E_{\nu,\lambda}$.
   (b) Repeat step (4a) 10 times for each ellipse.
5. Repeat steps (2)-(4) until a predetermined number of Monte Carlo steps is reached.

We set the number of Monte Carlo steps to 10 ∼ 30 for all calculations in this paper. Calculations of the Mumford-Shah energy take up much of the computational time. Using a bounding box, we can restrict the energy calculation to within this local region instead of integrating over the whole image. This procedure increases the computational efficiency of the energy calculation by about 10 fold. Depending on the number of ellipses detected, the total calculation time for a 128 × 128 image ranges from 1 to 10 minutes on a 3GHz Pentium IV processor.
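The following compact skeleton sketches the accept-only-if-energy-decreases loop above; `propose_insertions`, `propose_move` and `propose_split` stand in for the ISODATA/Hough seeding, the random perturbation with simplex refinement, and the major-axis split, and `energy` for Eq. (2). These callbacks and their signatures are assumptions made for the example.

```python
def detect_ellipses(u, nu, lam, propose_insertions, propose_move, propose_split,
                    energy, n_monte_carlo=30, n_tries=10):
    """Basin-hopping skeleton: every proposal is accepted only if it lowers Eq. (2)."""
    ellipses = []                                   # step 1: start with no ellipse
    e_curr = energy(u, ellipses, nu, lam)

    for _ in range(n_monte_carlo):
        # step 2: propose insertions from the ISODATA/Hough candidate pool
        for candidate in propose_insertions(u, ellipses):
            e_new = energy(u, ellipses + [candidate], nu, lam)
            if e_new < e_curr:
                ellipses, e_curr = ellipses + [candidate], e_new
        # step 3: propose to move each ellipse n_tries times
        for i in range(len(ellipses)):
            for _ in range(n_tries):
                trial = ellipses[:i] + [propose_move(ellipses[i], ellipses)] + ellipses[i + 1:]
                e_new = energy(u, trial, nu, lam)
                if e_new < e_curr:
                    ellipses, e_curr = trial, e_new
        # step 4: propose to split each ellipse n_tries times
        i = 0
        while i < len(ellipses):
            for _ in range(n_tries):
                trial = ellipses[:i] + propose_split(ellipses[i]) + ellipses[i + 1:]
                e_new = energy(u, trial, nu, lam)
                if e_new < e_curr:
                    ellipses, e_curr = trial, e_new
                    break                           # this ellipse was replaced by two
            i += 1
    return ellipses
```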
5 Results and Discussion
We evaluate our method against the randomized Hough transform [10,11]. The Hough transform is one of the most well-known algorithms in object detection, and the randomized Hough transform is a popular improvement of the Hough transform. This makes it a good benchmark for evaluating our method. As a supplement, to verify that active contours cannot detect nuclei that touch each other, we also benchmark our validation against the Chan-Vese algorithm [29]. We manually label the ellipses in our data sets and score the ellipse detectors based on the manually labeled ground truth. To find the discrepancies of our ellipse detector from the manually labeled ground truth, a matching table is constructed with matrix elements given by

$$s_{ij} = \frac{|e_i \cap \tilde{e}_j|}{|e_i \cup \tilde{e}_j|}, \qquad (3)$$
Fig. 3. Detection of ellipses for red blood cells: (a) our method, (b) edge map, (c) randomized Hough transform. The imperfect edge map (b) causes the randomized Hough transform method (c) to fail. Parameters for our ellipse detector are λ = 0.08, ν = 0.208.

Table 1. Performance of the Mumford-Shah model and RHT. Denoising using a Wiener filter is applied before the edge detection for RHT. No denoising step is needed when detecting ellipses using our proposed method. For image (a) ν = 1, λ = 0.85, (b) ν = 0.4, λ = 0.4 and (c) ν = 0.4, λ = 0.4. Parameters for RHT are adjusted to give the best possible results.

                                (a)     (b)     (c)
Our Method (accuracy)           0.712   0.530   0.670
Randomized Hough Transform      0.702   0.470   0.592
where $e_i$ is the set of pixels belonging to the $i$th detected ellipse and $\tilde{e}_j$ is the set of pixels belonging to the $j$th ellipse in the labeled image. $s_{ij}$ is the ratio of intersection area divided by the area of the union. Our matching score for each image is then defined as

$$\text{accuracy} = \frac{1}{n} \max_{M \in \mathcal{M}} \sum_{(i,j) \in M} s_{ij} \qquad (4)$$
Table 2. Comparison of our method with randomized Hough transform. In this data set, ν = 0.35 and λ = 0.15. Parameters for RHT are adjusted to get the best possible results. The bottom row includes one segmented image using the Chan-Vese algorithm to illustrate that the standard level set approach does not identify ellipses correctly.

Image    Our Method   Rand. Hough Trans.
1        0.78         0.59
2        0.77         0.63
3        0.77         0.63
4        0.73         0.49
5        0.73         0.60
6        0.64         0.59
7        0.60         0.43
8        0.55         0.36
9        0.82         0.43
(Chan-Vese, bottom row: 0.29)
$\mathcal{M}$ is the set of all one-to-one mappings between detected ellipses and manually labeled ellipses. $n$ is a normalizing factor defined by the maximum of the number of detected ellipses and the number of labeled ellipses. It penalizes algorithms that generate many false positives and false negatives. The accuracy score ranges from 0 to 1 and equals one for a perfect match.
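A sketch of the scoring is shown below, using SciPy's Hungarian solver to find the best one-to-one mapping; the paper only specifies a maximisation over one-to-one mappings, so the choice of assignment algorithm and the mask-based interface are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def accuracy_score(detected_masks, labeled_masks):
    """Eqs. (3)-(4): best-matched overlap ratios, normalised by max(#detected, #labeled)."""
    n_d, n_l = len(detected_masks), len(labeled_masks)
    if n_d == 0 or n_l == 0:
        return 0.0
    s = np.zeros((n_d, n_l))
    for i, e in enumerate(detected_masks):
        for j, t in enumerate(labeled_masks):
            union = np.logical_or(e, t).sum()
            s[i, j] = np.logical_and(e, t).sum() / union if union > 0 else 0.0
    rows, cols = linear_sum_assignment(-s)   # maximise the total s_ij over a one-to-one mapping
    return s[rows, cols].sum() / max(n_d, n_l)
```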
We modify the randomized Hough transform (RHT) code downloaded from Inverso's website¹ to allow for ellipse rotation. Detailed information on these parameters is available in Inverso's program manual. To compare with the best of what RHT can do, we improve the RHT significantly by extensive preprocessing of the image to get as good an edge map as possible. The images are first denoised using a Wiener filter with the Matlab function wiener2, followed by a Canny edge detector with the Matlab function canny (with default parameters) to detect edges. Spurious edges, such as edges with fewer than 10 pixels and straight edges, are removed. The Matlab function for blob analysis (regionprops) is used to remove all edges with eccentricity greater than 0.95. Lastly, we adjusted the parameters for RHT to get the best possible results. In spite of optimizing the RHT extensively, our method still performs better.

Fig. 3 shows qualitatively that the proposed method performs better than the randomized Hough transform (RHT). The image consists of red blood cells that are adjacent to each other. The edge map does not represent the outline of the cells well. Hence, RHT is unable to identify the cells. In contrast, our method is region based and is able to capture information within the whole area of the cells². Table 1 shows the results of ellipse detection on breast cancer cells. These images are noisy, especially so for image (c). No preprocessing is required for our method; however, denoising is an important step for RHT. Nevertheless, our method achieved higher accuracy than RHT in all three images.

We next perform a more quantitative comparison of our method, RHT and Chan-Vese's implementation of active contours. We use a data set consisting of 33 images containing 320 nuclei. In this data set, nuclei touch each other and the intensity of the nuclei is non-uniform. We chose this data set for the following reasons:
1. Images of this nature are ubiquitous in bioimaging.
2. The multi-phase property of our model will be tested since the nuclei are of non-uniform intensity.
3. The ability to separate touching ellipses with the edge term in our modified Mumford-Shah model will be tested since many nuclei touch each other.
To test the robustness of our model, we use only one set of parameters, i.e. ν = 0.35 and λ = 0.15, for all images in the data set. Robustness is an important property, especially in the case of high-throughput screens with thousands of images. Table 2 shows nine images from this data set chosen uniformly among images that produce the best and the worst accuracies. Our model can detect most nuclei accurately except for a few very dim ones. RHT did not perform as well in spite of extensive preprocessing. Table 2 also shows one result from the Chan-Vese method to verify that the standard active contours method does not work for nucleus detection. Fig. 4 shows a plot of accuracies for all three approaches. For clarity, we sort the accuracy scores of our method (black circles)
RHT code and manual obtained from: http://www.saminverso.com/res/vision/ The image is downloaded from www.epitomics.com/images/products/1670ICC.jpg
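For readers who want to reproduce this preprocessing outside MATLAB, the sketch below is a rough Python/scikit-image analogue of the pipeline described above (Wiener denoising, Canny edge detection, and removal of short or nearly straight edge segments). The pixel-count and eccentricity thresholds follow the text; the Wiener window size is an assumption, since the text does not state it.

from scipy.signal import wiener
from skimage.feature import canny
from skimage.measure import label, regionprops

def preprocess_for_rht(image, wiener_size=5, min_pixels=10, max_eccentricity=0.95):
    # Denoise, detect edges, and drop short or nearly straight edge segments.
    denoised = wiener(image.astype(float), mysize=wiener_size)  # window size is an assumption
    edges = canny(denoised)  # default parameters, as in the text
    cleaned = edges.copy()
    for region in regionprops(label(edges, connectivity=2)):
        # Remove segments with fewer than 10 pixels or eccentricity above 0.95.
        if region.area < min_pixels or region.eccentricity > max_eccentricity:
            cleaned[region.coords[:, 0], region.coords[:, 1]] = False
    return cleaned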
[Fig. 4 plot: accuracy (0 to 1) versus image ID (0 to 30) for our method and RHT.]
Fig. 4. Ellipse detection accuracy for all 33 images in the data set. For clarity, the scores for our method are sorted. Our method (filled circles) scored much better in many images than the randomized Hough transform (squares) and Chan-Vese (cross). The randomized Hough transform (RHT) scored slightly better in four images. The image indicated by a blue oval is shown in the top left-hand corner. Visually, our ellipse detector produces results similar to RHT. Arrows indicate the scores for the images in Table 2.
in descending order. Our method performs much better for most of the images, and RHT (squares) performs slightly better than our approach in four out of 33 images. Active contours (cross) failed to achieve good results for all images. The inset shows the image corresponding to the data point marked with an ellipse. RHT performs better for this data point, but visually, our ellipse detector produces results similar to RHT. The vertical arrows indicate the corresponding data points for images shown in Table 2.
6 Conclusion
We developed an ellipse detection method for detecting objects in biological images. Our ellipse detector capitalizes on the strengths of the Mumford-Shah model. Hence, our ellipse detector does not need any preprocessing, even for very noisy images and cluttered environments. Quantitative comparisons with the randomized Hough transform (even with extensive preprocessing) show that the performance of our method is much better. Furthermore, our ellipse detector is robust to noise because of the averaging effect of the integral in the Mumford-Shah
model. The Mumford-Shah model considers the whole image, including the background, as an input. A drawback of this method arises when there are only a few very small ellipses in a large background: the background term in the Mumford-Shah model will dominate, making ellipse detection more difficult. In reality, such a situation is infrequent for microscopy images. The computational speed of our ellipse detector does not pose serious problems in real applications, since cell counting in bioimaging does not require real-time computation. For our future work, the ellipse detector can be generalized to detect overlapping arbitrary shapes with a template using affine transforms. Lastly, we would like to thank Li Bin and Sohail Ahmed for providing us the image data set.
Evaluation of Brain MRI Alignment with the Robust Hausdorff Distance Measures
Andriy Fedorov1, Eric Billet1, Marcel Prastawa2, Guido Gerig2, Alireza Radmanesh3, Simon K. Warfield4, Ron Kikinis3, and Nikos Chrisochoides1
1 Center for Real-Time Computing, College of William and Mary, USA
2 Scientific Computing and Imaging Institute, University of Utah, USA
3 Surgical Planning Laboratory, Harvard Medical School, USA
4 Computational Radiology Laboratory, Harvard Medical School, USA
Abstract. We present a novel automated method for assessment of image alignment, applied to non-rigid registration of brain Magnetic Resonance Imaging (MRI) data for image-guided neurosurgery. We propose a number of robust modifications to the Hausdorff distance (HD) metric, and apply it to the edges recovered from the brain MRI to evaluate the accuracy of image alignment. The evaluation results on synthetic images, simulated tumor growth MRI, and real neurosurgery data with expert-identified anatomical landmarks confirm that the accuracy of alignment error estimation is improved compared to the conventional HD. The proposed approach can be used to increase confidence in the registration results, assist in registration parameter selection, and provide local estimates and visual assessment of the registration error.
1 Introduction
The objective of this work is the development of a novel metric for evaluating the results of pairwise mono-modal Non-Rigid Image Registration (NRR). An important feature of the proposed metric is the quantitative measure of the misalignment between the two images, with the goal of estimating the registration error. The specific application where we consider such a metric to be of particular importance is the non-rigid registration of brain Magnetic Resonance Imaging (MRI) data during image-guided neurosurgery. This work is motivated by the difficulty of selecting the optimum parameters for NRR and the lack of “ground truth” that can be used for intra-operative evaluation of the registration results during the course of the neurosurgery. Image registration for image-guided neurosurgery aims at the alignment of the high-quality pre-operative MRI data with the scans acquired intra-operatively (lower-quality images), for subsequent visualization of the registered data to assist with tumor targeting during the resection. Non-rigid registration [1] is essential for this application because the brain shift cannot be recovered accurately using rigid or affine transformations [2]. A number of methods for non-rigid
This work was supported in part by NSF grants CSI-0719929 and CNS-0312980, and by John Simon Guggenheim Memorial Foundation.
registration of brain MRI have been developed [3,4]. One of the challenges in NRR is the evaluation and validation of the registration results, e.g., see Christensen et al. [5]. Before a registration method is used in clinical studies, it can be validated on the cases, where the “ground truth” is available (e.g., phantoms, animal studies, cadavers, manually pre-labeled datasets) [1]. However, validation of the results obtained during the patient studies is quite limited. In the case of non-rigid registration of brain MRI, a widely accepted approach to validation requires an expert (ideally, a group of independent experts) to identify landmark points in the pairs of images before and after the registration, or outline the corresponding anatomical regions. The accuracy can then be assessed by the overlap of the corresponding regions and landmarks. Although this approach provides possibly the best precision, the accuracy estimation is available only at the landmark locations. The procedure is also very time-consuming, and is not practical when the results need to be validated within the short period of time (e.g., during the neurosurgery, when the time allowed for NRR is limited to 5-10 minutes), or when it is not feasible to have expert involved (e.g., when the results from a large number of different registration methods need to be compared, or while performing large-scale parametric studies). Because of the difficulties in performing intra-operative validation for patient studies, the results of NRR can be assessed by evaluating certain formalized metrics. Christensen et al. [5] summarize a number of such success criteria. However, not all of these criteria can be applied intra-operatively (e.g., the relative overlap metric requires segmentation of both images). Those criteria that can be evaluated intra-operatively (i.e., transformation transitivity and consistency metrics) cannot be used to estimate local alignment error. Finally, although the failure of NRR can sometimes be detected, there is no sufficient metric to conclude success of the NRR. This provides motivation for the development and study of new metrics for the NRR accuracy assessment. In this paper we consider the use of an image similarity metric, different from the one minimized during NRR, as the measure of image alignment to assess the NRR accuracy. The concept of performing NRR assessment in such a way was previously suggested in [2,6]. One of the common deficiencies of many image similarity metrics (e.g., Normalized Cross-Correlation, or Mutual Information) is that their value does not quantify the alignment error in terms of Euclidean distance. Thus, given the selected similarity metric is robust and reliable, one can track the improvement in image alignment, but cannot speculate about the degree of misalignment. Our approach to address this deficiency uses similarity metrics that are derived from the definition of the Hausdorff Distance (HD) [7]. The Hausdorff distance is a very common measure in pattern recognition and computer vision to measure mismatch between the two sets of points. A number of methods have been proposed to identify features (points, edges, lines) in medical images [8,9]. The resulting feature images can be used as input for the HD measure. The HD is not based on point correspondence, which makes it somewhat tolerant to the differences in the two sets of features compared. However, it is highly sensitive to noise. A large number of robust modifications
to the HD have been proposed to suppress the noise and improve robustness of the HD, since one of the first papers in this direction by Huttenlocher et al. [7]. The most recent surveys of the modifications to the HD are available in [10,11]. The value of the HD is derived from Euclidean distances between the two point sets. Nevertheless, the HD has found limited use as a measure for the evaluation of image alignment. Peng et al. [12] used a robust version of the HD to register outlines of brain in two dimensions. Morain-Nicolier et al. [13] applied the HD to quantify brain tumor evolution. Finally, Archip et al. [2,6] assess the performance of non-rigid image registration of brain MRI with the 95% partial HD, but do not discuss the reliability of this approach. To the best of our knowledge, the HD-based approach to image alignment assessment has not been comprehensively evaluated before. A number of the robust versions of the HD exist, but they have not been evaluated for 3d images and in the context of NRR for medical imaging. In this paper we evaluate the recent advances in the development of robust HD, and apply these techniques to evaluate the accuracy of pairwise alignment for brain MRI subject to nonrigid deformation. Based on the results of our evaluation, the presented approach significantly improves the accuracy of the previously used alignment evaluation metrics based on the conventional HD. The implementation of the presented approach is available as open source software, accompanied by the detailed description of the parameters we used to obtain the reported results on publicly available BrainWeb data [14].
2 Methods
The methods developed in this paper focus primarily on registration assessment for image guided neurosurgery. The objective of the NRR is to align the preoperative image with the intra-operative data. Consequently, the objective of the evaluation procedure is to confirm that alignment indeed improved following NRR, and quantify the level of mis-alignment before and after registration. The first image, which is called fixed image, is acquired intra-operatively, and shows the brain deformation. For the purposes of assessment, the second image can either be the floating image (pre-operative image rigidly aligned with the fixed image), or the registered image (the result of registering the floating and target images). By evaluating the alignment of fixed vs. floating and fixed vs. registered images we can assess the error of alignment before and after registration, respectively. However, the formulation of the problem remains the same. Given two images, I and J , the objective is to derive the point-wise alignment error. Let A and B be the binary images with the feature points extracted from I and J respectively, and A = {a1 , . . . , an } and B = {b1 , . . . , bm } be the sets of points that correspond to the non-zero voxels in A and B, respectively. Next we consider the sequence of the HD-based similarity measures with increasing robustness. The directed HD between the two sets of points h(A, B) is defined as the maximum distance from any of the points in the first set to the second set, and the HD between the two sets, denoted H(A, B), is the maximum of these distances [7]:
\[
  h(A, B) = \max_{a \in A} d(a, B), \qquad d(a, B) = \min_{b \in B} \lVert a - b \rVert,
  \qquad H(A, B) = \max\bigl( h(A, B),\, h(B, A) \bigr).
\]
In the case of perfect correspondence between the points in the sets A and B (i.e., point a in the set A corresponds to the same image feature as point b in the set B), H(A, B) would be the maximum (global) alignment error between the two images. This is the first problem in using the HD for alignment assessment, as it can only estimate the maximum error. The second problem comes from the sensitivity of the metric to noise and the lack of point correspondence: the estimated value of the error will not correspond to the maximum error in the general case. Simple versions of the robust HD measure were proposed to alleviate this problem. The partial Hausdorff distance, originally proposed by Huttenlocher et al. [7], is defined as a quantile of the ranked distances between the two point sets. Archip et al. [2,6] use the 95%-HD, which is defined as the 0.95-quantile partial distance between the two sets. However, the 95%-HD is a global measure and does not allow the error to be assessed locally without modifications to the calculation procedure. The local-distance map (LDMap), proposed by Baudrier et al. [11] for 2D images, extends the definition of the HD and allows a local measure of dissimilarity between the two binary images to be derived:
\[
  \forall x \in \mathbb{R}^3: \quad H_{loc}(x) = \lvert 1_{A(x)} - 1_{B(x)} \rvert \times \max\bigl( d(x, A),\, d(x, B) \bigr), \qquad (1)
\]
where A(x) is the voxel value at location x, and 1_{A(x)} is a function that has value 1 if A(x) is non-zero and 0 otherwise. H_{loc} is symmetric, and it is connected to the conventional HD definition by the relation H(A, B) = max(H_{loc}(A, B)) [11]. The advantage of H_{loc} (the LDMap) is that it can be used for localized estimation of the alignment error. Ideally, the value of H_{loc} should be the same as the distance between the corresponding points in the images. However, because no point correspondence is used in the HD definition, the values of H_{loc} will, in the general case, deviate significantly from the values of the alignment error. We attempt to add the notion of point correspondence to the definition of the LDMap by using the greyscale modification of the HD originally proposed by Zhao et al. [10] for matching 2D images corrupted by noise. We transform the input binary images, produced by the feature detection procedure, into greyscale images \tilde{A} and \tilde{B}. These greyscale images have the same size as the initial binary images, with each voxel initialized to the number of non-zero voxels in the neighborhood of the corresponding binary image pixel. A 2D example of greyscale image construction is shown in Figure 1. Let \tilde{B} be the set of points that correspond to non-zero pixels in B. The directed distance d(a_g, \tilde{B}), where g is the greyscale value at voxel a in the greyscale image computed from A, is now defined as the distance from a to the closest voxel in \tilde{B} whose greyscale value g' is within the tolerance t of g (we used t = 2):
\[
  d(a_g, \tilde{B}) = \min_{b_{g'} \in \tilde{B},\; g - t \le g' \le g + t} \lVert a_g - b_{g'} \rVert.
\]
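As a concrete illustration, the neighborhood counts used to build these greyscale images can be computed with a single convolution; the 3-voxel-wide neighborhood used below is an assumption, since the text does not specify the neighborhood size.

import numpy as np
from scipy.ndimage import convolve

def greyscale_counts(binary, size=3):
    # Each voxel holds the number of non-zero voxels of the binary feature
    # image inside a size-wide neighborhood around it (size is assumed).
    kernel = np.ones((size,) * binary.ndim, dtype=int)
    return convolve(binary.astype(int), kernel, mode="constant", cval=0)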
Fig. 1. Left: Binary image. Right: Corresponding greyscale image.
The greyscale local HD can now be computed based on this updated point distance definition:
\[
  \forall x \in \mathbb{R}^3: \quad GH_{loc}(x) = \lvert 1_{A(x)} - 1_{B(x)} \rvert \times \max\bigl( d(x_g, \tilde{A}),\, d(x_g, \tilde{B}) \bigr). \qquad (2)
\]
We define the greyscale Hausdorff distance (GHD) GH(A, B) between the two binary images as GH(A, B) = max(GHloc(A, B)). Additional processing of the greyscale HD can increase its robustness further. We define the robust greyscale HD locally based on least trimmed squares robust statistics [15] applied to the values of GHloc in the region around each feature point. The robust greyscale HD RGHloc(x) is calculated as the average of the ordered values of GHloc(x) in a fixed-size window centered at x, after discarding the top 20% of the distance values within this window (a trimmed mean). Similarly to the HD and GHD, we define the robust greyscale Hausdorff distance (RGHD) as RGH(A, B) = max(RGHloc(A, B)). Prior to edge detection, we smooth the input images using edge-preserving anisotropic diffusion (variance 1.0, conductance 0.5, time-step 0.0625), followed by adaptive contrast equalization [16]. Without such preprocessing the edges detected in the images can have very small overlap even with perfect alignment. Edge detection is done with the Canny edge detector [8]. We use adaptive selection of the edge detection thresholds, based on binary search, so that both images have a similar number of edges. The Insight Toolkit (ITK) [16] is used for all image processing operations. The reader is referred to [14] for the details of parameter selection and the open-source implementation of the presented technique.
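The following NumPy/SciPy sketch illustrates the two local measures: the LDMap of Eq. (1), computed from distance transforms, and a windowed trimmed-mean robustification in the spirit of RGHloc (the paper applies the trimming to GHloc; the window size and the restriction to non-zero map values are assumptions of this sketch).

import numpy as np
from scipy.ndimage import distance_transform_edt

def ldmap(A, B):
    # H_loc(x) = |1_A(x) - 1_B(x)| * max(d(x, A), d(x, B)) for binary feature volumes A, B.
    A, B = A.astype(bool), B.astype(bool)
    d_to_A = distance_transform_edt(~A)  # distance from each voxel to the nearest feature of A
    d_to_B = distance_transform_edt(~B)
    return np.where(A ^ B, np.maximum(d_to_A, d_to_B), 0.0)

def robust_local(local_map, features, window=5, trim=0.2):
    # Trimmed mean of the local map inside a cubic window around each feature voxel.
    r = window // 2
    out = np.zeros_like(local_map)
    for idx in np.argwhere(features):
        lo = np.maximum(idx - r, 0)
        hi = np.minimum(idx + r + 1, local_map.shape)
        block = local_map[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
        vals = np.sort(block[block > 0])
        if vals.size == 0:
            continue
        keep = vals[: max(1, int(np.ceil(vals.size * (1.0 - trim))))]  # drop the top 20%
        out[tuple(idx)] = keep.mean()
    return out

A global summary such as the 95% partial HD can then be taken as a high percentile of the non-zero LDMap values.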
3 Results
We evaluate the presented methods using three benchmarks: (1) synthetic nonrigid deformation; (2) synthetic tumor growth; (3) real data from image-guided neurosurgery with the expert-placed anatomical landmarks. In each case, the performance of an evaluation metric is measured as its ability to recover the deformation magnitude (thus, misalignment error value) at the locations of the image where such deformation is known (all points for benchmarks (1) and (2), and selected landmark points for (3)). The questions to answer are how the values of Hloc , GHloc and RGHloc locally compare to the ground truth alignment
[Fig. 2 histogram panels: error distribution, Hloc distribution, GHloc distribution, and RGHloc distribution, each plotted over 0-12 mm; numeric axis data omitted.]
Fig. 2. Distribution of the error and the HD, GHD, and RGHD values for the same synthetic deformation case (BrainWeb, sum of Gaussian deformations, variance 5)
error, and how the robust versions of the HD (GHD and RGHD) compare with the conventional HD. Synthetic nonrigid deformation. The BrainWeb MRI simulator1 was used to create two normal-subject T1 images (0% and 9% noise) with 1 mm slice thickness and 0% intensity non-uniformity. We applied synthetic deformation to the image, using the framework described in [17,14]. The deformation at points sampled on a regular grid is calculated as a sum of Gaussian kernels, and deformations at non-grid image locations are interpolated with thin-plate splines [16]. The magnitude of deformation can be controlled by changing the variance of the Gaussian kernel. Local alignment accuracy was estimated between the undeformed image with 0% noise and deformed images with 0% and 9% noise. The accuracy of alignment at an image location (error) is the magnitude of the synthetic deformation vector at that location. The distributions of the true error values and of Hloc, GHloc, and RGHloc are shown in Figure 2. Evidently, RGHloc is a significantly more accurate approximation of the error distribution. Robustness can also be compared by looking at the percentage of outliers within the local distance estimations, shown in Figure 3. We define an outlier as a local estimation that exceeds the true error value at a point by more than 2 mm (the deformation field is in physical space, while the HD estimation is in 1-mm voxel space, thus errors as large as √3 cannot be prevented). With increasingly larger deformations, the ratio of outliers is also increasing. The contribution of the outliers in the conventional HD increases rapidly for larger deformations. The robust metrics have far fewer outliers, which is reflected in the more stable behavior of GHD and RGHD in comparison to the HD: RGHD
http://www.bic.mni.mcgill.ca/brainweb/
[Fig. 3 plot data: left panel, error statistics (mm) for max error, mean error, HD, 95% HD, and 95% RGHD at 0% and 9% noise; right panel, outlier percentage for Hloc, GHloc, and RGHloc at 0% and 9% noise; both versus Gaussian variance (mm).]
Fig. 3. Left: Error statistics for synthetically deformed BrainWeb images with and without noise, and the derived values of the Hausdorff distance based estimations. Right: The change in the proportion of the outlier measurements depending on the Gaussian variance.
Fig. 4. Left: Synthetic tumor, case 2. Center: Deformation field produced by the tumor growth simulation (tumor mass effect and infiltration), colored by magnitude. Right: Edges recovered from the simulated tumor image. The same slice is shown in all images.
is consistently increasing as the alignment error increases, and it is always above the mean error value (see Figure 3, left). Thus, for large deformations (deformations as large as 10-15 mm have been reported during open skull craniotomy) RGHD is a more appropriate measure. Synthetic tumor growth. We used simulated brain tumor growth images to assess error estimation performance for more realistic deformation modes, and for images of different contrast. The images were created from the BrainWeb anatomical data as described in [18]. We used two versions of the simulated data: (1) with the intensity distribution close to that of the healthy subject image, and (2) with the intensity distribution derived from the real tumor data. Edge detection was done on the images with the regions corresponding to the tumor excluded. The misalignment was estimated between the healthy subject data and the image with the simulated tumor for the same subject at each feature point of the edge images. The recovered distances were compared with the true deformation magnitude from the tumor growth simulation (deformation field
Table 1. Outliers percentage for synthetic tumor growth data (only points corresponding to non-zero ground truth deformation are considered)
id | same contrast: Hloc / RGHloc | diff. contrast: Hloc / RGHloc | diff. contrast, enhanced: Hloc / RGHloc
1  | 7.9% / 4.6%   | 32.7% / 42.9% | 12.1% / 9.5%
2  | 21.7% / 16.4% | 30.3% / 34.1% | 20.2% / 15.4%
3  | 4.1% / 3.6%   | 34.5% / 44.7% | 10.4% / 7.1%
Table 2. Accuracy of error assessment for the tumor resection data in mm; empty entries correspond to image locations without edge features to assess error

landmark id | case 1: expert / Hloc / RGHloc | case 2: expert / Hloc / RGHloc | case 3: expert / Hloc / RGHloc
1  | 2.59 / – / –       | 2.51 / – / –       | 4.43 / 1 / 3.07
2  | 0.48 / – / –       | 1.52 / – / –       | 4.53 / 1.41 / 2.25
3  | 0.48 / 1 / 0.92    | 2.99 / 1.41 / 1.53 | 3.96 / – / –
4  | 0.48 / 1 / 1.2     | 1.36 / 1.73 / 1.16 | 2.15 / – / –
5  | 2.59 / 1 / 0.82    | 0.98 / – / –       | 2.88 / 2 / 2
6  | 1.07 / – / –       | 2.4 / – / –        | 3.66 / – / –
7  | 2.45 / – / –       | 2.04 / – / –       | 3.49 / 2.24 / 2.22
8  | 1.44 / 1 / 0.63    | 1.92 / 1 / 1.36    | 4.43 / 1.41 / 2.94
9  | 3.36 / 2.24 / 3.45 | 3.04 / – / –       | 3.96 / – / –
10 | 1.44 / 1 / 1.11    | 1.36 / 1 / 1.43    | 1.98 / 1.41 / 1.05
avg difference w.r.t. expert (Hloc / RGHloc): 0.77 / 0.69 | 0.81 / 0.57 | 2.04 / 1.37
being the sum of the tumor mass effect and infiltration induced deformations). The outlier statistics are summarized in Table 1. Case 2 was the most complex, with two infiltrating tumors of large volume located next to one another. Edge detection is particularly problematic in the edema region, which in this particular case extends to the majority of the deforming tissue region. This explains the large number of outliers for set 2. Figure 4 helps to appreciate the complexity of error recovery for tumor set 2: there are very few edges detected in the area of the deformation, and the tumor area is almost indistinguishable from the large edema region. Nevertheless, robust HD estimation consistently has fewer outliers than the HD. Neurosurgery registration data. We used three data sets from the public SPL repository of the tumor resection data2. An expert radiologist placed 10 corresponding anatomical landmarks in the pre- and intra-operative brain MRI T1 images. The error recovered using the HD-based techniques was compared with the expert-estimated error. The results are summarized in Table 2. On average, the RGHD measure shows better accuracy compared to the HD.
http://www.spl.harvard.edu/pages/Special:PubDB View?dspaceid=541
Fig. 5. Local estimation of misalignment using RGHD, all images show the same slice. Left: Undeformed image, BrainWeb. Center: Deformed image, Gaussian kernel variance 5 mm. Right: LDMap of the deformed and undeformed images, voxel values initialized to RGHDloc .
4 Conclusions
We have presented an HD-based approach to estimation of image alignment error. Based on the evaluation results, the RGHD measure we propose can be more robust in terms of outliers in local distance estimation, and thus can potentially improve the accuracy of the image alignment assessment. While our primary application is the assessment of the non-rigid registration results, validation of the proposed method itself on real neurosurgery data is complicated by the absence of ground truth. Nevertheless, it can be used to improve the confidence in registration results. The synthetic tumor growth data used in our evaluation may be more challenging than estimation of the pre-operative, intra-operative, and registered image alignment. In the latter case, the images have similar content: tumor and edema are present in both images, and the edges detected from those images are more similar. We show that RGHD improves error estimation accuracy locally for anatomical landmarks, thus we expect that globally RGHD is also more accurate on neurosurgery data than the HD measure. The evaluated techniques, and specifically RGHD – the most robust of the evaluated methods – can serve multiple purposes in registration assessment. First, they can be used as a global similarity metric between the two images, as well as for local alignment assessment. This mode of operation is particularly useful for automatic assessment of the non-rigid registration results during large-scale unsupervised parametric studies. Second, localized assessment of registration error can also be applied in conjunction with visual assessment to provide quantitative error measurements. An example is shown in Figure 5. We emphasize that the proposed method cannot substitute for validation studies. Instead, it can be used in conjunction with other accuracy assessment methods for patient studies, where accuracy is critical, processing time is highly limited and there are no means to compare the registration result with the ground truth. A promising area of our future work is the evaluation of the proposed measures in conjunction with consistency tests of the deformation fields obtained during the NRR, and the sensitivity of the measures to parameter selection of a specific NRR method.
References 1. Hill, D.L.G., Batchelor, P.G., Holden, M., Hawkes, D.J.: Medical image registration. Physics in Medicine and Biology 46, R1–R45 (2001) 2. Archip, N., Clatz, O., Whalen, S., Kacher, D., Fedorov, A., Kot, A., Chrisochoides, N., Jolesz, F., Golby, A., Black, P., Warfield, S.: Non-rigid alignment of preoperative MRI, fMRI, and DT-MRI with intra-operative MRI for enhanced visualization and navigation in image-guided neurosurgery. Neuroimage 35, 609–624 (2007) 3. Ferrant, M.: Physics-based Deformable Modeling of Volumes and Surfaces for Medical Image Registration, Segmentation and Visualization. PhD thesis, Universite Catholique de Louvain (2001) 4. Clatz, O., Delingette, H., Talos, I.F., Golby, A., Kikinis, R., Jolesz, F., Ayache, N., Warfield, S.: Robust non-rigid registration to capture brain shift from intraoperative MRI. IEEE Trans. Med. Imag. 24, 1417–1427 (2005) 5. Christensen, G.E., Geng, X., Kuhl, J.G., Bruss, J., Grabowski, T.J., Pirwani, I.A., Vannier, M.W., Allen, J.S., Damasio, H.: Introduction to the non-rigid image registration evaluation project (NIREP). In: Pluim, J.P.W., Likar, B., Gerritsen, F.A. (eds.) WBIR 2006. LNCS, vol. 4057, pp. 128–135. Springer, Heidelberg (2006) 6. Archip, N., Tatli, S., Morrison, P., Jolesz, F., Warfield, S.K., Silverman, S.: Nonrigid registration of pre-procedural MR images with intra-procedural unenhanced CT images for improved targeting of tumors during liver radiofrequency ablations. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 969–977. Springer, Heidelberg (2007) 7. Huttenlocher, D., Klanderman, D., Rucklidge, W.: Comparing images using the Hausdorff distance. IEEE Trans. Pat. Anal. and Mach. Intel. 15, 850–863 (1993) 8. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8, 679–698 (1986) 9. Lloyd, B., Szekely, G., Kikinis, R., Warfield, S.K.: Comparison of salient point detection methods for 3d medical images. In: 2005 MICCAI Open Source Workshop (2005) 10. Zhao, C., Shi, W., Deng, Y.: A new Hausdorff distance for image matching. Pattern Recognition Letters 26, 581–586 (2005) 11. Baudrier, E., Nicolier, F., Millon, G., Ruan, S.: Binary-image comparison with local-dissimilarity classification. Pattern Recognition 41, 1461–1478 (2008) 12. Peng, X., Chen, W., Ma, Q.: Feature-based nonrigid image registration using a Hausdorff distance matching measure. Optical Engineering 46, 057201 (2007) 13. Morain-Nicolier, F., Lebonvallet, S., Baudrier, E., Ruan, S.: Hausdorff distance based 3d quantification of brain tumor evolution from MRI images. In: Proc. 29th Annual Intl Conf of the IEEE EMBS, pp. 5597–5600 (2007) 14. Billet, E., Fedorov, A., Chrisochoides, N.: The use of robust local Hausdorff distances in accuracy assessment for image alignment of brain MRI. Insight Journal (January-June 2008), http://hdl.handle.net/1926/1354 15. Rousseeuw, P., Leroy, A.: Robust Regression and Outlier Detection. John Wiley & Sons, Chichester (1987) 16. Ibanez, L., Schroeder, W.: The ITK Software Guide 2.4. Kitware, Inc. (2005) 17. Rogelj, P., Kovaˇciˇc, S., Gee, J.C.: Point similarity measures for non-rigid registration of multi-modal data. Comp. Vis. and Image Underst. 92, 112–140 (2003) 18. Prastawa, M., Bullitt, E., Gerig, G.: Synthetic ground truth for validation of brain tumor MRI segmentation. In: Duncan, J.S., Gerig, G. (eds.) MICCAI 2005. LNCS, vol. 3749, pp. 26–33. Springer, Heidelberg (2005)
User Driven Two-Dimensional Computer-Generated Ornamentation Dustin Anderson and Zoë Wood California Polytechnic State University
Abstract. Hand drawn ornamentation, such as floral or geometric patterns, is a tedious and time consuming task that requires skill and training in ornamental design principles and aesthetics. Ornamental drawings both historically and presently play critical roles in all things from art to architecture, and when computers handle the repetition and overall structure of ornament, considerable savings in time and money can result. Due to the importance of keeping an artist in the loop, we present an application, designed and implemented utilizing a user-driven global planning strategy, to help guide the generation of two-dimensional ornament. The system allows for the creation of beautiful, organic ornamental 2D art which follows a user-defined curve. We present the application and the algorithmic approaches used.
1 Introduction
Hand-drawn ornamentation, like that drawn in Figure 1, is a tedious and time consuming task that requires much skill and training in ornamental design principles and aesthetics. Ornamental drawings both historically and presently play critical roles in all things from architecture to art, and allowing computers to handle the repetition and tedium of ornamental generation allows for considerable time savings. Building on concepts from Computer-Generated Floral Ornament [2], we have created an application which allows users to generate 2D ornament that more strongly adheres to the ornamental design principles than in previous works. Due to the importance of keeping an artist in the loop, we present an application designed and implemented utilizing a user-driven global planning strategy. Ornamentalists use five principal techniques in conveying a perception of order: repetition, balance, conformation to geometric constraints, growth, and conventionalization [3,4,5]. In brief, these principles are:
1. Repetition: Even a simple geometric mark, when repeated through translation, rotation, or scaling, can serve as the basis of an ornament.
2. Balance: The principle of balance requires that asymmetrical visual masses be made of equal “weight” [2].
3. Conformation to Geometric Constraints: A careful fitting to boundaries is a hallmark of ornament [6]. In addition, for structural integrity, tangential junction provides a powerful sense of physical support to an ornament.
Fig. 1. (a) A physiographic wave ornament taken from [1]. (b) One of the wave segments created using our system.
4. Growth: Growth is a means of transporting design into new regions and continuing patterns. Especially for floral ornament, growth is an essential aspect of creating organic-looking ornament. Additionally, intention provides another avenue for artistic control, expressing growth with external influences taken into consideration, such as growth toward pre-placed flowers or guidance along a central vine.
5. Conventionalization: In ornament, conventionalization is the development of abstractions of natural form. When artists develop a conventionalization, they extract only the essential aspects of form, and the result often is stylized and modified to be more aesthetic.
Of these principles, our application adheres to repetition, balance, growth, and geometric constraints. Our system allows users to select when repetition will be used with radius-to-texture mappings, and balancing is a completely automated process. Our system's interactivity supports the principle of growth by allowing the user to guide an ornament's growth through intention. Our application allows the user to place a main curve and special user-placed polygons called no-draw regions where ornament may not exist, which are used to guide the overall structure of an ornament. Our system conforms to these geometric constraints, and since our system carefully generates ornament elements along a user-defined curve, all generated ornament structures follow the principle of tangential junction. Tangential junction gives the overall ornament a sense of physical “strength” insofar as it seems to “hang together,” unlike the ornament generated by the system in [2], which intentionally grows ornament with the goal of filling space. These features, coupled with utilizing an interactive user-defined curve and no-draw region placement as a global planning strategy for ornament structure, allow ornamentalists to create beautiful and organic-looking ornamental 2D art with our system.
2 Related Work
Many areas of computer graphics are related to computer generated ornamentation with the most relevant work being done by Wong et al.[2]. Other early
work that contributed to the field include: generating the 17 symmetry patterns within a plane [7], generating periodic tilings and patterns [8], synthesis of frieze patterns [9], and generation of flora using computers [10]. In addition, Beach and Stone introduced the idea of procedurally generating a simple repeating border pattern that is warped to follow the path of a spline in their paper on graphical style sheets [11], an idea that was expanded on by Hsu and Lee, who introduced the notion of “skeletal strokes” to warp vector clip art along a path [12,13]. Other areas of related work include L-systems for computer-generated growth [14], fractals and dynamical systems [6], computer generated Celtic design [15], and generative parametric design of gothic window tracery [16]. In the work by Wong et al. [2], a modern approach to generating floral ornament is presented, and the types of ornamentation are classified. The output from the system is called “adaptive clip art”. The implementation of the algorithm by Wong et al. first places the ornamental elements algorithmically using proxies to the actual geometry. A growth model handles the placement of the proxies, where new “growth” of the ornament is accomplished by applying rules from existing motifs into portions of the panel that are not yet populated. Artists are responsible for creating the actual geometry for each proxy, but the final placement of ornament element proxies is determined by the algorithm. A significant contribution from the work by Wong et al. is that the system does not create ornaments using traditional botanical growth models such as Lsystems[14]. Instead the growth model represents the artist’s process in creating aesthetic stylized plant designs, and is not meant to mirror the growth of actual flora. Kaplan [6] points out that although much effort is given to describe the principles of ornamental design in the work by Wong et al., the implementation of the system only loosely adheres to them. This technique appropriately deals with small areas, which are able to be ornamented in an aesthetic fashion. Larger areas, however, such as those in an architectural setting, would most likely fail to be aesthetically pleasing due to the lack of any sort of global planning strategies that would guide the growth of ornaments.
3 Overview and Algorithms
Our system is an application for use in the creation of two-dimensional ornamental drawings. In general, the system allows users to input the control points for a curve which defines the general underlying structure of an ornament. The curve is loaded into a buffer and then proxies are seeded along it according to user-defined controls. Proxy sizes are determined by user controls and geometric constraints. Once seeded, varying textures are mapped onto the primitive proxy geometry and displayed to the user. Texture variations are controlled by user’s selected mapping of proxy size ranges to specific textures. At this point, the user can decide to balance the ornament or not. Furthermore, the user is allowed to define polygonal regions where ornament may not exist, further promoting the user’s artistic control over the global planning of the ornament.
Fig. 2. The creation of seemingly multiple ornaments from a single curve with radius-balanced group sizes 1:1. (a) The original ornament. (b) The ornament with no-draw regions active. (c) The final ornament (curve and no-draw regions hidden).
3.1 Goals
Our goal was to create a system that allowed for the direct, accurate, and interactive creation of two-dimensional ornamentation using global planning. Specifically we wanted users to be able to:
1. Create a fairly complex structural curve intuitively using the mouse
2. View the underlying structure and components of their ornament as it is created
3. Generate ornament elements that seem to “grow” from the user-defined structural curve
4. Compose a personalized ornament intuitively that adheres to the principles of ornamental design
5. Fine-tune a computer-generated ornament if desired, but also be able to create ornaments quickly without having to modify hundreds of controls
The following sections describe our algorithm in more detail and demonstrate how our algorithm achieves these goals.
3.2 Curve Representation
In order to achieve these goals, the system was designed with the user in mind and works in real time. Because global planning was the main methodology for creating a user-driven ornament, curve placement is essential. Curve points frequently sampled and connected with short lines were chosen over longer, straighter, and sharper line segments in order to achieve a more organic aesthetic. The underlying curve representation is a Catmull-Rom representation. Catmull-Rom curves allow for two directional changes at any given control point, which is crucial for giving the user the freedom to construct curves with varying size segments and varying curvatures. Our system allows for the placement of up to fifty control points via mouse input, satisfying the first goal of being able to create a fairly complex structural curve intuitively using the mouse. Our system
Fig. 3. A radius-balanced ornament with group sizes 1:1 with oriented texture elements avoiding no-draw regions. The no-draw regions can be seen in the buffer window (grey), and the curve is hidden.
uses 4th degree equations, which are preferred over higher degree equations for mathematical simplicity. Other curve representations could be used, however, we were pleased with the results using the Catmull-Rom curves. For a longer discussion of the curve representation please see [17]. Figure 2 shows an example of complex ornamentation using the curve placement provided by our application. 3.3
Viewing Underlying Ornament Structure
In order to address our second goal of being able to view the underlying structure and components of an ornament as it is created, the application includes two
Fig. 4. A radius-balanced ornament with group sizes 1:1 and oriented floral texture elements. The interpolated curve is drawn as white.
Fig. 5. Another example of ornamentation created using our application
windows viewing the ornatmention: a buffer window and an interactive window. Once the user defines the control points of the curve, the curve is drawn into the interactive window. The interactive window is the area of our application where the user enters input into the system via the mouse by placing control points and defining no-draw regions. The buffer window is where the underlying components of a user’s ornament are shown in real-time as the ornament is modified. The user can choose to view the element proxies, control points, and/or the curve normals in the interactive window by turning on visibility through the application options. See Figures 3 and 6 for examples of the buffer window view and the interactive window. The interactive window is a reflection of the components in the buffer window, where proxies are mapped with textures and displayed as ornamental elements on screen. Before proxies have been calculated and placed around the curve, the curve is scanned into a two-dimensional array called the image buffer. Each pixel that matches the user-defined curve color and/or outline color is considered a buffer hit, and its value in the buffer is set to a constant value representative of existing geometry. All other pixels are loaded into the image buffer as empty. The mapping of the curve into the image buffer is a critical preparatory step for the seeding algorithm which calculates the placement of proxies that both do not overlap the curve, and best fill up the space. 3.4
Seeding Algorithms
Generating ornament along the user-placed curve creates an ornament with a strong sense of tangential junction. This satisfies the third goal of being able to create ornament where elements will “grow” from the user-defined structural curve, as seen in Figure 4. The algorithm executes as follows:
Fig. 6. In this progression, proxies begin to collect around the structural curve as sampling distance is decreased. (a) An ornament with the structural curve and no-draw regions hidden in the interactive window, but shown in the adjacent buffer window. (b) The sampling distance along the curve in (a) is decreased from 10 to 4. (c) The curve in (b) is then linearly interpolated.
For each sampling point along the curve corresponding to the user-defined sampling distance, a normal is computed. This calculated normal points to the correct side (left or right) of the curve, determined by the group sizing controls the user has set. A new proxy center is then generated at the user-defined largest radius size away from the curve along the normal. At this point, intersections between the new proxy and the curve, any other proxy, and no-draw regions are tested for by indexing into the image buffer. If intersection occurs, the new proxy’s radius is decreased by one pixel, and the center of the proxy is moved along the normal to keep the proxy as close to the curve as possible. The process of intersection testing, decreasing radius size, and moving proxies continues until no intersections occur. Once placement is final, the proxy is saved into the image buffer, and the corresponding element in the interactive window is texture mapped according to the user-defined radius-to-texture mappings.
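A condensed sketch of this seeding loop is given below; the helper routines (curve_point, curve_normal, intersects_buffer, commit_proxy) are hypothetical placeholders for the system's own curve evaluation, normal computation, image-buffer test, and buffer update, and the parameterization of the curve by t in [0, 1] is an assumption of this sketch.

def seed_proxies(curve, image_buffer, sample_step, max_radius, min_radius=1):
    proxies = []
    t = 0.0
    while t <= 1.0:
        p = curve_point(curve, t)    # point on the interpolated curve (hypothetical helper)
        n = curve_normal(curve, t)   # unit normal toward the chosen side (hypothetical helper)
        radius = max_radius
        while radius >= min_radius:
            # Placing the center at radius along the normal keeps the proxy tangent to the
            # curve; shrinking the radius automatically slides the center back toward it.
            center = (p[0] + n[0] * radius, p[1] + n[1] * radius)
            if not intersects_buffer(image_buffer, center, radius):
                commit_proxy(image_buffer, center, radius)
                proxies.append((center, radius))
                break
            radius -= 1  # shrink by one pixel and retry
        t += sample_step
    return proxies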
3.5 The Balancing Algorithms and Error Checking
As defined earlier, balancing of an ornament requires that asymmetrical visual masses be made of equal “weight.” In our system, balancing can only occur when elements are placed along the curve, where the curve splits the drawing area into left-space and right-space. Element placement along the curve, however, can be balanced by adjusting the “weight” of every element on one side of the curve with the elements on the other, either by balancing all proxy radii or by balancing the areas within each proxy. Various methods of balancing elements are possible, including balancing by radius, balancing by area, and balancing by texture map density. In the current implementation of the system we explored balancing by radius and area. In general, balancing is done by calculating the sum of all proxy weights (for example, radii) on the left of the curve, the sum of all proxy weights on the right of the curve, and decreasing proxy weights accordingly to make the larger sum equal to the smaller sum. As long as balancing is possible, this algorithm is invoked to balance the current ornament. Figure 4 depicts a simple ornament, balanced by radius. Note that balancing may not be possible in certain circumstances where group sizes are far apart. When two weight totals can never become equal due to the minimum and maximum weight constraints (say radius growth size due to geometric constraints), a warning message is provided to the user.
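A sketch of radius balancing in this spirit is shown below; scaling the heavier side uniformly is an assumption of this sketch, since the text only states that proxy weights are decreased until the two sums match, and the impossibility check stands in for the warning message described above.

def balance_by_radius(left_radii, right_radii, min_radius=1):
    left_sum, right_sum = sum(left_radii), sum(right_radii)
    if left_sum == right_sum or 0 in (left_sum, right_sum):
        return left_radii, right_radii
    heavy, target = (left_radii, right_sum) if left_sum > right_sum else (right_radii, left_sum)
    scale = target / sum(heavy)
    scaled = [max(min_radius, r * scale) for r in heavy]
    if sum(scaled) > target and all(r == min_radius for r in scaled):
        # The minimum-radius constraint makes the totals impossible to equalize.
        raise ValueError("balancing is not possible for these group sizes")
    return (scaled, right_radii) if left_sum > right_sum else (left_radii, scaled)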
4 Results and Conclusions
Using the work presented by Wong et al. in [2] as both a reference and a springboard for implementation ideas, our contributions give users a means of globally planning ornaments interactively in real time. Our system satisfies the goals from Section 3.1. Through our efforts, an interactive computer application that allows users to produce beautiful, organic ornamental images now exists. The system allows users to select textural elements to decorate a user-defined curve, providing a means of globally planning an ornament's overall structure. We have shown several images created with the system, and more images created with the application can be seen in [17]. In our application, users have controls to modify:
– how the curve is drawn
– the placement, sampling distance, and sizes of proxies
– which components of the ornament are visible
– radius-to-texture mapping ranges
– whether preset styles and/or color inversion are used
– the overall balancing of an ornament and element grouping sizes
– no-draw regions and their visibility
These controls allow users to completely personalize an ornament. Here, we explain our contributions in more depth, and compare our work with the work in [2] where appropriate. Specifically, our work:
– Provides an interactive method for designing two-dimensional ornament, including curve placement and texture selection and their mappings. Our system receives input through the front-end GUI, allowing users to exert artistic control over their ornament. Additionally, the ornament created with our system need not be limited by any given “theme” such as “floral” or “geometric” because of the radius-to-texture mappings that can be applied on-the-fly by users. The work done by [2] did not allow for real-time interaction with the ornamentation process.
– Presents a method to generate ornament based on an underlying curve. Inputs in [2] were predefined and were not real-time, ornament filled an arbitrary panel, and it was not able to be globally directed or influenced by external sources. We have purposely kept the growth algorithms straightforward and unobtrusive so that users can have mechanisms for directly and accurately laying down their global planning strategies.
– Helps users generate ornament that automatically adheres more closely to ornamental design principles. The system of [2] produces ornament that only loosely follows these principles. Repetition is controlled by radius-to-texture mappings, but is not fully controllable. Balancing an ornament is an automated process and is fully controllable, as is growth along the user-defined curve. The principle of tangential junction is also upheld during ornament creation, and the user can globally plan their ornament through intention.
– Supplies pre-defined sets of textures and color mappings that define ornament “styles”. Although [2] presents several “styles” of ornament in their work, libraries of these styles were not accessible by users, and proxy geometry could not be changed on-the-fly. In our system, however, any RGB-formatted texture can be loaded at any time. Furthermore, this capability does not restrict the ornament generated by our system to be floral in nature, as is the case in [2]. See Figure 3 for a non-floral example.
Overall, our system serves to augment the process of ornamentation by computationally managing ornament design structure while giving ornamentalists an interactive, real-time, direct, and accurate means to experiment without fear of wasting resources. With our application, users can create beautiful and personalized organic-looking ornament effectively and efficiently.
4.1 Future Work
Since two-dimensional ornamentation can be found on everything from fliers to the human body, the potential uses of our application are boundless. One of the key ways our application could be expanded is through improvements to the interface and interaction. A gesture-based means for creating strokes would allow users a very intuitive means of creating organic ornamentation. In addition, other future improvements, such as genetic algorithms for generation, 3D ornamentation, multiple curves, and no-draw regions as imported geometry, could further improve the existing application. Lastly, only a small group of users have had the opportunity to give us feedback on our system. Users reported that the system is “fun and easy” to use, and the controls are simple enough that users were
easily able to design a personalized ornament within a few minutes; however, a full-blown user study would further improve our application.
References 1. Meyer, F.S.: Handbook of Ornament. Dover Publications, New York (1957) 2. Wong, M.T., Zongker, D.E., Salesin, D.H.: Computer-generated floral ornament. In: SIGGRAPH 1998, pp. 423–434 (1998) 3. Day, L.F.: Nature and Ornament V1: Nature the Raw Material of Design 1909. Kessinger Publishing Company, New York (2007) 4. Ward, J.: The Principles of Ornament. Scribner, New York (1896) 5. Jones, O.: The Grammar of Ornament. DK ADULT, London (1910) 6. Kaplan, C.S.: Computer Graphics and Geometric Ornamental Design. PhD thesis, University of Washington (2002) 7. Alexander, H.: The computer/plotter and the 17 ornamental design types. In: SIGGRAPH 1975, pp. 160–167 (1975) 8. Gr¨ unbaum, B., Shephard, G.C.: Tilings and Patterns. W. H. Freeman & Co., New York (1986) 9. Glassner, A.: Frieze groups. IEEE Comp. Graphics and Applications, 78–83 (1996) 10. Smith, A.R.: Plants, fractals, and formal languages. In: SIGGRAPH 1984 (1984) 11. Beach, R., Stone, M.: Graphical style towards high quality illustrations. SIGGRAPH Comput. Graph. 17, 127–135 (1983) 12. Hsu, S.C., Lee, I.H.H., Wiseman, N.E.: Skeletal strokes. In: UIST 1993 (1993) 13. Hsu, S.C., Lee, I.H.H.: Drawing and animation using skeletal strokes. In: SIGGRAPH 1994, pp. 109–118 (1994) 14. Prusinkiewicz, P., Lindenmayer, A.: The algorithmic beauty of plants. Springer, New York (1990) 15. Kaplan, M., Cohen, E.: Computer generated celtic design. In: EGRW 2003, pp. 9–19. Eurographics Association (2003) 16. Kaplan, M., Cohen, E.: Generative parametric design of gothic window tracery. In: Shape Modeling International 2004, pp. 350–353 (2004) 17. Anderson, D.: Two-dimensinal computer-generated ornamentation using a userdriven global planning strategy. Technical Report CPSLO-CSC-08-02, California Polytechnic State University (2008)
Efficient Schemes for Monte Carlo Markov Chain Algorithms in Global Illumination Yu-Chi Lai, Feng Liu, Li Zhang, and Charles Dyer Computer Science, University of Wisconsin – Madison, 1210 W. Dayton St., Madison, WI 53706-1685, USA
Abstract. Current MCMC algorithms are limited from achieving high rendering efficiency by possibly high failure rates in caustics perturbations and by stratified exploration of the image plane. In this paper we improve the MCMC approach significantly by introducing a new lens perturbation and new path-generation methods. The new lens perturbation method simplifies the computation and control of caustics perturbation and can increase the perturbation success rate. The new path-generation methods aim to concentrate more computation on “high perceptual variance” regions and “hard-to-find-but-important” paths. We implement these schemes in the Population Monte Carlo Energy Redistribution framework to demonstrate the effectiveness of these improvements. In addition, we discuss how to add these new schemes into the Energy Redistribution Path Tracing and Metropolis Light Transport algorithms. Our results show that rendering efficiency is improved with these new schemes.
1
Introduction
Generating a physically-correct image involves the estimation of a large number of highly correlated integrals of path contributions falling on the image plane. Markov Chain Monte Carlo (MCMC) algorithms such as Metropolis Light Transport (MLT) [1], Energy Redistribution Path Tracing (ERPT) [2], and Population Monte Carlo Energy Redistribution (PMC-ER) [3] exploit the correlation among integrals. They all reduce the variance and improve the efficiency of rendering images. However, MCMC algorithms are limited from achieving higher rendering efficiency by the possibly high failure rate of caustics perturbation and by the stratified exploration of the image plane. The predicted range of the perturbation angle for caustics perturbation depends on the path and scene properties. If the predicted range is too large, the failure rate of the caustics perturbation will be high and cause extra high energy to accumulate at some specific spots on the image plane. As a result, the large predicted range decreases the rendering efficiency. Additionally, the MCMC algorithms need to implement the lens and caustics perturbations separately, because it is impossible for the original lens perturbation to generate a new mutated path for caustics paths of the form EDS*D+(D|L), using the light transport path notation introduced by [4,1]. MCMC algorithms have issues in the
extra cost needed for computing the perturbation angles for each path and the burden of predicting the perturbation change on the image plane. Stratified exploration of the image plane is another limitation, because the regions of the image are not all perceptually equally important. To achieve unbiasedness, the ERPT and PMC-ER algorithms carefully generate new paths by evenly distributing the path samples on the image plane. This choice is suboptimal because some areas on the image plane contain higher perceptual variance than others. Contributing more computational effort to reducing the variance in these high perceptual variance regions would increase the perceptual quality. In addition, some types of paths, such as caustics paths, are visually important but hard to find with a general path tracing algorithm. Concentrating more computational effort on these paths can further improve the rendering efficiency. However, evenly exploring the image plane prevents MCMCs from spending more computational effort on exploring those “hard-to-find-but-important” paths and limits the improvement in rendering efficiency. To address these two limitations, we augment the lens perturbation to include caustics perturbation. This new perturbation allows us to control the mutation by a single and simple lens perturbation radius and increases the success rate of caustics perturbations to improve the rendering efficiency. We propose two methods to generate paths in order to spend more computational effort on the exploration of noisy regions and “hard-to-find” paths without introducing bias. We present a variance-generation method that generates paths passing through high perceptual variance regions on the image plane to enhance the perceptual quality of the visually important regions in the rendered image. We also present a caustics-generation method that generates a set of caustics paths with the goal of exploring the caustics path space more thoroughly. We weigh the energy deposited by each perturbation according to the type of the population path to prevent the new generation methods from introducing bias.
2
Related Work
Currently, most global illumination algorithms are based on ray tracing and Monte Carlo integration. Two categories exist: unbiased methods such as [5,6,7], and biased methods such as [8,4,9]. Interested readers can refer to Pharr and Humphreys [10] for an overview of Monte Carlo rendering algorithms. Sample reuse is an important technique for reducing variance by exploiting the correlation among integrals. Metropolis Light Transport (MLT) [1] and Energy Redistribution Path Tracing (ERPT) [2] mutate existing light transport paths into new ones to make use of the correlated information among paths. However, finding good mutation strategies is important but non-trivial for rendering efficiency. PMC-ERs [3] adapt the Population Monte Carlo framework to energy redistribution. Their algorithms can concentrate the computation on the important light paths and automatically adapt the extent of energy redistribution according to each path’s properties. This eases the problem of choosing the non-trivial mutation strategies that exists in the MLT and ERPT algorithms.
However, there exist several limitations in MLT, ERPT, and PMC-ERs that prevent them from achieving higher rendering efficiency. In this paper we propose several new modifications to MCMC algorithms and implement them in PMC-ERs to demonstrate their effectiveness in improving rendering efficiency.
3
New Schemes to MCMCs
In this section we present the new lens mutation and path-generation methods in PMC-ER-E. Interested readers can refer to [3] for the details of the original PMC-ER-E algorithm. In this work, a path, Ỹ, is a light transport path as defined in [4,11] and denoted L(S|D)*E. Figure 1(a) shows an example of such a path. 3.1
New Lens Perturbation
Details related to the kernel function, the choice of mutation strategies, and the computation of acceptance probability for the selected mutation are discussed in [3]. Here we only focus on how to use the perturbation method to replace the original caustics perturbation.
Fig. 1. (a) This is a path of the form LDDSDE and used to demonstrate the replacement of caustics perturbation with the new lens perturbation. We would like to replace the caustics sub-path y5 y4 y3 y2 y1 of the form of EDSDD. We first perturb the pixel position of the original path at y5 by uniformly choosing a point from the perturbing disk and then cast a view ray to pass through the new pixel position as shown in the bottom to get y4 . We link y4 and y3 to form the link path. Then, we extend the sub-path through the same specular bounce at y3 as the corresponding y3 to get y2 . Then, y2 and y1 are linked to form a new lens-perturbed path with the same form of LDDSDE as the original one. (b) A caustics path is generated by tracing the ray from a light source. Each vertex in the path is linked to the camera vertex. The algorithm then checks whether the new linked path is a caustics path and if it is, it keeps it in the candidate pool. After finishing the whole process, we then randomly choose a path from the candidate pool and put it into the caustics path pool.
Figure 1(a) shows an example of our new lens perturbation method for a caustics path. The lens perturbation replaces a sub-path yn−1 · · · yk of the form EDS*(L|D). In the original implementation of lens perturbation, the lens fails to replace this kind of path because it is impossible to find exactly the same outgoing direction at the first specular bounce from the eye vertex when we perturb the pixel position at the eye vertex. Thus, we need to use caustics perturbation. However, in our new lens mutation, we look to replace the sub-path chain with E(D|S)+[S(D|S)]* sub-paths, which can directly replace the lens and caustics perturbation. First, the perturbation takes the existing path and moves the point on the image plane through which it passes. In our case, the new pixel location is uniformly sampled within a disk of radius d, a parameter of the kernel component. The path is reconstructed to pass through the new image point. If yn−2 is a specular vertex, we choose a specular bounce to find the next vertex and then extend the sub-path through additional specular bounces to be the same length as the original path. If yn−2 is a diffuse vertex, we link yn−2 to yn−3 to form the link path and then extend the sub-path through additional specular bounces to be the same length as the original path. The transition probability of the new lens perturbation for a caustics path can be computed as

T_{d,lens}(Ỹ'|Ỹ) = [G(y_{n−1}, y_{n−2}) / A_d] · ∏_{j=n−3}^{n−k−2} { G(y_j, y_{j+1}) / |cos θ_{j,in}|  if y_j ⊂ S;  1  otherwise }

where G(y_j, y_{j+1}) is the geometric term between y_j and y_{j+1}, A_d is the area of the perturbation disk, and θ_{j,in} is the angle between the surface normal and the direction of the incoming light ray at y_j. This relieves us from the need, present in the original caustics perturbation, to estimate the perturbation angle θ for each path. The computation of θ is difficult, and it is hard to predict the movement that the caustics perturbation causes on the image plane. With our new lens perturbation, a single pixel perturbation radius controls the movement on the image plane. The results show that control is easier and movement on the image plane is more predictable.
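As an illustration only (not the authors' implementation), the transition probability above could be evaluated as in the following sketch once the geometric terms and cosine factors of the reconstructed sub-path are known; the argument names and the per-vertex bookkeeping are assumptions made for the example.

```python
import math

def lens_perturbation_transition(geom, cos_in, is_specular, disk_radius_pixels):
    """Sketch of T_{d,lens}(Y'|Y) for a perturbed caustics sub-path.

    geom[j]            : geometric term G(y_j, y_{j+1}) along the reconstructed sub-path,
                         with geom[0] corresponding to G(y_{n-1}, y_{n-2})
    cos_in[j]          : |cos(theta_{j,in})| at vertex y_j
    is_specular[j]     : True if vertex y_j lies on a specular surface
    disk_radius_pixels : radius d of the pixel perturbation disk
    """
    area_d = math.pi * disk_radius_pixels ** 2      # A_d, area of the perturbation disk
    t = geom[0] / area_d                            # first factor: G(y_{n-1}, y_{n-2}) / A_d
    for j in range(1, len(geom)):
        if is_specular[j]:
            t *= geom[j] / cos_in[j]                # specular vertices contribute G / |cos theta|
        # vertices re-linked on diffuse surfaces contribute a factor of 1
    return t
```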
3.2
Resampling
The resampling process consists of three steps: elimination, which eliminates well-explored and low-contribution samples and deposits the remaining energy of each eliminated path into the image; regeneration, which maintains a constant number of paths in the population and designs an exploration pattern in the path space; and adaptation of α values, which adjusts the energy distribution area. In this section, we only focus on the process of regeneration. For details of elimination and adaptation, refer to [3]. To generate a new replacement path, we use three types of regeneration paths: paths passing through a set of stratified pixel positions, paths passing through a set of pixel positions generated according to the perceptual variance, and a set of caustics paths traced from the light sources. To achieve this, we need two modifications to the original algorithm. First, we split the resampling loop into two loops (s, t). At the beginning of each s loop, we modify the step that generates a pool of stratified pixel positions in the original energy redistribution algorithm so that it generates a pool of pixel positions and a set of caustics paths. In the t loop, we apply the
resampling process to the entire population. Second, we need to modify the energy deposition from Ed = ed to Ed = R ∗ ed, where R is related to the properties of the path and is discussed later in this section. The following are the implementation details.
Pixel Positions from Stratification Criterion. It is important to evenly distribute the starting pixel positions in order to reduce variance and guarantee the unbiasedness of energy redistribution algorithms. Thus, in each s loop, we assign the Nuniform samples as initial paths for each pixel.
Pixel Positions from Perceptual Variance Criterion. In order to generate new sample paths in regions with possibly higher perceptual variance, we have to keep track of the radiance of traced paths, similar to a path tracing algorithm, by adding an extra image variable, I. In the process of estimating the average path energy, we keep track of the radiance of energy-estimated paths in I. In the following steps we also keep track of the radiance of the newly generated initial and replacement population paths in I. Then we can compute the value

β^(s)_{i,j} = σ²_{i,j} / tvi(I_{i,j})

where I_{i,j} is the average radiance that falls on pixel (i, j), σ²_{i,j} is the variance among all radiance samples falling in pixel (i, j), and tvi(I) is the threshold-versus-intensity function introduced by Ferwerda et al. [12] for perceptually weighting the variance. β^(s)_{i,j} is used to indicate the degree to which more samples are required at pixel (i, j). At the beginning of each s loop, we first choose Nvariance pixel positions (i, j) according to the weights β^(s)_{i,j}. After choosing pixels, we can compute the total number of samples falling on a pixel, Nuniform + Nvariance(i, j), and then we evenly distribute the starting pixel positions inside the pixel. This forms a pool of pixel positions. During the regeneration process, we ask for a pixel sample from this pool or ask for a new path from the pool of caustics paths. If we get a pixel sample, we then use the path tracing algorithm to generate a path passing through the new pixel position. The unweighted energy of the path is calculated as described in [3]. Later, we will describe how to weigh the deposited energy without introducing bias.
Caustics Paths. A path tracing algorithm traces paths starting from the eye. However, some types of paths are easier to trace when starting from a light source, e.g., caustics paths. The photon mapping algorithm uses caustics photons to improve the rendering efficiency. The rendering results in Figure 2(c) show that caustics paths are hard to find with the path tracing algorithm but are very important for generating the smooth caustic regions on the floor near the dragon. These two observations motivate us to have specific types of light paths to enable the exploration of the caustics path space. At the beginning of each outer iteration, i.e., the s loop, we generate a pool of Ncaustics caustics paths in the following way. First, we choose a light source and then choose a position on that light source as the start vertex. From the light vertex, we trace a path in the scene as described in [11,7]. Then, we connect each vertex in the light path to the camera vertex. If the complete path formed is a valid caustics path, we keep
Fig. 2. (a) The top image is a Cornell Box image computed using PMC-ER-E with all new schemes with (S = 1, T = 1225, Nvariance = 199200, Ncaustics = 108000); the left image in the bottom row is the cropped image of the caustics region for the Cornell Box scene computed using PMC-ER-E with (S = 1, T = 1225), the middle is the cropped image computed by the PMC-ER-E algorithm with (S = 1, T = 1225), and the right is the cropped image computed by the PMC-ER-E algorithm with (S = 2, T = 1225). (b) The top image is a room scene computed using PMC-ER-E with all new schemes with (S = 8, T = 2350); the bottom is computed using PMC-ER-E with (S = 8, T = 2350, Nvariance = 170400, Ncaustics = 121200). (c) The top is the rendering result of a dragon scene computed using PMC-ER-E with all new schemes with (S = 1, T = 2430, Nvariance = 779700, Ncaustics = 30300); the left image in the middle row is the cropped image of the caustics region below the dragon head computed using PMC-ER-E, the right in the middle row is the cropped image computed by PMC-ER-E with (S = 1, T = 2430), the left in the bottom row is the cropped image computed by PMC-ER-E with (S = 3, T = 2430), and the right in the bottom row is the cropped image computed by PMC-ER-E with (S = 8, T = 1620) iterations.
the path in the candidate pool. Finally, we construct one valid caustics path by randomly choosing a valid one from the candidate pool. Figure 1(b) shows an example. The criteria for a caustics path are: first, the length of the path must be over 4 vertices; second, the path must contain at least one specular vertex; third, the first connection vertex from the eye vertex must be a diffuse surface. Without weighting the path energy, these extra “hard-to-find” paths will introduce bias. Next, we describe how to weigh the deposited energy without introducing bias.
Weighting the Energy of Newly Regenerated Paths. In the original energy redistribution algorithm, we evenly distribute the pixel positions, and the energy distribution ratio R should be 1. However, if we apply extra samples on each pixel and extra caustics paths without weighting the energy of each path, the extra samples and paths introduce biased energy into the image. In this section, we describe how to weigh the energy to ensure that the result is still unbiased. For the perceptual-variance-type regeneration, each pixel originally has Nuniform samples dropped in the effective area, and this guarantees that the expected energy deposited from paths initialized from each pixel is the same. To keep the energy deposited in the region statistically equal, we should weigh the deposited energy of the path by Rvariance = Nuniform / (Nvariance(i, j) + Nuniform), where Nuniform is the number of uniform samples assigned per pixel and Nvariance(i, j) is the number of samples assigned to pixel (i, j) according to the perceptual variance. By weighing the energy by a ratio of Rvariance we make sure that the total energy expected to be distributed starting from that pixel is the same. The caustics paths are global because they are light paths that can pass through any pixel position on the image plane. Thus, we need to handle them a little differently. In a scene we should expect the ratio of caustics paths to general paths generated by the path tracing algorithm to be fixed. We can use this ratio to weigh the energy of all caustics paths in order to avoid bias. The ratio can be calculated as Rcaustics = Nexpect / (Nadd + Nexpect), where Nexpect is the expected total number of caustics paths generated by the stratified regeneration method. In the initial process, when we estimate the average energy of a path, we also estimate RG2C, which is the ratio of the total number of caustics paths to the total number of general paths. Then, during the regeneration process, we compute Nexpect = RG2C ∗ Npixels ∗ Nuniform and Rcaustics. By weighing the energy of each caustics path by a ratio of Rcaustics, we can guarantee the unbiasedness of the final result. The actual ratio, R, used to deposit a path’s energy falls into the following three situations: first, if a path is from the pool of pixel positions and is a caustics path, the ratio should be Rvariance(i, j) × Rcaustics; second, if a path is from the pool of pixel positions but is not a caustics path, the ratio should be Rvariance(i, j); third, if a path is from the pool of caustics paths, the ratio should be Rcaustics. By using the appropriate ratio, we guarantee the unbiasedness.
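As a worked illustration (a sketch under the definitions above, not the authors' code), the deposition ratio R for a newly regenerated path could be computed as follows; the argument names are placeholders for the quantities defined in the text.

```python
def deposition_ratio(from_pixel_pool, is_caustics_path, n_uniform, n_variance_ij,
                     r_g2c, n_pixels, n_add):
    """Energy deposition ratio R for a regenerated path.

    n_uniform     : uniform samples assigned per pixel (N_uniform)
    n_variance_ij : variance-driven samples assigned to this pixel (N_variance(i, j))
    r_g2c         : estimated ratio of caustics paths to general paths (R_G2C)
    n_pixels      : number of pixels on the image plane (N_pixels)
    n_add         : number of extra caustics paths added to the pool (N_add)
    """
    r_variance = n_uniform / (n_variance_ij + n_uniform)
    n_expect = r_g2c * n_pixels * n_uniform            # caustics paths expected from
    r_caustics = n_expect / (n_add + n_expect)         # stratified regeneration alone
    if from_pixel_pool and is_caustics_path:
        return r_variance * r_caustics                 # situation 1
    if from_pixel_pool:
        return r_variance                              # situation 2
    return r_caustics                                  # situation 3: drawn from the caustics pool
```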
4
Results
To evaluate the performance of our improvements, we compared our methods against the original PMC-ER equal deposition algorithm on a Cornell Box (CB) scene, a dragon scene, and a complex room scene, using the criterion of starting with a similar number of initial PT paths. In all three cases, we used a population size of 5000 and three perturbation radii: 5, 10, and 50 pixels. In each step of the inner loop, each member generates 20 mutations, and 40% of the population is eliminated based on its remaining energy and regenerated. We used 16 samples per pixel (SPPs) for estimating Ẽ and RG2C. When applying the new schemes to the PMC-ER algorithms, we used NSPP, the number of SPPs, to compute Ntotal, the number of initial paths, and Niteration, the number of total iterations, for the PMC-ER algorithms. We then chose (S, T) so that Niteration = S × T to indicate the total iterations used in PMC-ERs. If we implement the new regenerations into PMC-ERs, we also choose the pool size of the variance regeneration, Nvariance, and the pool size of the caustics regeneration, Ncaustics, for each S. Thus, (S, T), Nvariance, and Ncaustics are the main parameters used. Table 1 presents the improvement statistics when applying each new scheme separately and together with PMC-ER-E. We used the perceptually-based mean squared efficiency (P-Eff) metric defined in [13] for comparing algorithms. The comparison between PMC-ER-E with the original perturbations and PMC-ER-E with the new lens perturbation in rendering the three scenes shows
Table 1. Measurements comparing the PMC-ER with original lens and caustics mutation and the stratified regeneration with PMC-ER with the new lens mutation and the stratified regeneration, PMC-ER with original lens and caustics perturbation with all regeneration methods, and PMC-ER-E using all the new schemes

Image   Method       Total Iter (S, T)   Nvariance   Ncaustics   Time (s)   Err       Eff
Box1    E*           1, 1225             0           0           4769.1     0.0267    7.85e-3
Box1    E+Lens**     1, 1225             0           0           4683.1     0.0207    1.03e-2
Box1    E+Reg***     1, 1225             199200      108000      5366.3     0.0135    1.38e-2
Box1    E+Lens+Reg   1, 1225             199200      108000      5266.4     0.0113    1.68e-2
Dragon  E            1, 2430             0           0           13081.3    3.09      2.47e-5
Dragon  E+Lens       1, 2430             0           0           12640.4    1         7.91e-5
Dragon  E+Reg        1, 2430             779700      30300       14296.7    0.985     7.10e-5
Dragon  E+Lens+Reg   1, 2430             779700      30300       14097.7    0.164     4.33e-4
Room    E            8, 2350             0           0           96575.1    0.0274    3.78e-4
Room    E+Lens       8, 2350             0           0           95812.1    0.0208    6.91e-4
Room    E+Reg        8, 2350             170400      121200      98158.9    0.0105    9.70e-4
Room    E+Lens+Reg   8, 2350             170400      121200      98032.5    0.00569   1.52e-3

* E represents the original PMC-ER-E algorithm using the lens and caustics mutation with stratified regeneration. ** +Lens represents that we implement the new lens mutation into PMC-ERs. *** +Reg represents implementations of the new regeneration methods in PMC-ERs.
that we gain an improvement in rendering efficiency by a factor of 1.31 for the CB scene, 3.2 for the dragon scene, and 1.82 for the room scene. The comparison between the original PMC-ER-E algorithm with stratified regeneration and PMC-ER-E with the new regeneration methods in rendering the three scenes shows an improvement by a factor of 1.76 for the CB scene, 2.87 for the dragon scene, and 2.56 for the room scene. The comparison between the original PMC-ER-E algorithm with stratified regeneration and PMC-ER-E with all new schemes shows an improvement by a factor of 2.14 for the CB scene, 17.53 for the dragon scene, and 4.02 for the room scene. When viewing an image, the attention of the viewer is drawn towards the caustic regions in the image because caustic regions are usually brighter than the regions next to them. Thus, improving the quality of the rendered caustic regions has a large impact on the perception of a rendered image. The caustics regeneration concentrates more computation in the caustics path space. In addition, the new lens mutation increases the perturbation success rate, which increases the exploration of the caustics paths for each population path. As a result, our algorithm can generate smoother caustics regions for the dragon and CB scenes. In the room scene, we observe that the variance regeneration puts more samples around the regions of the light on the right of the image. There is no obvious caustics region in the scene, but the bright spots generated during the rendering process mostly come from caustics paths. Thus, concentrating more computation on the exploration of the caustics path space reduces the variance of the resulting image. In addition, the failure rate of the caustics perturbation is high for this scene. With the new lens mutation method, the success rate increases significantly. As a result, the rendered image is much smoother.
5
Discussion and Conclusion
In this section we present a short discussion of how to apply these schemes in the MLT and ERPT frameworks. The original lens and caustics perturbation methods in MLT can be directly replaced by our new lens perturbation method. To apply the new generation methods to MLT, we can first use these methods to generate a pool of paths passing through high-variance regions and caustics paths. Then, during the mutation process, we can replace the current seed path with one of the paths from the pool. We can compute the acceptance probability accordingly and decide whether the seed path transfers to the newly generated path. This should achieve a similar result to the one presented in our demonstration. Since ERPTs contain a preprocessing phase to estimate the average energy of paths, we can implement an algorithm similar to the one stated in Section 3.2 to estimate the perceptual variance in each pixel and RG2C in the preprocessing phase. After deciding on the number of caustics paths and variance-generated samples, we distribute the variance-generated samples according to the perceptual variance and generate the caustics paths. The energy-deposit ratio, R, is computed as described in Section 3.2.
In addition to the factors listed for the original PMC-ER algorithm [3], there is another important factor: the ratio between the total number of stratified regeneration paths and the special regeneration paths. If the ratio is too high, the image space exploration rate will be too low. As a result, this reduces the variance of the highly explored regions, but we will have higher variance in other regions. If the ratio is too low, our algorithm reverts to the original PMC-ER algorithm. In the current implementation a proper value is set by trial and error. In the future, we would like to implement an automatic mechanism. In this paper we proposed two new path regeneration mechanisms, tracing paths through high perceptual variance regions and generating “hard-to-find” paths, with a proper weighting scheme that concentrates sampling without introducing bias. In addition, the new mutation method eases the control and computation of the caustics perturbation. Both schemes improve rendering efficiency.
References 1. Veach, E., Guibas, L.J.: Metropolis light transport. In: SIGGRAPH 1997, pp. 65–76 (1997) 2. Cline, D., Talbot, J., Egbert, P.: Energy redistribution path tracing. In: SIGGRAPH 2005, pp. 1186–1195 (2005) 3. Lai, Y., Fan, S., Chenney, S., Dyer, C.: Photorealistic image rendering with population monte carlo energy redistribution. In: Eurographics Symposium on Rendering, pp. 287–296 (2007) 4. Heckbert, P.S.: Adaptive radiosity textures for bidirectional ray tracing. In: SIGGRAPH 1990, pp. 145–154 (1990) 5. Kajiya, J.T.: The rendering equation. In: SIGGRAPH 1986, pp. 143–150 (1986) 6. Veach, E., Guibas, L.J.: Bidirectional estimators for light transport. In: Proc. of the 5th Eurographics Workshop on Rendering, pp. 147–162. Eurographics Association (1994) 7. Lafortune, E.P., Willems, Y.D.: Bi-directional path tracing. In: Proceedings of Compugraphics, pp. 145–153 (1993) 8. Ward, G.J., Rubinstein, F.M., Clear, R.D.: A ray tracing solution for diffuse interreflection. In: SIGGRAPH 1988, pp. 85–92 (1988) 9. Jensen, H.W.: Realistic image synthesis using photon mapping. AK Peters (2001) 10. Pharr, M., Humphreys, G.: Physically Based Rendering from Theory to Implementation. Morgan Kaufmann, San Francisco (2004) 11. Veach, E.: Robust Monte Carlo Methods for Light Transport Simulation. PhD thesis, Stanford University (1997) 12. Ferwerda, J.A., Pattanaik, S.N., Shirley, P., Greenberg, D.P.: A model of visual adaptation for realistic image synthesis. In: SIGGRAPH 1996, pp. 249–258 (1996) 13. Fan, S.: Sequential Monte Carlo Methods for Physically Based Rendering. PhD thesis, University of Wisconsin-Madison (2006)
Adaptive CPU Scheduling to Conserve Energy in Real-Time Mobile Graphics Applications Fan Wu, Emmanuel Agu, and Clifford Lindsay Worcester Polytechnic Institute, Worcester, MA 01609
Abstract. Graphics rendering on mobile devices is severely restricted by available battery energy. The frame rate of real-time graphics applications fluctuates due to continual changes in the LoD, visibility, and distance of scene objects, user interactivity, the complexity of lighting and animation, and many other factors. Such frame rate spikes waste precious battery energy. We introduce an adaptive CPU scheduler that predicts the application's workload from frame to frame and allocates just enough CPU cycles to render the scene at a target rate of 25 FPS. Since the application's workload needs to be re-estimated whenever the scene's LoD changes, we integrate our CPU scheduler with LoD management. To further save energy, we try to render scenes at the lowest LoD at which the user does not see visual artifacts on a given screen. Our integrated Energy-efficient Adaptive Real-time Rendering (EARR) heuristic reduces energy consumption by up to 60% while maintaining acceptable image quality at interactive frame rates. Keywords: Energy conservation, Multiresolution rendering, Real-time rendering, CPU scheduling.
1
Introduction
Battery-powered mobile devices, ranging from laptops to cell phones, have become popular for running 3D graphics applications that were previously developed exclusively for desktop computers or gaming consoles. Mobile devices now feature more processing power, programmable graphics hardware, and increased screen resolution. Emerging mobile graphics applications include multiplayer games, mobile telesurgery, and animations. Mobile graphics applications offer a new commercial opportunity, especially considering that the total number of mobile devices sold annually far exceeds the number of personal computers sold. The mobile gaming industry already reports revenues in excess of $2.6 billion worldwide annually, and is expected to exceed $11 billion in 2010 [1]. The most limiting resource on a mobile device is its short battery life. While mobile CPU speed, memory and disk space have grown exponentially over the years, battery capacity has only increased 3-fold in the past decade. Consequently, the mobile user is frequently forced to interrupt their mobile graphics experience to recharge dead batteries.
Fig. 1. Screenshot of HP iPaq & example application frame rate: (a) screenshot on an HP iPaq Pocket PC; (b) application running at a high real-time frame rate (overshot vs. utilized frame rate against the 25 FPS threshold)
Application-directed energy saving techniques have previously been proposed to reduce the energy usage of non-graphics mobile applications. Our main contribution in this paper is the introduction of application-directed energy saving techniques to make mobile graphics applications more energy-efficient. The main idea of our work is that energy can be saved by scheduling fewer CPU timeslices or lowering the CPU's clock speed (Dynamic Voltage and Frequency Scaling (DVFS)) for mobile applications during periods when their requirements are reduced. In order to vary the CPU timeslices allotted to a mobile application, we need to accurately predict its workload from frame to frame. Workload prediction is a difficult problem since the workload of real-time graphics applications depends on several time-varying factors, such as the user interactivity level, the current Level-of-Detail (LoD) of scene meshes and mip-mapped textures, the visibility and distance of scene models, and the complexity of animation and lighting. Without dynamically changing the application's CPU allotment to correspond to its needs, the mobile application's frame rate fluctuates whenever there is a significant change in scene LoD, animation complexity, or other factors that affect its workload. Such spikes above 25-30 Frames Per Second (FPS) drain the mobile device's battery; in our measurements they increased energy consumption by up to 70% (see Figure 1b). We propose an accurate method to predict the mobile application's workload and determine what fraction of the CPU's cycles it should be allotted to maintain a frame rate of 25 FPS. As the application's workload changes, we update its CPU allotment at time intervals determined by a windowing scheme that is sensitive to applications with fast-changing workloads and prudent for applications with slow-changing workloads. Our adaptive CPU scheduling scheme dampens frame rate oscillations and saves energy. Since the application's workload changes and should be re-estimated whenever LoDs are switched, we have coupled our CPU scheduler with the application's LoD management scheme. When switching scene LoD, we minimize energy consumption by selecting the lowest LoD at which the user does not see visual artifacts, also known as the Point of Imperceptibility (PoI) [2]. Although our primary goal was to minimize the mobile application's energy consumption, we also ensured that the frame rates and visual quality of the rendered LoD were acceptable. In summary, our integrated EARR (Energy-efficient Adaptive
Real-time Rendering) heuristic minimizes energy consumption by (i) selecting the lowest LoD that yields acceptable visual realism, and (ii) scheduling just enough CPU timeslices to maintain real-time frame rates (25 FPS). EARR also switches scene LoD to compensate for workload changes caused by animation, lighting, user interactivity, and other factors outside our control. To the best of our knowledge, this is the first work to use CPU scheduling to save energy in mobile graphics. Our results on animated test scenes show that CPU scheduling reduced energy consumption by up to 60% while maintaining real-time frame rates and acceptable image realism. The rest of the paper is organized as follows. Section 2 presents related work and background, Sections 3 through 5 describe our proposed EARR heuristic, Section 6 describes our experimental results, and Section 7 presents conclusions and future work.
2
Related Work and Background
Application-Directed Energy Management: This class of energy management schemes uses Dynamic Voltage or Frequency Scaling (DVFS) [3,6] or intelligently reduces the application's output quality to conserve energy [4,5]. For instance, energy can be saved by intelligently reducing video quality [5] or document quality [4]. In DVFS, energy is conserved by dynamically reducing the processor's speed or voltage, without degrading the application's quality. GRACE-OS [3] proposes a DVFS framework for multimedia applications, which probabilistically predicts the CPU requirements of multimedia applications in order to guide CPU speed settings. Chameleon [6] proposes CPU scheduling policies for soft real-time (multimedia), interactive (word processor), and batch (compiler) applications. To the best of our knowledge, adaptive CPU scheduling to conserve energy has not previously been applied to graphics applications. LoD management to maintain real-time frame rates: Funkhouser and Sequin [7] and Gobbetti [9] both describe systems that bound rendering frame rates by selecting the appropriate LoD. While Funkhouser and Sequin used discrete LoDs, Gobbetti extended their work by using multiresolution representations of geometry. Wimmer and Wonka [10] propose estimating the upper limit of rendering times. The Performer system maintains a specified rendering speed [12] by switching LoDs whenever the application frame rate changes. Tack et al. [11] describe a model for predicting rendering times on mobile devices. Background: In order to minimize the amount of energy consumed, our technique uses a combination of adaptive CPU scheduling, LoD management using wavelets, and the Point of Imperceptibility (PoI) metric. Our goal during scheduling is to assign the lowest fraction of CPU timeslices that runs the application at the selected LoD as close to 25 FPS as possible. To further refine our ability to adjust scene complexity in order to conserve energy, we can adjust wavelet levels in two situations. First, we minimize energy usage by continuously choosing the lowest mesh resolution that is visually acceptable. Second, we can switch to lower mesh LoDs to compensate when animation, lighting, and other scene elements not in our control require more CPU cycles. In
order to quantify the distortion caused by mesh simplification, we use the perceptual metric proposed by Wu et al. [2], which generates a PoI LoD for a mesh. This metric factors in the effects of lighting, shading, and texturing on the perceptibility of simplification artifacts and exploits knowledge of how human perception works to pinpoint the lowest LoD at which mesh simplification artifacts are imperceptible on mobile displays of different resolutions.
3
Our EARR Heuristic
The accurate prediction of the application's CPU requirements in advance is important in our work because allotting CPU cycles only after the application's demands increase causes jerkiness or the frame rate to drop. In complex real-time graphics applications, it is difficult to accurately model and predict all factors that affect the observed frame rates. Using a statistical workload predicting model, we developed a heuristic that is both efficient to compute and accurate. Through a series of steps described below, our EARR heuristic compares the predicted with the actual frame rate and adaptively adjusts future predictions, mesh LoDs, and CPU resource allocations to minimize energy consumption. At the start of the heuristic, all meshes are rendered at their PoI LoD. As the mesh moves during an animation, EARR reallocates CPU resources using the workload predicting model and the CPU scheduling policy. There are three cases to which our heuristic is required to adjust the application parameters, each requiring different actions. Let d denote the current LoD of a mesh, let dp denote its PoI LoD, and let f denote the frame rate at which the scene is currently being rendered. The three cases are as follows: Case 1: the predicted frame rate drops such that f < 25, the current LoD d is the minimum LoD possible, and 100% of CPU cycles are already allotted to this task. In this case, we are at the limits of the parameters under our control (minimizing LoD and maximizing CPU cycles). We conclude that the mobile device's resources are not enough to render the scene at 25 FPS and we cannot rectify the situation. In such a scenario, we simply choose the minimum possible LoD, set the CPU cycles to the maximum, and achieve the highest frame rate possible. Case 2: the predicted frame rate drops such that f < 25 and the current LoD d equals the PoI LoD, dp. In this case, the heuristic will allocate more CPU resources to increase the rendering frame rate. If the frame rate is still less than 25 FPS, the heuristic will then choose a lower LoD level to increase the frame rate to 25 FPS and allocate the optimal fraction of CPU cycles, Copt, accordingly. We note that in this case, to achieve 25 FPS, we are forced to use an LoD below the mesh PoI, which will cause visual artifacts. Case 3: the predicted frame rate increases such that f >> 25 and the current LoD d equals the PoI LoD, dp. EARR continues to use the PoI LoD but tries to save energy by reducing the percentage of CPU timeslices scheduled for our application to the minimum required to maintain a frame rate of 25 FPS. Figure 2a is the flow chart of the EARR heuristic.
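The case analysis above can be summarized in a small sketch (illustrative Python, not the authors' implementation); the frame-rate predictor, the discrete LoD levels, and the 5% CPU adjustment step are assumptions made for the example.

```python
TARGET_FPS = 25

def earr_adjust(lod, lod_poi, lod_min, cpu_fraction, predict_fps, step=0.05):
    """One EARR adjustment: return the (lod, cpu_fraction) to use for the next window."""
    fps = predict_fps(lod, cpu_fraction)
    if fps < TARGET_FPS:
        # Case 2: first buy frame rate with more CPU while keeping the current LoD ...
        while fps < TARGET_FPS and cpu_fraction < 1.0:
            cpu_fraction = min(1.0, cpu_fraction + step)
            fps = predict_fps(lod, cpu_fraction)
        # ... then drop below the PoI LoD if the target is still not met.
        while fps < TARGET_FPS and lod > lod_min:
            lod -= 1                      # below the PoI: visual artifacts become possible
            fps = predict_fps(lod, cpu_fraction)
        # Case 1: lod == lod_min with 100% CPU -> simply render as fast as possible.
    else:
        # Case 3: overshoot; keep the PoI LoD and shed CPU cycles down to the minimum.
        lod = lod_poi
        while cpu_fraction - step > 0 and predict_fps(lod, cpu_fraction - step) >= TARGET_FPS:
            cpu_fraction -= step
    return lod, cpu_fraction
```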
Fig. 2. Heuristic and frame rate: (a) EARR heuristic flow chart; (b) frame rates at check points along the animation path (Simple, LoD Selection, and EARR relative to the 25 FPS threshold)
In the following sections, Section 4 describes our workload predicting model and Section 5 describes the CPU scheduling policy.
4
Workload Predicting Model
The workload predicting model predicts what fraction of the available CPU timeslices should be allotted to our mobile application in order to sustain a target frame rate of 25 FPS. To minimize energy consumption, the goal of the CPU scheduler is to allocate just enough CPU cycles to finish rendering each frame just before its deadline expires. Our target frame rate of 25 FPS yields a deadline of 40 ms for each frame to complete rendering. The optimal (fewest) CPU resources Copt that meet our task's deadline can be expressed as: Copt =
(rmax / τ) × Cmax (1)
where Cmax is the maximum available allotment of the processor's timeslices, Copt is a reduced allotment of CPU timeslices generated by our algorithm, which just meets the frame's deadline, rmax is the rendering time of a mesh if all available processor cycles are allotted to our application, and τ is the deadline for the frame. As mentioned above, τ = 40 ms. We apply our workload predictor as follows. At runtime, given a frame rendering deadline, τ, we use equation 1 to calculate the optimal CPU processor allocation, Copt. We then use our pre-generated statistics to estimate the mesh LoD that corresponds to Copt. A complex scene typically contains multiple objects at different LoDs. The workload for rendering the scene depends not only on the total number of scene triangles, but also on their visibility, which varies as the camera and objects move. Thousands of triangles might be visible from some camera positions,
Fig. 3. Window size update and flow chart: (a) window size updates over time (actual vs. predicted workload and the current window); (b) flow chart for the workload predicting model (N = 8)
whereas just a few may be seen from others. We used the eye-to-object visibility algorithm described in [13] to determine the set of visible objects to be rendered in each frame and calculated the entire scene's workload as the sum over all visible objects, where each object's workload is determined by the method of Funkhouser and Sequin [7], who suggested that the number of triangles in a mesh is a good predictor of its rendering time. To characterize the accuracy of the method, the relative error between measured rendering times and the rendering times predicted by our workload predictor for a single model was calculated for various LoDs. The results corroborate the results of Funkhouser and Sequin [7], since all relative errors are less than 4%. The eye-to-object visibility algorithm culls away large portions of a model that are occluded from the observer's viewpoint, and thereby improves the accuracy of workload estimation significantly. In the eye-to-object algorithm, the scene can be subdivided into cells, and the model partitioned into sets of polygons attached to each cell. After subdivision, cell-to-cell and eye-to-cell visibility are computed for each cell of the subdivision. This algorithm has been previously used to accelerate a range of graphics computations, including ray-tracing and object-space animation. Thus far, our predictor focused on the workload for rendering one frame of a scene. Next, we consider changes in the application's workload over time. Since the application workload changes only slightly from one frame to the next (milliseconds), we maintain the current predicted workload for n future frames, where n is called the window size and is varied depending on how quickly the workload is changing. The choice of n affects the performance of our algorithm. If n is too small, then we update the predicted workload too often, incurring large computation overheads. If n is too large, then the system may not be sensitive enough to fast-changing workloads and the error between the predicted and actual workload may become too high. Therefore, in our prediction model, the window size n is adaptively varied at run-time. Figure 3a shows our window size update method, which is inspired by the Transmission Control Protocol (TCP) in networking. Initially, we calculate and update our predicted workload every two
(a) Accuracy of Workload Predicting Model
ts Max. Processor Availability Cmax
td
Application Workload p
Time
(b) Processor Availability Vs. Workload
Fig. 4. Accuracy of Workload Predicting model and Processor Availability
frames (n = 2). Every time the error between the predicted and actual workload is smaller than a pre-determined threshold, the window size is doubled. We continue to double the window size until n = 8, beyond which the window size is increased by 1 every time the observed error falls within an acceptable threshold. Whenever the workload prediction error exceeds the threshold, we reset the window size to 2 and set predicted workload to the value of the currently observed workload value. Figure 3b is a flow chart of the workload predicting model. The adaptive workload predictor estimates the workload of each frame at full processor speed, from which we can estimate the CPU timeslices required to render a frame at our target frame rate. We tested our technique with two scenes provided by the Benchmark for Animated RayTracing(BART) [8], The results are shown in figure 4a. It can be observed that the relative errors are bounded in 0.18.
5
CPU Scheduling Policy
Our CPU scheduler runs in three phases: workload estimation, estimation of processor availability, and determination of the processor resource allocation. We now formalize our CPU scheduling algorithm. For each real-time task T, let us denote its start time by ts and its deadline by td. Let Cmax denote the maximum fraction of CPU timeslices that are currently available for running applications. It is important to note that without the intervention of our scheduling algorithm, all tasks run with a 100% allocation of all available CPU timeslices, Cmax. The fraction of CPU timeslices required by T will be denoted by p. We note that the execution time of the task T is inversely proportional to p. In summary, a feasible schedule of the task guarantees that the task T receives at least a fraction, A, of the maximum available CPU cycles, i.e., that it receives A ∗ Cmax CPU cycles before its deadline, where A ≤ 1. Given the application workload p, the maximum processor availability Cmax, and the interactivity deadline td, as shown in Figure 4b, our allocation policies fall into two distinct cases.
A = p / min(Cmax, td − ts) (2)

Copt = Cmax, if Cmax < p̂
Copt = min(Cmax × p̂ / min(Cmax, td − ts), Cmax), otherwise (3)
Case 1: If Cmax < p, then the application's demand for CPU timeslices exceeds the CPU availability. In this case, the CPU scheduler cannot meet the task's deadline while using the current mesh LoD. Our scheduling algorithm allots all available CPU timeslices to the task and also reduces the mesh LoD to lower the offered workload p. Case 2: If ts + p < td, the task can complete before its deadline. If all available CPU resources are allotted to this task, the rendering speed achieved is greater than 25 FPS. In this case, the algorithm reduces the fraction of CPU timeslices allotted such that the demanded workload p is just adequate to complete the task before its deadline. The percentage of CPU resources allotted is calculated in equation 2. In equation 2, the deadline td − ts is known; we chose td − ts = 40 ms, and p is determined using our workload predictor. The maximum CPU resources currently available, Cmax, can be monitored by our resource adapter. Given an estimated demanded workload, p̂, and the maximum processor availability, Cmax, the optimal CPU resource allocation, Copt, is computed in equation 3.
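A sketch of this allocation policy, written against the reconstructed equations 2 and 3 above (an illustration only, not the authors' implementation; the predicted workload p̂ is assumed to be expressed in the same time units as the 40 ms deadline).

```python
def allocate_cpu(p_hat, c_max, t_s, t_d):
    """Return the fraction of CPU timeslices to allot for a frame.

    p_hat    : predicted (demanded) workload of the frame
    c_max    : maximum fraction of CPU timeslices currently available
    t_s, t_d : task start time and deadline (t_d - t_s = 40 ms for 25 FPS)
    """
    if c_max < p_hat:
        # Case 1: demand exceeds availability; allot everything and rely on the
        # EARR heuristic to lower the LoD and reduce the offered workload.
        return c_max
    # Case 2: scale the allotment so the frame finishes just before its deadline.
    a = p_hat / min(c_max, t_d - t_s)        # equation (2)
    return min(c_max * a, c_max)             # equation (3)
```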
6
Experiment and Results
In this section we describe the performance of the EARR heuristic on both a laptop and a PDA. The laptop used was a Windows Vista Lenovo T61 laptop equipped with an Intel Core 2 Duo 2.1 GHz processor and 3 GB RAM. The PDA is a Windows CE HP iPAQ Pocket PC h4300 with a 400 MHz Intel XScale processor and 64 MB RAM. We repeated all experiments eight times and eliminated the minimum and maximum values before averaging the others. We animated a pre-determined animation path in the kitchen scene provided by the Benchmark for Animated Ray Tracing (BART) [8]. We ran three sets of experiments using the BART kitchen scene, applying three levels of adaptation: (1) Simple: no LoD switching, no adaptive CPU scheduling; (2) LoD Selection: LoD switching, no adaptive CPU scheduling; (3) Our EARR heuristic: LoD selection with adaptive CPU scheduling. Measuring the exact energy consumption of the CPU alone is a fairly difficult problem. To measure the energy consumption of the CPU independent of our experiments, we subtracted the base idle power consumption to give our application's power usage and multiplied the power usage by the execution time to arrive at the energy consumption. In our experiment, the base power consumed by the laptop in idle mode was 12.58 W. During our experiments, we set 20 check points along the animation path of the mesh. Figure 2b is a plot of the measured frame rates at these check points while testing the three different adaptation levels described in this section.
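The subtractive estimate amounts to the following small calculation (illustrative sketch; the 20 W draw and 300 s runtime in the example are hypothetical numbers, while 12.58 W is the idle power reported above).

```python
def application_energy_joules(measured_power_w, idle_power_w, runtime_s):
    """Energy attributed to the application: (measured - idle) power times execution time."""
    return (measured_power_w - idle_power_w) * runtime_s

# Hypothetical example for the laptop (idle power 12.58 W):
energy = application_energy_joules(measured_power_w=20.0, idle_power_w=12.58, runtime_s=300.0)
```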
In the experiments labeled “Simple,” the lack of scene LoD switching causes the observed rendering speed to be generally low and non-uniform, as shown by the black dashed line of Figure 2b. The straight dashed line is the target minimum frame rate of 25 FPS. Without LoD selection, the Simple experiment does not achieve the target frame rate of 25 FPS. In the experiments at the “LoD Selection” adaptation level, the objects do not show visual artifacts due to LoD reduction and the application frame rate is always above 25 FPS. Even though the frame rate is much faster than in the “Simple” test, it still fluctuates considerably. Moreover, since no CPU scheduling is used, 100% of all available CPU cycles (Cmax) are always allotted to the application, and at many points during the experiment the scene rendered much faster than (overshot) 25 FPS. The blue dashed line of Figure 2b illustrates this point: CPU cycles are wasted when rendering the scene beyond 25 FPS. At frame 20 and frame 120, the frame rate drops; this heuristic compensates by choosing a lower LoD to render, causing the frame rate to go back up. However, this lower LoD will show some visual artifacts since it is below the PoI LoD. At frame 40 and frame 170, the available CPU resources are enough to maintain a frame rate greater than 25 FPS, and we then switch the LoD of the meshes back to their PoI. In comparison to the other two experiments, the frame rate of the “EARR heuristic” is more uniform with fewer fluctuations, as shown by the red solid line of Figure 2b. As in the “LoD Selection” heuristic, at frame 20 and frame 120 the frame rate drops. The EARR heuristic first tries to increase the allotted CPU timeslices while using the PoI LoD. Since the frame rate continues to drop, the EARR heuristic selects a lower LoD and runs the CPU scheduler algorithm, which reduces the CPU resources allotted to 52% of the maximum available. Energy is saved while the application frame rate of 25 FPS is maintained. In the experiments run on the laptop, when the “Simple” heuristic is used with the objects at their original LoD, the frame rate is only 13.54 FPS. However, when the EARR heuristic is used with multiple objects at their PoI LoDs, the frame rate is maintained at 25 FPS and never goes above 29 FPS. Therefore, on average, the target frame rate of 25 FPS is maintained. Figure 1a shows a screenshot of our application on the PDA. Our results show that the LoD Selection heuristic saves 27.4% of the energy, while the EARR heuristic saves 62.3% of the energy consumption.
7
Conclusion and Future Work
We have presented our EARR heuristic, which minimizes energy consumption while maintaining acceptable rendering speed and image quality. Our proposed EARR heuristic uses a workload predictor to adaptively predict frame rendering times and a dynamic CPU scheduler to save energy used by mobile 3D applications. Our experimental results show that our EARR heuristic generated more uniform frame rates than the other strategies and successfully found the best rendering parameters that minimized mobile resource usage. Our experiments demonstrated energy savings of about 60%. In the future, we shall extend our work in the following ways. 1) Improve energy saving by integrating Dynamic Voltage Scaling (DVS)
and Dynamic Frequency Scaling (DFS). We expect our heuristic will yield further savings after integrating DVS or DFS. 2) Improve the PoI by integrating the eye's gaze pattern. The eye's gaze pattern is another important factor affecting human visual perception. With cues about the eye's gaze pattern, we can increase the LoD of objects that the user focuses on while reducing the LoD of objects outside of the focus area. In this way, even more rendering costs can be saved. 3) Accurately measure CPU energy usage. We currently estimate CPU energy usage using the subtractive technique described in Section 6, which can be improved in accuracy. We plan to develop methods to more accurately measure CPU energy consumption on mobile devices.
References 1. Mobile Games Indus. Worth US $11.2B by 2010 (2005), http://www.3g.co.uk/PR/May2005/1459.htm 2. Wu, F., Agu, E., Lindsay, C.: Pareto-Based Perceptual Metric for Imperceptible Simplification on Mobile Displays. In: Proc. Eurographics 2007 (2007) 3. Yuan, W., Nahrstedt, K.: Practical voltage scaling for mobile multimedia device. In: Proc. of ACM MM 2004, pp. 924–931 (2004) 4. Flinn, J., de Lara, E., Satyanarayanan, M., Wallach, D., Zwaenepoel, W.: Reducing the energy usage of office applications. In: Proc. of Middleware 2001 (2001) 5. Tamai, M., Sun, T., Yasumoto, K., Shibata, N., Ito, M.: Energy-aware video streaming with QoS control for portable computing devices. In: Proc. of ACM NOSSDAV 2004, pp. 68–73 (2004) 6. Liu, X., Shenoy, P., Corner, M.: Chameleon: Application level power management with performance isolation. In: Proc. ACM MM 2005 (2005) 7. Funkhouser, T., Sequin, C.: Adaptive display algorithm for interactive frame rates during visualization of complex virtual environments. In: Proc. of ACM SIGGRAPH 1993, pp. 247–254 (1993) 8. Lext, J., Assarsson, U., Moller, T.: A Benchmark for Animated Ray Tracing. IEEE Computer Graphics and Applications 21(2), 22–31 (2001) 9. Gobbetti, E., Bouvier, E.: Time-Critical Multiresolution Scene Rendering. In: Proc. of IEEE Visualization, pp. 123–130 (1999) 10. Wimmer, M., Wonka, P.: Rendering time estimation for Real-Time Rendering. In: Proc. of the Eurographics Symposium on Rendering, pp. 118–129 (2003) 11. Tack, N., Moran, F., Lafruit, G., Lauwereins, R.: 3D Rendering Time Modeling and Control for Mobile Terminals. In: Proc. of ACM Web3D Symposium, pp. 109–117 (2004) 12. Rohlf, J., Helman, J.: IRIS Performer: A High Performance Multiprocessing Toolkit for Real-Time 3D Graphics. In: Proc. ACM SIGGRAPH, pp. 381–395 (1994) 13. Teller, S.: Visibility Computations in Densely Occluded Polyhedral Environments. Ph.D. thesis (1992)
A Quick 3D-to-2D Points Matching Based on the Perspective Projection Songxiang Gu1, Clifford Lindsay1, Michael A. Gennert1,2, and Michael A. King2
1 Worcester Polytechnic Institute, Worcester, MA 01609
2 University of Massachusetts Medical School, Worcester, Massachusetts 01655
Abstract. This paper describes a quick 3D-to-2D point matching algorithm. Our major contribution is to substitute a new O(2^n) algorithm for the traditional N! method by introducing a convex-hull-based enumerator. Projecting a 3D point set onto a 2D plane yields a corresponding 2D point set. In some cases, matching information is lost. Therefore, we wish to recover the 3D-to-2D correspondence in order to compute the projection parameters. Traditionally, an exhaustive enumerator permutes all the potential matching sets, which is N! for N points, and a projection parameter computation is used to choose the correct one. We define “correct” as the point match whose computed parameters result in the lowest residual error. After computing the convex hull for both the 2D and 3D point sets, we show that the 2D convex hull must match a circuit of the 3D convex hull having the same length. Additionally, a novel validation method is proposed to further reduce the number of potential matching cases. Finally, our matching algorithm is applied recursively to further reduce the search space. Keywords: Convex Hull, Residual Error, Horizon, Calibration.
1 Introduction
3D-to-2D point matching is still an open topic in computer vision. Projecting a 3D point set onto a 2D plane yields a corresponding 2D point set. If the points are identical, the correspondence information is lost through the projection. On the other hand, if our input data consists of 3D and 2D point sets and we want to estimate the projection parameters based on the 3D and 2D point coordinates, we have to know the point correspondence information. The traditional way [1] to acquire the best matching 3D point set is to enumerate all potential matching cases via a camera calibration computation. A single matching case yields a set of projection parameters, which we use to project the 3D points into the 2D plane. We then calculate the residual error, which we define as the difference between the projected 2D points and the input 2D points. We then choose the matching case with minimal residual error as the correct one. Recovering the best matching correspondence without any constraint requires a search space of N! for N points.
This work was supported by the National Institute of Biomedical Imaging and Bioengineering under Grant R01-EB001457.
G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 634–645, 2008. © Springer-Verlag Berlin Heidelberg 2008
Even without any extra constraints, we can still shrink the N! search space using the topology information provided by the point sets and the multi-view projection. In this paper, we introduce a novel point matching algorithm that picks the correct match with O(2^n) complexity and also solves for the perspective projection parameters. We use pose estimation [2] to determine a transformation in order to simulate the perspective projections. Therefore, given a set of matching 2D and 3D points, we can compute the projection parameters with a closed-form calibration method [3], [4]. Once the camera parameters are determined, we can use them to project the original 3D points into 2D and compute the residual error in 2D. If the matching information is correct with minimal distortion, then the residual error would be zero. In the presence of measurement noise, we expect a small, but non-zero, residual error. Therefore, an incorrect match will produce a large residual error and can be disregarded. We show experimental results to validate the correctness of our method. Our experiment builds a relationship between a camera and a Single Photon Emission Computed Tomography (SPECT) system, which is used to collect radioactivity information. Our 3D points are both retro-reflective and radioactive; we acquire the 2D image of the points with the camera and the 3D coordinates with the SPECT system. Since the 3D information is collected via radioactivity, it is impossible for us to mark the points for correspondence. Therefore, by inputting the identical 3D and 2D point sets into our method, we can obtain the correct correspondence as well as the camera parameters. We also designed a simulation procedure to study the performance of our method. The rest of the paper is organized as follows: Section 1.1 covers related work, Section 2 gives an in-depth explanation of our approach, Section 3 outlines our experiments, and Section 4 discusses conclusions and future work.

1.1 Related Work
The RANSAC algorithm [5] is one of the most popular algorithms for 3D-to-2D point matching. Building on the traditional ICP algorithm [6], Fitzgibbon [7] registered large numbers of 3D and 2D points with the LM-ICP (Levenberg-Marquardt Iterative Closest Point) algorithm. Although it works well for a large number of random points, it does not necessarily converge to a global optimum. Furthermore, both algorithms require models in order to compute a match. For our application, we need an optimal correspondence without providing a model, and our number of points is small. Goshtasby [8] matched point patterns with convex hull edges. However, he did not investigate the complete inherent relationship between the 3D and 2D convex hulls. Cyr [9] introduced a new method to register 3D objects to 2D projections using shape. Kita [10] introduced an iterative searching method for registering 3D volume data to 2D projection images. This method can be adopted for 3D-to-2D point matching. Kurazume [11] developed another estimation method that calculates the pose of 3D objects. None of these methods necessarily converges to the best match.
Although calibration is not the main contribution of our work, it is an important component of this paper. The pin-hole camera model is a closed-form calibration model. Lenz and Tsai [3] introduced a calibration method that solves the pin-hole model without distortion and skew. Tsai [4] improved the model by using a two-step calibration procedure. With the same input, this method can handle skew and distortion problems. At least 6 points in both 3D and 2D space are required, with only a single camera and image. Since we have the 3D and 2D point coordinates and only one camera, we employ Tsai's calibration method in our point matching algorithm.
2 Approach

2.1 Overview
In this paper, we combine calibration with the point matching computation as a whole to create an efficient algorithm for the globally optimal matching case. To simplify the problem, we do not consider distortion in this paper. Although there are N! potential 3D-2D point matching cases in total, most of them can be disregarded. Only the potential matching sets that follow the topological restrictions between the 3D and 2D point sets are considered for the calibration computation. To utilize the topology information and simplify the computation, we compute the convex hull of both the 3D and the 2D point sets. Since we assume no distortion, a 2D convex hull corresponds to a circuit on the 3D convex hull (we give a simple proof in Appendix A). With this theorem, the topology reduces the factorial method to an exponential one. Secondly, we claim that not all 3D circuits, but only 3D horizons [12], can be projected into the 2D plane as a convex hull. Therefore, we propose a validation method that invalidates a large number of 3D circuits. Then, a recursive method is adopted to search the remaining points dynamically. After putting all the filtered matching cases into the calibration computation, we create a set of camera parameters. The residual error is computed from the calculated camera parameters. Given a pair of 3D/2D point sets, the exact matching case and the correct projection parameters are finally picked out by the minimum residual error. The residual error is introduced to measure the validity of a potential matching case. For a 3D point set S and its 2D projection set S', we have

E = \frac{1}{n} \sum_{i=1}^{n} \left\| S'_i - \mathrm{Proj}\bigl(S_i, \mathrm{Calib}(\bar{S}, S')\bigr) \right\|,   (1)

where n is the number of points, \bar{S} is one of the potential matching cases, Proj() is the projection function taking the 3D point set and the camera parameters, and Calib() denotes the calibration. We want to try as few candidates \bar{S} as possible to obtain the minimum residual error E.
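To make the role of Eq. (1) concrete, the following host-side C++ sketch shows the brute-force baseline that the rest of the paper improves upon: every permutation of the 3D points is tried, a calibration is run for each candidate correspondence, and the candidate with the lowest mean reprojection residual wins. The functions calibrate() and project() are hypothetical placeholders standing in for the closed-form calibration [3, 4] and the perspective projection; this is an illustrative sketch, not the authors' implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <numeric>
#include <vector>

struct P3 { double x, y, z; };
struct P2 { double u, v; };
struct CameraParams { /* intrinsics and extrinsics, omitted */ };

CameraParams calibrate(const std::vector<P3>&, const std::vector<P2>&); // assumed placeholder
P2 project(const CameraParams&, const P3&);                             // assumed placeholder

// Mean residual of Eq. (1) for one candidate matching (a permutation of indices).
double meanResidual(const std::vector<P3>& S, const std::vector<P2>& S2,
                    const std::vector<int>& match) {
    std::vector<P3> ordered;                      // S reordered by the candidate match
    for (int idx : match) ordered.push_back(S[idx]);
    CameraParams cp = calibrate(ordered, S2);     // Calib(S-bar, S') in Eq. (1)
    double e = 0.0;
    for (size_t i = 0; i < S2.size(); ++i) {
        P2 p = project(cp, ordered[i]);           // Proj(...) in Eq. (1)
        e += std::hypot(p.u - S2[i].u, p.v - S2[i].v);
    }
    return e / S2.size();
}

// Exhaustive N! search; the convex hull enumerator replaces this.
std::vector<int> bruteForceMatch(const std::vector<P3>& S, const std::vector<P2>& S2) {
    std::vector<int> match(S.size()), best;
    std::iota(match.begin(), match.end(), 0);
    double bestErr = std::numeric_limits<double>::max();
    do {
        double e = meanResidual(S, S2, match);
        if (e < bestErr) { bestErr = e; best = match; }
    } while (std::next_permutation(match.begin(), match.end()));
    return best;
}
```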
2.2 Convex Hull Matching
To extract topology information, convex hulls are computed for both the 3D and 2D point sets [Fig. 1]. The convex hull of a point set S is the unique convex polygon
Fig. 1. (a) Convex hull of the 3D point set S; (b) convex hull of the 2D point set S'
or polytope which contains S and all of whose vertices are points of S [13]. Computing the convex hull is a well-studied problem in computational geometry [12]. Yao [14] showed that the lower bound for finding convex hulls is O(n ln n). In two and three dimensions, the quickhull algorithm [15], [16] determines convex hulls for most point sets with time complexity O(n ln n). However, this method may fail when more than 3 points are co-planar. Finally, O'Rourke [17] provided an O(n^2) robust method for 2D and 3D convex hull computations. Based on the convex hull, we split all the points into two categories: boundary points, which lie on the convex hull, and interior points, which lie inside the convex hull. For example, in Fig. 1(a), the 3D points A and C are boundary points and the 3D point B is an interior point. In Fig. 1(b), the 2D points A' and C' are boundary points and the 2D point B' is an interior point. Based on two primary theorems (Theorem 1 and Theorem 2 in Appendix A) from computational geometry [18] concerning 3D and 2D convex hulls, it is easy to prove that all the boundary points on the 2D convex hull correspond to a subset of the boundary points on the 3D convex hull. Furthermore, based on Theorem 3 (Appendix A), we can claim that the circuit of 2D boundary points [Fig. 1(b)] must correspond to a circuit of 3D boundary points on the 3D convex hull [Fig. 1(a)]. This means that, to match the m-length 2D convex hull, we do not have to try all the permutations of the 3D point set, but only the m-length circuits on the 3D convex hull. With this insight from the topology, we shrink the search space in the first step. It is not difficult to trace an m-length circuit on the 3D convex hull. After tracing such a circuit, if the 2D convex hull has m points, we only need 2m trials to search for the correct matching case between the 3D and 2D boundary point sets. In other words, we need m trials proceeding clockwise around the 2D boundary point set and another m trials for the counter-clockwise case. We use the notation H_n^m to denote the number of valid m-length circuits on an n-vertex convex hull. Considering the constraint on convex hull point matching, we search for a length-m circuit on the 3D convex hull in the first matching step, instead of performing an exhaustive N! search. This first convex hull matching step reduces the computational complexity from O(N!) to O(2m H_n^m (N-m)!).
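The 2m boundary trials described above can be sketched as follows; this is an illustrative enumeration (not the authors' code) that, for one m-length circuit on the 3D hull and the ordered 2D hull boundary, produces the m cyclic shifts in each of the two traversal directions as candidate correspondences.

```cpp
#include <utility>
#include <vector>

// Returns the 2m candidate alignments of one 3D-hull circuit against the
// 2D-hull boundary as lists of (3D index, 2D index) pairs.
std::vector<std::vector<std::pair<int,int>>>
boundaryAlignments(const std::vector<int>& circuit3d,   // 3D hull circuit, length m
                   const std::vector<int>& hull2d) {    // 2D hull boundary, length m
    const int m = static_cast<int>(circuit3d.size());
    std::vector<std::vector<std::pair<int,int>>> out;
    for (int dir = 0; dir < 2; ++dir) {            // clockwise / counter-clockwise
        for (int shift = 0; shift < m; ++shift) {  // m cyclic starting points
            std::vector<std::pair<int,int>> corr;
            for (int k = 0; k < m; ++k) {
                int k2 = (dir == 0) ? (shift + k) % m
                                    : (shift - k + m) % m;
                corr.emplace_back(circuit3d[k], hull2d[k2]);
            }
            out.push_back(corr);
        }
    }
    return out;   // exactly 2m candidate alignments
}
```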
Fig. 2. (a) Valid region for a convex edge. (b) Common valid region intersected by multiple valid regions.
2.3 Horizon Validation
If a 3D circuit can be projected into a 2D plane as a convex hull, all of the points and edges in that circuit must be visible to a certain focal point. Such a circuit is called a horizon. In other words, if there is no 3D focal point that can view all the edges of the 3D circuit, the circuit is not a valid horizon and can therefore be excluded. Here we developed a horizon validation method to validate the circuits. As shown in Fig. 2(a), for an edge P1P2 in the 3D circuit, there is always a pair of triangles (ΔP1P2P3, ΔP1P2P4) associated with it. If the edge is projected into a 2D plane as a 2D convex edge, the focal point must view one of the pair of triangles, but not both. In particular, in Fig. 2(a), if the focal point is in the region between V1 and V2, it can view the triangle ΔP1P2P4, but not ΔP1P2P3. Therefore, for each edge there is a region of 3D space in which the focal point must lie, which we call a "valid region" (VR). Each edge determines its own valid region VR. For two edges, there are two valid regions that can be intersected into a common valid region VRc (Fig. 2(b)). If there is a focal point that can view these two edges, the focal point has to be in the common valid region, with VRc ≠ ∅. In general, if all of the edges have at least one common valid region (VRc = VR1 ∩ VR2 ∩ ... ∩ VRl ≠ ∅, where l is the number of edges), the circuit is proved to be a horizon. Otherwise, if the common valid region is empty, no matter where we put the focal point we cannot view the circuit as a horizon; therefore, we consider the circuit to be invalid. Though we have not analyzed the horizon validation mathematically, the simulation results in Section 3.3 show that this algorithm runs in approximately O(2^n), an improvement over the exhaustive search.
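A simplified sketch of the per-edge visibility test that underlies the valid regions is given below. It only encodes the predicate for a given candidate focal point (exactly one of the two adjacent hull triangles must face the focal point); the paper intersects the corresponding regions analytically rather than testing individual focal points, so this is the underlying test, not the full validation.

```cpp
#include <array>

struct Vec3 { double x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b)   { return {a.x-b.x, a.y-b.y, a.z-b.z}; }
static double dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
}

// Does the triangle (a,b,c), with normal taken from its winding, face the point f?
bool faces(Vec3 a, Vec3 b, Vec3 c, Vec3 f) {
    Vec3 n = cross(sub(b, a), sub(c, a));
    return dot(n, sub(f, a)) > 0.0;
}

// Edge P1P2 with adjacent hull triangles (P1,P2,P3) and (P1,P2,P4):
// consistent with focal point f iff exactly one adjacent triangle is visible.
bool edgeConsistent(Vec3 p1, Vec3 p2, Vec3 p3, Vec3 p4, Vec3 f) {
    return faces(p1, p2, p3, f) != faces(p1, p2, p4, f);
}

// A circuit is rejected for focal point f as soon as one edge fails the test.
bool circuitConsistent(const std::array<Vec3,4>* edges, int numEdges, Vec3 f) {
    for (int e = 0; e < numEdges; ++e)
        if (!edgeConsistent(edges[e][0], edges[e][1], edges[e][2], edges[e][3], f))
            return false;
    return true;
}
```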
2.4 Recursive Computation
As shown in Fig. 3(a;b), during the matching procedure, we dynamically split the point sets into m matched points and (N − m) remaining points in both the 3D and 2D point sets by the circuit. Since the two (N − m)-point remaining sets can be considered a new, independent potential matching problem, we developed a recursive method to deal with the remaining points. If (N − m) is still large,
Fig. 3. (a;b) An m-point circuit on the 3D convex hull corresponding to the m points of the 2D convex hull. (c;d) Next layer of points matching.
we can compute the convex hulls of the 2D and 3D (N − m)-point sets and create the potential matching cases for them. As shown in Fig. 3(c;d), we deal with the remaining point sets just like the initial point sets. By recursively repeating the procedure mentioned above until the number of remaining points is small enough, we can reduce the potential matching cases to

O\!\left(2^r \prod_{i=1}^{r} m_i\, H_{n_i}^{m_i}\right),

where r is the number of layers (recursive calls) of point-set splitting. It is easy to tell that the algorithm has no computational benefit when the number of remaining points is fewer than 4. Therefore, we shrink the problem space until (N − m) ≤ 4. Not only the circuit searching but also the horizon validation can be recursively propagated. Initially, the valid region is set to the unbounded (infinite) region before we begin the horizon validation computation. After the first-level horizon validation, we go to the next layer of circuit searching. Since the focal point should not change when we search the circuit in the next layer, the common valid region of the next layer can only lie inside the common valid region of the previous layer. We can propagate the common valid region of the first-level horizon as the initial valid region to the next layer, which is no longer infinite. During the valid-region propagation, the common valid region may be split into several pieces by the edges and triangles. Fortunately, most of the pieces turn out to be invalid very quickly. Since m ≤ n, the 2D convex hull is the benchmark for each recursive convex matching. To optimize the computation, we can pre-compute all the layers of the 2D convex hull. However, when we delete m points from the 3D point set dynamically, we have to recompute the 3D convex hull for the remaining point set and compute a new 3D convex hull for the next recursive step. Finally, for each potential point set, we do a calibration computation of the projection matrix. Then we re-project the 3D points into 2D by perspective projection and calculate the residual error. The point matching case with the smallest residual error is determined to be the correct matching case.
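The recursion of this section can be summarized by the following schematic sketch. The geometric subroutines (2D hull extraction, horizon-validated circuit enumeration, the 2m alignment trials, and the brute-force matching of small remainders) are assumed as placeholders; the sketch only illustrates how matched boundary points are removed and the remainder is processed layer by layer until four or fewer points remain.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using Match = std::vector<std::pair<int,int>>;       // (3D index, 2D index)

std::vector<int> convexHull2D(const std::vector<int>& pts2d);                   // assumed
std::vector<std::vector<int>> horizonCircuits3D(const std::vector<int>& pts3d,
                                                int length);                    // assumed
std::vector<Match> alignCircuit(const std::vector<int>& circuit3d,
                                const std::vector<int>& hull2d);                // 2m trials
std::vector<Match> bruteForce(const std::vector<int>& pts3d,
                              const std::vector<int>& pts2d);                   // assumed

void enumerateMatches(std::vector<int> pts3d, std::vector<int> pts2d,
                      Match sofar, std::vector<Match>& candidates) {
    if (pts2d.size() <= 4) {                          // small remainder: exhaust it
        for (const Match& tail : bruteForce(pts3d, pts2d)) {
            Match full = sofar;
            full.insert(full.end(), tail.begin(), tail.end());
            candidates.push_back(full);
        }
        return;
    }
    std::vector<int> hull2d = convexHull2D(pts2d);
    int m = static_cast<int>(hull2d.size());
    for (const auto& circuit : horizonCircuits3D(pts3d, m)) {      // H_n^m circuits
        for (const Match& corr : alignCircuit(circuit, hull2d)) {  // 2m alignments
            std::vector<int> rest3d = pts3d, rest2d = pts2d;
            for (auto [i3, i2] : corr) {                           // drop matched points
                rest3d.erase(std::find(rest3d.begin(), rest3d.end(), i3));
                rest2d.erase(std::find(rest2d.begin(), rest2d.end(), i2));
            }
            Match next = sofar;
            next.insert(next.end(), corr.begin(), corr.end());
            enumerateMatches(rest3d, rest2d, next, candidates);    // next layer
        }
    }
}
```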
3 Experiments

3.1 Camera Parameters
The pinhole model [3] has 11 parameters, including distortion and skew. For each potential correspondence between the 3D and 2D point sets, we compute the camera parameters that optimally project the 3D world points into the 2D image. We use a closed-form calibration method that computes the intrinsic and extrinsic parameters by solving the perspective projection equation

X^{c} = \mathrm{Proj}(CP, X^{w}),   (2)

where Proj() is the projection function, CP is the set of camera parameters, and X^{w}, X^{c} are the point coordinates in the 3D world and in the 2D camera plane, respectively. The camera parameters CP can be decomposed into

CP = A \cdot P,   (3)

where A = S \cdot \begin{bmatrix} f_x & 0 & IC_x \\ 0 & f_y & IC_y \\ 0 & 0 & 1 \end{bmatrix} is the set of intrinsic parameters, S is the scale factor, f_x, f_y are the scaled focal lengths, and IC_x, IC_y are the image center coordinates. The extrinsic parameters are P = [R|T], where R is a 3 × 3 rotation matrix and T is the translation vector. Given a set of at least 6 pairs of points in both X^{w} and X^{c}, we can linearly compute the camera parameters A and P [1]. After the basic camera parameters are computed, we can re-project the 3D points into the 2D plane; then a second step is performed for the distortion parameters [4]. The computed camera parameters are then used to project each point X^{w} according to Eq. (2), and the residual error is computed by Eq. (1). We select, from the parameter set, the camera parameters that generate the lowest residual error.
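The projection of Eqs. (2)-(3) can be written compactly as in the following sketch, which assembles the intrinsic parameters fx, fy, ICx, ICy and applies the extrinsics [R|T] followed by the perspective division; the structure is illustrative, and the parameter values are assumed to come from the calibration.

```cpp
#include <array>

struct P3 { double x, y, z; };
struct P2 { double u, v; };

struct CameraParams {
    double fx, fy, icx, icy;               // scaled focal lengths, image center
    std::array<std::array<double,3>,3> R;  // rotation
    std::array<double,3> T;                // translation
};

P2 projectPoint(const CameraParams& cp, const P3& Xw) {
    // Extrinsics: Xc = R * Xw + T
    double xc = cp.R[0][0]*Xw.x + cp.R[0][1]*Xw.y + cp.R[0][2]*Xw.z + cp.T[0];
    double yc = cp.R[1][0]*Xw.x + cp.R[1][1]*Xw.y + cp.R[1][2]*Xw.z + cp.T[1];
    double zc = cp.R[2][0]*Xw.x + cp.R[2][1]*Xw.y + cp.R[2][2]*Xw.z + cp.T[2];
    // Intrinsics A (Eq. 3) followed by the perspective division.
    return { cp.fx * xc / zc + cp.icx,
             cp.fy * yc / zc + cp.icy };
}
```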
3.2 Verification Using Real Camera Data
We illustrate this method using a 7-point data set, as shown in Fig. 4. Recalling that at least 6 point pairs are needed to calculate CP, we use 1 extra point to provide redundancy. In this case, the 2D optical data comes from an AXIS PTZ 2130 camera with a resolution of 640 × 480 pixels. As mentioned in Section 3.4, a reasonable estimate of the maximum distortion error is 9 pixels. Then we add radioactivity into the centers of retro-reflective spheres and put the 7-sphere phantom into the SPECT system. The acquired 3D data has a resolution of 256 × 256 × 256 voxels. Each voxel is a cube of 2.33 mm on each side, for a volume of 12.65 mm³. Since the 3D information is acquired from the radioactivity, all the spheres are identical. Using brute-force matching, 7 points need 7! = 5040 calibration computations, which complete in 1.26 seconds. In our experiment, we repeat the data acquisition 6 times with different camera parameters (Fig. 4). The results for these 6 trials are shown in Table 1.
Fig. 4. 6 samples with the same image resolution 640 × 480 but different camera parameters
In the experiments, all final matches are optimal. The average residual error is 2.051 pixels. If the 3D coordinates of the retro-reflective markers are known, the closed-form method developed by Tsai [4] works well. Moreover, this result shows that the current lens distortion usually has little influence on the imaging procedure and does not change the projection topology. These results signify that our solution is applicable to the point matching problem even with real cameras. For our solution, the average number of potential matching cases is 817, which is much smaller than 7!. In this experiment, we cannot change the number of points at will. Therefore, we designed a simulation procedure to show the results for different numbers of points.

3.3 Simulation and Result
Based on the closed-form calibration method, we want to know the average time complexity of our algorithm for point sets with different numbers of points. An evaluation is proposed to simulate experiments with more pseudo-points via the following steps (a sketch of this loop is given after the list):

1. Generate the coordinates for a set of 3D points.
2. Generate a set of camera parameters.
3. Project the 3D points into the 2D camera plane.
4. Shuffle the order of the 2D point set.
5. Put them into our algorithm to find the best matching case.
6. Select the camera parameters that yield the smallest residual error.
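A sketch of one trial of this evaluation loop is shown below; the random point and camera generation and the matcher itself are placeholders, and the timing uses the standard C++ clock.

```cpp
#include <algorithm>
#include <chrono>
#include <random>
#include <vector>

struct P3 { double x, y, z; };
struct P2 { double u, v; };
struct CameraParams { /* ... */ };

std::vector<P3> randomPoints(int n, std::mt19937& rng);              // step 1 (assumed)
CameraParams randomCamera(std::mt19937& rng);                        // step 2 (assumed)
P2 project(const CameraParams&, const P3&);                          // step 3 (assumed)
std::vector<int> matchPoints(const std::vector<P3>&, const std::vector<P2>&,
                             double& residual);                      // steps 5-6 (assumed)

// Returns the matching time in seconds for one random trial with n points.
double runTrial(int n, std::mt19937& rng) {
    std::vector<P3> pts3d = randomPoints(n, rng);                    // 1
    CameraParams cp = randomCamera(rng);                             // 2
    std::vector<P2> pts2d;
    for (const P3& p : pts3d) pts2d.push_back(project(cp, p));       // 3
    std::shuffle(pts2d.begin(), pts2d.end(), rng);                   // 4
    auto t0 = std::chrono::steady_clock::now();
    double residual = 0.0;
    matchPoints(pts3d, pts2d, residual);                             // 5, 6
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```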
To test our convex hull based points matching algorithm, we generated 8 sets of data with 7 to 14 points and put them into our algorithm to measure the matching results and computation time. Previously, we described a brute-force matching method, a basic convex hull based matching method (Section 2.2), and an improved method with horizon validation (Section 2.3). We compared these three methods to show the average time consumption in the simulation (Fig. 5).
Table 1. Time Consumption for the Real Data

Trial   Residual Error (Pixel)   Time Consumption (Sec)   Potential Matching Cases
1.      2.48                     0.313                    780
2.      0.37                     0.328                    780
3.      0.48                     0.343                    780
4.      3.24                     0.322                    780
5.      1.12                     0.375                    1002
6.      3.18                     0.329                    780
Mean    1.810                    0.335                    817.0

Fig. 5. Comparison between the different methods (brute-force, basic convex hull matching, horizon validation). (a) Comparison of average potential matching cases. (b) Comparison of average time consumption. Both quantities are plotted against the number of points.
Fig. 6. The time consumption distribution for the 3000 trials of 9 random point sets: (a) basic convex hull based matching; (b) improved horizon validation matching without distortion tolerance (matching cases vs. time in seconds).
Although we cannot exhaustively list all possible topologies, we repeat the point generation for each point count with a uniform distribution 3000 times. After 8 × 3000 iterations of the simulation, the mean results are shown in Fig. 5. Fig. 5(a) shows the comparison of the average number of valid potential matching cases, and Fig. 5(b) shows the comparison of the average time consumption. From Fig. 5, our method provides an exponential solution for the point
matching problem. As noted earlier, the brute-force method required 1.26 seconds for the 7! calibration computations needed for optimal matching of 7 points. It is reasonable to state that when the number of points is increased to 13, brute-force matching would take more than 18 days to compute, making it impractical. Based on the simulation results, only 77.03 seconds on average are required for the best matching of 13 points. The time consumption of the method is approximately O(2^n). Fixing the number of points to 9, Fig. 6 shows the time consumption distribution for the 3000 trials with random positions. From that distribution, it is easy to tell that for most of the random cases the time consumption is less than 10 seconds. We also show that the distribution of time consumption is roughly Gaussian. Based on the simulation results, our method decreases the search time for the best match through the convex hull based enumerator.
4 Conclusions and Future Work
Our algorithm provides a quick method to obtain the correct 3D-to-2D point matching information. We use the inherent topology information and the multi-view projection principle to shrink the search space. Based on our simulation results, the performance of our method is approximately O(2^n). Our solution fits applications in which the best matching case is required and the number of points is small. In the future, we would like to investigate using our method in other applications, such as 3D-to-2D points matching, pose estimation, and camera calibration. However, it is still an exponential solution: if the number of points increases further, we still have to spend a lot of time searching for the best matching case.
References 1. Gennert, M.A., Bruyant, P.P., Narayanan, M.V., King, M.A.: Calibrating optical images and gamma camera images for motion detection. In: Proc. Soc. Nuclear Medicine 47th Ann. Mtg. (2002) 2. Linnainmaa, S., Harwood, D., Davis, L.S.: Pose determination of a threedimensional object using triangle pairs. IEEE Trans. Pattern Anal. Mach. Intell. 10, 634–647 (1988) 3. Lenz, R.K., Tsai, R.Y.: Techniques for calibration of the scale factor and image center for high accuracy 3-d machine vision metrology. IEEE Trans. Pattern Anal. Mach. Intell. 10, 713–720 (1988) 4. Tsai, R.Y.: A versatile camera calibration technique for high-accuracy 3d machine vision metrology using off-the-shelf tv cameras and lenses. Radiometry, 221–244 (1986) 5. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM 24 (1981) 6. Rusinkiewicz, S., Levoy, M.: Efficient variants of the icp algorithm. 3DIM, 145 (2001) 7. Fitzgibbon, A.: Robust registration of 2d and 3d point sets, vol. 2, pp. 411–420 (2001)
8. Goshtasby, A., Stockman, G.C.: Point pattern matching using convex hull edges. IEEE Trans. on Systems, Man and Cybernetics SCM-15(5), 631–637 (1985) 9. Cyr, C.M., Kamal, A.F., Sebastian, T.B., Kimia, B.B.: 2d-3d registration based on shape matching. In: MMBIA 2000: Proceedings of the IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, p. 198. IEEE Computer Society, Los Alamitos (2000) 10. Kita, Y., Kita, N., Wilson, D.L., Noble, J.A.: A quick 3d-2d registration method for a wide-range of applications. Inter. Conf. on Pattern Recognition 01, 1981 (2000) 11. Kurazume, R., Nishino, K., Zhang, Z., Ikeuchi, K.: Simultaneous 2d images and 3d geometric model registration for texture mapping utilizing reflectance attribute. In: ACCV: The 5th Asian Conference on Computer Vision (2002) 12. Berg, M., Kreveld, M., Overmars, M., Schwarzkopf, O.: Computational Geometry– Algorithm and Applications, 2nd edn. (2000) 13. Weisstein, E.W.: Convex hull (From MathWorld–A Wolfram Web Resource) 14. Yao, A.C.C.: A lower bound to finding convex hulls. J. ACM 28, 780–787 (1981) 15. Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 22, 469–483 (1996) 16. Skiena, S.: Convex hull. 8.6.2 in The Algorithm Design Manual, 351–354 (1997) 17. O’Rourke, J.: Computational geometry in C (1994) 18. Preparata, F.P., Shamos, M.I.: Computational geometry: an introduction. Springer, New York (1985)
Appendix

Theorem 1 [18]. As shown in Fig. 1(b), let l be the line segment defined by two 2D points A' and C'. A'C' is an edge of the 2D convex hull CH(S') if and only if all other points of the point set S' lie on l or to one side of it.

Theorem 2 [18]. As shown in Fig. 1(a), let p be the face defined by three 3D points A, C and D. ΔACD is a face of the 3D convex hull CH(S) if and only if all other points of the point set S lie on the plane p or to one side of it.

Lemma 1. As shown in Fig. 7, given a 3D triangle Δv1v2v3, its 2D perspective projection is also a triangle, called Δv'1v'2v'3. We also have that all the points interior to the triangle Δv1v2v3 are projected into the interior of the triangle Δv'1v'2v'3. (If the 3D triangle is projected into the 2D plane as a line segment, this can also be considered a special case of Lemma 1.)
Fig. 7. Triangle Projection
Fig. 8. Corresponding convex hull edge in 3D (a) and 2D (b)
Theorem 3. Given a 2D point set S' [Fig. 8(b)] which is the projection of a 3D point set S [Fig. 8(a)], if v'1, v'2 are adjacent points on the 2D convex hull of S', then v1, v2 are also adjacent on the 3D convex hull of S, where v'1, v'2 are the 2D projections of the 3D points v1 and v2.
Deformation-Based Animation of Snake Locomotion

Yeongho Seol and Junyong Noh

Visual Media Lab, Graduate School of Culture Technology, KAIST
335 Gwahangno, Yuseong-gu, Daejeon 305-701, Republic of Korea
{seolyeongho,junyongnoh}@kaist.ac.kr
Abstract. A simple but very efficient method for snake locomotion generation is presented in this paper. Instead of relying on conventional physically based simulation or tedious key-framing, a novel deformation-based approach is utilized to create realistic looking snake motion patterns. Simple sinusoidal, winding, and bending functions constitute the deformation. The combination of various types of deformation becomes a powerful tool for describing the characteristic motions of a typical snake. As an example, three basic deformations and their combinations are utilized and various locomotive animations are generated with a high degree of realism. The proposed method provides an easy-to-use, fast, and interactive mechanism for an animator with little experience. The method is versatile in that it also works in conjunction with the creative input of an experienced animator for the improvement of the overall quality of the animation.
1 Introduction Thanks to the remarkable advancements in computer graphics technology, digital creatures have played important roles in recent movie making. Purely computergenerated (CG) characters such as the dinosaurs in Jurassic Park (1993) or the lion in The Chronicles of Narnia (2005) have played lead roles in movies replacing live action animals. Among the various CG creatures, a snake is one of the most common characters, as exemplified by Anaconda (1997), Anaconda 2 (2005), Snakes on a Plane (2006), The Shaggy Dog (2006), The Wild (2006), and D-war (2007). Creating a CG snake is a challenging task. Unlike familiar biped or quadruped characters, snakes have a distinctively long and flexible skeletal structure. The entire body makes contact with the substrate demonstrating unique cyclic patterns during locomotion [10]. These characteristics peculiar to snakes require a novel animation strategy for the efficient generation of realistic snake motions. Several research efforts concerning snake animation have relied mainly on physically based approaches [12, 17]. Although physical simulations report a great success by providing an automated solution for quick animation, the methods are less preferred by visual effects production studios. Interviews with digital artists have revealed that simulation-based automated approaches provide few interactive controls and rarely allow any creative input from the users. This lack of interactivity makes the purely automated approach less attractive. In contrast, relatively tedious manual keyframing is often preferred for easy editing of the pose and timing of the characters, which are essential operations in the production of visual effects. G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 646–657, 2008. © Springer-Verlag Berlin Heidelberg 2008
There are two different approaches in key-framing for snake animation. The first is to provide a set of control vertices (CVs) to maneuver the control curve of the snake body. The individual manipulation of the large number of CVs makes this approach very tedious during the creation of the desired animation. Having independent control of each CV is naïve considering the motion of CVs coupled together. A more efficient alternative to individual CV control is to employ motion paths. A NURBS curve is defined as a path through which the snake body moves. This method is well suited for the serpentine motion of a snake [10]. However, it is not easily adaptable to other characteristic locomotion patterns of a snake such as sidewinding [14], concertina [14], rectilinear [15], and swimming [23] movements. This paper describes a fast and powerful method to create locomotive animation of a snake. The goal here is to provide an animator with highly familiar but greatly improved key-frame system that is efficient enough to expedite the animation process and flexible enough to accept creative input while achieving the striking visual realism required in movie production. The control should be direct and very easy-to-use. These goals are achieved using a deformation-based approach. Simple deformation rules tailored to the motion of a snake are proposed. Each deformation is devised by faithfully referencing available research in biology [10, 11, 14, 15, 23] as well as live video footage [4, 9, 26] of snake locomotion. Single deformation or various combinations of deformations applied to the control curve of a rigged snake generate the distinctive locomotion observed in typical snakes. The control curve manipulation is performed by moving a set of CVs. The method is transparent to conventional keyframing as it also provides individual control of the CVs. The additional tweak of individual CV control can improve the animation quality. The remainder of this paper proceeds as follows: After reviewing previous work and background information in Section 2, the proposed algorithm is described in detail along with each step of the implementation in Section 3 and 4. Section 5 presents the experimental results and the methods of evaluation. Before concluding in Section 7, general issues that were found as well as possible extensions are discussed in Section 6.
2 Related Work Research on biological characteristics of a snake is very important to create realistic animation. Studies in biology have provided valuable insight into various snake motions. A set of experiments performed by [10, 14, 15] identify four main types of terrestrial locomotion exhibited by snakes [11] (Fig. 1). The four types are lateral undulation (serpentine), sidewinding, rectilinear, and concertina. In addition to terrestrial locomotion, swimming is also a distinctive locomotion type [23]. Aside from the
Fig. 1. Modes of terrestrial locomotion (a) serpentine (b) sidewinding (c) rectilinear (d) concertina
648
Y. Seol and J. Noh
motions that snakes use to travel forward, they can perform many other motions such as coiling up, striking, and climbing trees. The locomotion types of a snake are classified here into three categories (Table 1). Two criteria are chosen: the path of the head, and the existence of the cyclic undulation of the body during each type of locomotion. Path and undulation have a direct influence on the level of difficulty and time required for snake animation. The classification is also utilized for the efficient evaluation of various animation methods in Section 5. Table 1. Classification of snake locomotion
Locomotion Serpentine Rectilinear Sidewinding Concertina Swimming
Body follows path of head Yes Yes
Existence of undulation No Yes
No
Yes
Various research efforts have been made in the computer graphics community concerning animal animation and efficient animating methods. Physically based approaches are most popular for the realistic animation of CG creatures. Miller [17] used a massspring model to animate the motion of snakes and worms. Using the technique of [17], short animation “Her Majesty’s Secret Serpent” was made. Tu and Terzopoulos [25] created a virtual marine world to simulate artificial fish. Grzeszczuk and Terzopoulos [12] also employed a mass-spring model incorporating a multi-level learning process. Nougaret and Arnaldi [18] simulated the pulse-like and periodic muscle activation of jellyfish and fish. Several researches [13, 19, 27, 20] presented a physics-based bird models for its flight animation. Ma [16] built a snake-like robot simulator based on the torque and friction at each joint of a snake. Although physical simulations automatically create the desired animation and can be a great boon for a novice, they are not suitable for a professional production workflow due to the low level of user interactivity. Deformations provide a useful means of creating animations. In Barr’s work [2], deformations are easily combined in a hierarchical structure, creating complex deformations from simpler deformation elements. Free-form deformations (FFDs) [22] are a primary example of global geometric deformations. Coquillart and Jancéne [7] used animated free-form deformation (AFFD) for an interactive animation technique. AFFD makes it possible for animators to control deformations and timing interactively. Barzel [3] suggested a modular layered modeling method for flexible linear bodies such as a rope and a spring. The advantages of animated deformations are that it is a straightforward method of generating particular shapes and it makes interactive controls available to animators. Several dynamic deformation techniques have been proposed to generate secondary animations automatically. Chadwick and Haumann [5] presented a layered construction in the animation of articulated characters. Terzopoulos and Qin [24] described dynamic NURBS that deforms a NURBS curve in a physically intuitive manner in response to direct user manipulation. Faloutsos et al. [8] presented dynamic free-form deformations that extended the use of FFDs to a dynamic setting.
The advantages provided by deformation-based approaches are utilized for our snake locomotion animation. In particular, repetitive and cyclic motions of a snake are a viable candidate for deformation operation. Our approach is conceptually similar to [3], in that basic and compound deformations corresponding to animation requirements are utilized.
3 Deformation-Based Animation The proposed system is built on top of conventional key-framing methods. The specialized deformation algorithm describes characteristic shapes and motions specific to an archetypal snake. The system alleviates the tedium of conventional key-framing while expanding the expressive power. A typical snake motion is simply generated by a few mouse clicks. Instead of building a single deformation for every locomotion type, several deformations that can make a layered combination are modeled. Simple deformations and their layered combinations expand the expressive power compared to complex single deformation. The process of deformation-based animation can be summarized as follows: Given a rigged 3D snake model whose body joints are bound to a NURBS control curve (Section 3.1), a layered combination of the deformation modules is constructed on the control curve with a target animation in mind (Section 3.2). The control curve is deformed showing typical shapes of the snake, which in turn, deforms the snake model in same way. Key-framing each of the parameters of deformation over time causes the movement of the snake model. After the application of a set of deformation algorithms, the animator has freedom to pursue conventional key-framing further for finetuning if desired (Section 3.3). In the following section, each step is described in detail. 3.1 Snake Rigging The proposed method entails a specially rigged snake model. Previous research on deformations [2, 3, 7, 22] directly operates on the target geometry, often leading to changes in the total length or volume of geometry. Such an alteration is not acceptable when visual realism is of paramount importance. Physically based approaches such as mass-spring models [12, 17, 25] also share a similar problem. The snake rigging used here provides an answer to the issue of changing geometry. To represent a snake, a
Fig. 2. Snake rigging consists of a key-frame part (head area) and a deformation part (body area)
skeletal system is constructed whose joints are bound to a NURBS curve with inverse kinematics (IK). To create a desired shape, this NURBS curve is manipulated instead of the snake geometry. Handling of the NURBS curve rotates the joints to form the desired body shape. Although the length of the NURBS curve can change during this process, the total length of the body always remains constant. This type of snake rigging is advantageous for both conventional key-framing and for the proposed Modular Layered Deformation Method (MLDM), because both methods manipulate the NURBS curve to create animation. The skeletal system is divided into the two parts of the head and body (Fig. 2). The head part is controlled by manual key-framing, while the body part can be managed by both MLDM and manual key-framing. Sections 3.3 and 3.2, respectively, provide details of these two parts. An additional thin surface is bound to the ventral area of the snake as a pseudo abdomen. This indicates the ventral side of the snake geometry explicitly and makes it easy to find the distance from a CV of the control curve to the abdomen of the snake. This assists with the ground detection process described in Section 4.

3.2 Modular Layered Deformation Method

The similarity between a wave and a snake motion curve inspired the modular layered deformation method. A single deformation on a curve can be described by a waveform, while multiple deformations can be described by a composite wave. Several predefined deformation modules form layered combinations for sophisticated snake motions. Three deformations are defined, sinusoidal, winding, and bend, to represent the basic elements of locomotive patterns. Each deformation was chosen based on the analyses of Gray [10] and Taylor [23] and by referring to video references [4, 9, 26] to ensure its resemblance to an actual snake motion. Each deformation has a handle for the interactive control of its position and rotation. The deformed position of every CV is calculated by converting the original CV position of the control curve into the local space of a handle, applying the deformation, and converting the new position back into the object space of the curve:

P'_i = H \cdot F(H^{-1} \cdot P_i),   (1)

where P_i denotes the initial position of the i-th CV, H the transformation matrix of a handle, and F the deformation. In the following, the detailed implementation of each deformation is described, as are the deformation results when the parameters are varied.

1) Sinusoidal Deformation. Snake body shapes during serpentine, concertina, sidewinding, and swimming locomotion closely resemble a sinusoidal curve, as indicated by [10]. Sinusoidal curves are also applicable to rectilinear locomotion, to create the regularly spaced instances of contact of the snake with the ground. Sinusoidal deformation is specified by the equation
F_{sin}(P_i) = a \cdot d(i) \cdot \sin(\omega\, i + \phi),   (2)

0 \le d(i) \le 1,   (3)

where i is the CV's index, ω the frequency, φ the phase, a the amplitude, and d(i) a drop-off function parameterized by a constant b. See Fig. 2 for the x, y, z directions relative to the snake body. Varying each parameter in Eq. (2) produces various poses of a snake (Fig. 3-a).
Fig. 3. Poses from (a) sinusoidal deformation (b) winding deformation (c) bend deformation by varying each deformation’s parameters
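As an illustration of how such a module acts on the control curve, the following sketch applies a sinusoidal offset in the handle's local space to every CV, following Eq. (1) with the sinusoidal form of Eq. (2). The handle transform is reduced to a translation and the drop-off d(i) is a simple clamped linear ramp; both are assumptions made for illustration rather than the authors' exact formulation.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

struct SinusoidalDeform {
    double amplitude, frequency, phase;
    Vec3 handlePos;                                        // handle local origin (translation-only handle assumed)
    double dropOff(double t) const {                       // illustrative d(i), t in [0,1] along the body
        return std::min(1.0, std::max(0.0, 1.0 - t));
    }
    Vec3 apply(const Vec3& cv, int i, int numCVs) const {
        // to handle local space
        Vec3 p { cv.x - handlePos.x, cv.y - handlePos.y, cv.z - handlePos.z };
        double t = static_cast<double>(i) / (numCVs - 1);
        // sinusoidal offset of Eq. (2) applied along the handle's y axis
        p.y += amplitude * dropOff(t) * std::sin(frequency * i + phase);
        // back to object space
        return { p.x + handlePos.x, p.y + handlePos.y, p.z + handlePos.z };
    }
};

void deformCurve(std::vector<Vec3>& cvs, const SinusoidalDeform& d) {
    const int n = static_cast<int>(cvs.size());
    for (int i = 0; i < n; ++i)
        cvs[i] = d.apply(cvs[i], i, n);
}
```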
2) Winding Deformation. During sidewinding and concertina locomotion, the curvature of the snake often exceeds the scope of the sinusoidal curve [10]. Winding deformation covers the motions to which the sinusoidal curve does not apply. Winding deformation is specified by repeatedly mirroring cubic Bezier curves, linked with a user-specified frequency ω (Eq. 4), where each segment is evaluated in Bernstein form

B(t) = \sum_{j=0}^{n} \frac{n!}{j!\,(n-j)!}\,(1-t)^{\,n-j}\, t^{\,j}\, P_j,   (5)

where i is the CV's index (mapped to the curve parameter t and to a mirrored segment through the frequency ω), P_j are the positions of the Bezier points, and n = 3 in this case. Various poses can be created by varying each parameter (each point of the Bezier curve) of the winding deformation (Fig. 3-b).

3) Bend Deformation. When a snake changes its direction of progress, rears up to threaten, or vibrates its body slightly, parts of its body essentially form an arc shape. To describe this, bend deformation is created. Bend deformation is specified by the circular bend equations presented by Barr [2]:
\theta = k\,(\hat{y} - y_0), \qquad \hat{y} = \min(\max(y,\, y_{min}),\, y_{max}),   (6)

x' = x, \qquad
y' = y_0 - \sin\theta\,\Bigl(z - \tfrac{1}{k}\Bigr) +
\begin{cases} 0, & y_{min} \le y \le y_{max} \\ \cos\theta\,(y - y_{min}), & y < y_{min} \\ \cos\theta\,(y - y_{max}), & y > y_{max} \end{cases}   (7)

z' = \tfrac{1}{k} + \cos\theta\,\Bigl(z - \tfrac{1}{k}\Bigr) +
\begin{cases} 0, & y_{min} \le y \le y_{max} \\ \sin\theta\,(y - y_{min}), & y < y_{min} \\ \sin\theta\,(y - y_{max}), & y > y_{max} \end{cases}   (8)

where k is the bending rate, θ the bending angle, [y_min, y_max] the range of bending, and y_0 the center of the bending. See [2] for further details regarding bend deformation. Fig. 3-c shows the various results of bend deformation. The three deformations can be applied to the control curve repeatedly. The final position of a CV after the layered deformations is the sum of the positions produced by each single deformation module. Numerous variations can be created by varying the parameters of the combined deformations. Examples using combined deformations are presented in Fig. 4. By applying key-frames to the parameters of each deformation over time, locomotive animation is generated. Animations and their explanations are presented in Section 5.
Fig. 4. Combination of deformations (a) two sinusoidal deformations (b) horizontal sinusoidal and vertical sinusoidal deformations (c) two bend deformations (d) sinusoidal deformation and bend deformation (e) winding deformation and bend deformation
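A minimal sketch of the layered combination follows, interpreting the rule above as summing, for every CV, the offsets contributed by each deformation module relative to the rest pose; the module interface is an assumption made for illustration.

```cpp
#include <vector>

struct Vec3 { double x, y, z; };

struct DeformModule {                          // sinusoidal, winding, or bend
    virtual Vec3 offset(const Vec3& cv, int i, int n) const = 0;
    virtual ~DeformModule() = default;
};

// Final CV positions = rest positions + sum of the per-module offsets.
std::vector<Vec3> applyLayered(const std::vector<Vec3>& restCVs,
                               const std::vector<const DeformModule*>& layers) {
    std::vector<Vec3> out = restCVs;
    const int n = static_cast<int>(restCVs.size());
    for (int i = 0; i < n; ++i)
        for (const DeformModule* m : layers) {
            Vec3 o = m->offset(restCVs[i], i, n);
            out[i].x += o.x;  out[i].y += o.y;  out[i].z += o.z;
        }
    return out;
}
```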
3.3 Further Improvement

The proposed method provides flexibility in motion control. It allows individual manipulation of the CVs of the control curve after the MLDM operation that determines the CV positions as described in Eq. (1). Applying additional manual key-framing to the CVs of the body further enhances the quality of the body movement generated via MLDM. Through this step, the creative input of the animator can be freely added. Moreover, if desired, any arbitrary animation is possible. As the motion of the head part, including the mouth and the tongue, is independent of the body and is not governed by the basic deformations, conventional key-framing is incorporated for synchronized head motions.
4 Ground Detection

A snake moves on the ground, in trees, and in water. The body shape matches the curvature of the ground or object over which it is traveling. For realistic animation, it is crucial to model the interaction between the ground and the snake. Interpenetration should not occur, and the snake should not appear to float on the ground. When the snake moves on an uneven surface, the ground detection algorithm is applied. For efficiency, the control curve of the snake is utilized instead of every vertex of the snake geometry. The snake animation created in Section 3.2 represents the input of
the ground detection. Each CV of the control curve is repositioned to reflect the curvature of the ground. The same general method is applied to any surface:
1. Find the closest point on the surface from each CV (Fig. 5-1).
2. Move each CV to the closest point on the surface (Fig. 5-2).
3. Find the closest point on the pseudo abdomen from each CV and calculate the distance between the two (Fig. 5-2).
4. Move each CV along the normal direction of the surface (Fig. 5-3).
Fig. 5. Ground detection process
The closest point S(u, v) on the surface from a point p is found by Newton's iteration on the surface parameters (u, v), using the vector r(u, v) = S(u, v) − p, with the following conditions:

r(u, v) \cdot S_u(u, v) = 0, \qquad r(u, v) \cdot S_v(u, v) = 0.

This approach rapidly produces visually pleasing animations. Checking only the CVs instead of every vertex of the snake mesh achieves real-time performance.
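The four ground-detection steps can be sketched as below; the closest-point queries on the ground surface and on the pseudo abdomen (e.g., via the Newton iteration just described) are assumed as placeholders, and the surface normal is assumed to be unit length.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

struct SurfaceHit { Vec3 point; Vec3 normal; };            // normal assumed unit length

SurfaceHit surfaceClosestPoint(const Vec3& cv);            // assumed (Newton iteration)
Vec3 abdomenClosestPoint(const Vec3& cv);                  // assumed (pseudo abdomen)

static double dist(const Vec3& a, const Vec3& b) {
    return std::sqrt((a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y) + (a.z-b.z)*(a.z-b.z));
}

void snapToGround(std::vector<Vec3>& cvs) {
    for (Vec3& cv : cvs) {
        SurfaceHit hit = surfaceClosestPoint(cv);          // step 1
        double h = dist(cv, abdomenClosestPoint(cv));      // step 3: CV-to-abdomen offset
        cv = hit.point;                                    // step 2
        cv.x += hit.normal.x * h;                          // step 4: lift along the normal
        cv.y += hit.normal.y * h;
        cv.z += hit.normal.z * h;
    }
}
```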
5 Results

Using the techniques described in Sections 3 and 4, realistic animations of various snake motions were created. Figs. 6-13 show pairs of frames from an accompanying demo video clip, followed by detailed explanations. Fig. 6 shows serpentine locomotion. The entire body and tail faithfully follow the path taken by the head and neck, as indicated by Gray [10]. Two sinusoidal deformations constitute a natural, non-regular sinusoidal path. Fig. 7 shows sidewinding. A combination of a horizontal sinusoidal deformation and a bend deformation produces a sidewinding curve. A vertical sinusoidal deformation is responsible for generating several contact points of the body with the ground. Another type of sidewinding (Fig. 8) is created using a winding deformation. The snake progresses with a steep body angle and shows a greater area of the body having static contact with the substrate. Fig. 9 demonstrates a pushing motion against a substrate during concertina locomotion. Two sinusoidal deformations are placed on the anterior and posterior of the snake, respectively. Cyclic, regularly spaced contacts with the substrate during rectilinear locomotion are easily created using multiple sinusoidal deformations (Fig. 10). Waves on the dorsum traveling in anterior-to-posterior cycles are generated by adjusting the timing. The amplitude of the lateral waves increases from the head to the tail when the animated snake is swimming (Fig. 11). A simple drop-off function (Eq. 3) serves this purpose. The proposed method is also applicable to various snake motions other than locomotion. Figs. 12 and 13 show the coiling-up and striking motions. The technique was also applied to an actual short animation, "Jungle Duel" (Fig. 14). The realistic locomotion and striking motion of a snake were created very efficiently using the proposed method. The accompanying video shows further demonstrations of these motions.
Fig. 6. Serpentine
Fig. 7. Sidewinding-1
Fig. 8. Sidewinding-2
Fig. 9. Concertina locomotion
Fig. 10. Rectilinear locomotion
Fig. 11. Swimming locomotion
Fig. 12. Coil motion
Fig. 13. Striking motion
Fig. 14. Striking motion (left) and locomotion (right) of a snake from "Jungle Duel"
To evaluate the efficiency of the proposed approach, a user test was performed. One locomotion type was chosen from each category in Table 1. One hundred and fifty frame scenes of serpentine, rectilinear, and swimming motions were chosen as
the target animations. An animator with three years of experience was asked to create the same animations using 1) manual CV manipulation, 2) the motion path method, and 3) the proposed deformation-based method. The test was conducted on a PC with an Intel Pentium D 2.66 GHz, 2 GB RAM, a GeForce 8800 GTS, and Autodesk Maya 8.5 as the animation software. As CV manipulation requires a great amount of time to animate a snake, the animator was allowed to create one cycle of animation and proportionally multiply the result to match a 150-frame animation. The control curve of the snake consists of fifteen CVs. Table 2 shows the results of the user test.

Table 2. Result of user test (# of key-frames / minutes)
Target        Manual      Motion Path   Deformation-based Animation
Serpentine    305 / 88    2 / 2         2 / 2
Rectilinear   285 / 112   Impossible    82 / 15
Swimming      315 / 128   Impossible    4 / 3
Table 2 clearly indicates that the proposed method is much faster and requires far fewer keys than the manual CV manipulation method. Furthermore, the proposed method is able to create every type of snake locomotion summarized in Table 1, which the motion path method cannot do. The remarkable speed and versatility of the proposed approach make it an ideal solution for snake animation. It is strongly believed that this method is highly effective in the production of visual effects.
6 Discussion and Future Work

The proposed method still has room for improvement. Complex path animations are difficult to generate. A complex path is created by the combination of several bend deformations. However, due to the circular nature of the bend equations, applying more than three bend deformations to the same object inevitably overlaps their deformation spaces. The snake then follows an unwanted path caused by the overlap. This issue can be investigated in future research. To create animation that is more realistic, the proposed deformation approach was combined with a physical simulation. A hair simulation technique [1] was applied to the tail control curve, the most irregular part during locomotion. Deformations were applied to the body except for the tail, and the tail followed the body movement as a secondary motion governed by its simulation parameters. The results showed natural movement, but considerable time was required to determine the appropriate simulation parameters. Currently, the deformation-based animation is designed only for characters with a long and simple body structure, such as that observed in snakes. The joint structures of biped or quadruped characters are too complicated to enjoy the efficiency provided by the deformation method. However, we believe it is worth looking into such complicated characters in future research.
7 Conclusion

A deformation-based animation system that creates animations of snake locomotion is presented. Instead of utilizing a physically based simulation or traditional key-framing, algebraically designed deformations are applied to a rigged snake model. The method is conceptually simple and provides interactive control to animators. It was shown to be effective in terms of both animation time and quality. Instead of building a single deformation for every locomotion type, the method provides three basic deformations that can be combined. Sinusoidal, winding, and bend deformations are presented to model the biological analyses and video references accurately. The generation of typical locomotion patterns was shown by combining the three types of deformation. A fast ground detection algorithm was also presented so that the snake follows the curvature of the ground while maintaining its locomotion patterns. This method is versatile in that it provides an easy-to-use, fast, and interactive mechanism for both novice and experienced animators.
References 1. Anjyo, K., Usami, Y., Kurihara, T.: A simple method for extracting the natural beauty of hair. ACM SIGGRAPH Computer Graphics 26(2), 111–120 (1992) 2. Barr, A.H.: Global and local deformations of solid primitives. In: Proc. 11th annual conference on Computer graphics and interactive techniques, pp. 21–30 (1984) 3. Barzel, R.: Faking dynamics of ropes and springs. IEEE Computer Graphics and Applications 17(3), 31–39 (1997) 4. Beynon, M.: Wildlife specials: serpent. BBC (2007) 5. Chadwick, J.E., Haumann, D.R., Parent, R.E.: Layered construction for deformable animated characters. ACM SIGGRAPH Computer Graphics 23(3), 243–252 (1989) 6. Colin, M.: Reptile. DK (2000) 7. Coquillart, S., Jancéne, P.: Animated free-form deformation: an interactive animation technique. In: Proc. 18th annual conference on Computer graphics and interactive techniques, pp. 23–26 (1991) 8. Faloutsos, P., Van de Panne, M., Terzopoulos, D.: Dynamic free-form deformations for animation synthesis. IEEE Trans. Visualization and Computer Graphics 3(3), 201–214 (1997) 9. Foster, C.F., Foster, R.: Land of the Anaconda. National Geographic (1999) 10. Gray, J.: The Mechanism of Locomotion in Snakes. J. Exp. Biol. 23(2), 101–120 (1946) 11. Greene, H.W.: Snakes, California (1997) 12. Grzeszczuk, R., Terzopoulos, D.: Automated learning of muscle-actuated locomotion through control abstraction. In: Proc. 22nd annual conference on Computer graphics and interactive techniques, pp. 63–70 (1995) 13. Haumann, D.R., Hodgins, J.K.: The control of hovering flight for computer animation. Springer Computer Animation Series, pp. 3–19 (1992) 14. Jayne, B.C.: Muscular mechanisms of snake locomotion: an electromyographic study of the sidewinding and concertina modes of Crotalus cerastes, Nerodia fasciata and Elaphe ob-soleta. J. Exp. Biol. 140(1), 1–33 (1988) 15. Lissmann, H.W.: Rectilinear Locomotion in a Snake (Boa Occidentalis). J. Exp. Biol. 26(4), 368–379 (1950)
16. Ma, S.: Analysis of creeping locomotion of a snake like robot. Advanced Robotics 15(2), 205–224 (2001) 17. Miller, G.S.P.: The motion dynamics of snakes and worms. ACM SIGGRAPH Computer Graphics 22(4), 169–173 (1988) 18. Nougaret, J.L., Arnaldi, B.: Pulse-modulated locomotion for computer animation. Computer Animation and Simulation 1995. In: Proc. Eurographics Workshop in Maastricht, The Netherlands, pp. 154–164. Springer, Wien (1995) 19. Ramakrishnananda, B., Wong, K.C.: Animating bird flight using aerodynamics. The Visual Computer 15(10), 494–508 (1999) 20. Reynolds, C.W.: Flocks, herds and schools: A distributed behavioral model. ACM SIGGRAPH Computer Graphics 21(4), 25–34 (1987) 21. Schneider, P.J., Eberly, D.H.: Geometric Tools for Computer Graphics. Morgan Kaufmann, San Francisco (2003) 22. Sederberg, T.W., Parry, S.R.: Free-form deformation of solid geometric models. ACM SIGGRAPH Computer Graphics 20(4), 151–160 (1986) 23. Taylor, G.: Analysis of the Swimming of Long and Narrow Animals. Proc. Royal Society of London. Series A, Mathematical and Physical Sciences 214(1117), 158–183 (1952) 24. Terzopoulos, D., Qin, H.: Dynamic NURBS with geometric constraints for interactive sculpting. ACM Trans. Graph 13(2), 103–136 (1994) 25. Tu, X., Terzopoulos, D.: Artificial fishes: physics, locomotion, perception, behavior. In: Proc. 21st annual conference on Computer graphics and interactive techniques, pp. 43–50 (1994) 26. Whitaker, R., Matthews, R.: King Cobra. National Geographic (1997) 27. Wu, J., Popović, Z.: Realistic modeling of bird flight animations. ACM Trans. Graph 22(3), 888–895 (2003)
GPU-Supported Image Compression for Remote Visualization – Realization and Benchmarking

Stefan Lietsch and Paul Hermann Lensing

University of Paderborn
Paderborn Center for Parallel Computing
{slietsch,plensing}@upb.de
Abstract. In this paper we introduce a novel GPU-supported JPEG image compression technique with a focus on its application for remote visualization purposes. Fast and high quality compression techniques are very important for the remote visualization of interactive simulations and Virtual reality applications (IS/VR) on hybrid clusters. Thus the main goals of the design and implementation of this compression technique were low compression times and nearly no visible quality loss, while achieving compression rates that allow for 30+ Frames per second over 10 MBit/s networks. To analyze the potential of the technique and further development needs and to compare it to existing methods, several benchmarks are conducted and described in this paper. Additionally a quality assessment is performed to allow statements about the achievable quality of the lossy image compression. The results show that using the GPU not only for rendering but also for image compression is a promising approach for interactive remote rendering.
1  Introduction, Related and Previous Work
Hybrid cluster systems offer great potential for new applications apart from traditional HPC applications. Especially interactive simulations and virtual reality applications can greatly benefit from the distributed and powerful computing resources of modern cluster systems, as shown, for example, in [1]. However, the problem remains that such universal and centralized computing resources are not very accessible. They often reside in dedicated computing facilities and are only connected to a few sophisticated visualization devices such as a tiled wall or a CAVE. A remote user has no practical access to the graphical output of such systems. This is where systems for remote visualization (RV) come into play. There exist several classes of RV software as described in [2]; however, only the third class (rendering of 3D applications on the server) is really suited for the remote access to IS/VR applications on hybrid clusters. Only this class addresses the main requirements of these applications, which are low latencies and high quality over standard networks (10-100 Mbit/s). One of the best systems in this class regarding compression speed and quality is the VirtualGL framework [3]. Besides two modes for uncompressed transmission of the rendered data (as raw
RGB stream or as X11 stream), the standard compression mode is based on the JPEG still image compression. For JPEG compression VirtualGL offers two implementations. The first is the pseudo-standard libjpeg [4] implementation which is completely open source and allows decent performance. The second is called TurboJPEG [5] and is based on the Intel(R) Integrated Performance Primitives [6], a set of libraries which contain highly-optimized multimedia functions for x86 processors. According to the developers of VirtualGL, it outperforms the libjpeg implementation by a factor of 2-4. However, it is still limited in speed when it comes to very high resolutions (above 1600x1200) or tiled or distributed rendering, and it heavily relies on and exhausts the CPU's processing power. Thus, in [2] we introduced the Invire framework to implement, test, evaluate and improve techniques that help to overcome the problems and limitations described before and will allow the successful and flexible usage of remote visualization for IS/VR applications on hybrid cluster systems. It is the first framework of its kind that uses the graphics card not only for the rendering of the frames but also for compression. This has two advantages: 1. Instead of large raw-data RGB frames, only compressed image data needs to be read back from the graphics card, which eliminates one factor of latency. 2. The powerful GPU can be used to perform compression on a thread-parallel basis. In the following we describe the realization of a JPEG-based compression method which follows this principle, show benchmarks that provide insight into its performance, and compare it to existing technologies.
1.1  CUDA
CUDA [7] is a combined software and hardware architecture (available for NVIDIA G80 GPUs and above) which enables data-parallel general purpose computing on graphics hardware. It offers a C-like programming API with some language extensions. The architecture offers support for massively multi-threaded applications and provides mechanisms for inter-thread communication and memory access. The API distinguishes between host and device domains and offers access to fast caches on the device side. The implemented method of thread partitioning allows for the execution of multiple CUDA applications (kernels) on one GPU. Each kernel is executed by a grid of thread blocks. A block consists of a batch of threads that can cooperate by efficiently sharing data through some fast shared memory and synchronizing their execution to coordinate memory accesses. The threads in the block are addressed through one-, two- or three-dimensional IDs. This makes it possible to uniquely identify and assign tasks to each thread. Another feature of the CUDA architecture is the interoperability with graphics APIs (OpenGL and Direct3D), which makes it possible to use, for example, rendered images as input to CUDA kernels. Since this data already resides on the graphics device, it only needs to be copied within the device to be processed by CUDA.
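As a minimal illustration of this programming model, the following CUDA sketch (our own, not part of the Invire code; kernel and variable names are hypothetical) shows how a block of 8 x 8 threads, each addressed by a two-dimensional ID, stages one tile of pixel data in shared memory before working on it:

```cuda
// Each block of 8x8 threads handles one 8x8 pixel tile; every thread
// loads exactly one pixel into fast on-chip shared memory.
__global__ void processTile(const uchar4* in, uchar4* out, int width)
{
    __shared__ uchar4 tile[8][8];

    int x = blockIdx.x * 8 + threadIdx.x;      // global pixel coordinates
    int y = blockIdx.y * 8 + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();                           // whole tile now visible to the block

    // ... per-tile processing on shared memory would go here ...

    out[y * width + x] = tile[threadIdx.y][threadIdx.x];
}

// Host-side launch over a frame whose dimensions are multiples of 8:
// processTile<<<dim3(width / 8, height / 8), dim3(8, 8)>>>(d_in, d_out, width);
```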
2  Parallel, CUDA-Based JPEG Compression
The most common format for still image compression is the JPEG standard [8]. It uses several attributes of human vision to eliminate unnecessary or minor information from the images and combines them with traditional compression algorithms. Since it delivers high compression rates with only moderately complex algorithms, it fits the remote rendering scenario very well. Additionally, many of its compute-intensive parts are well suited for parallelization, since they can be calculated independently for one pixel or small groups of pixels. However, the problem is that currently existing implementations are still too slow for use in highly interactive remote visualization scenarios. Thus, this work proposes a novel approach for doing fast JPEG compression on the GPU using the CUDA API. The CUDA-based JPEG compression is logically separated into two CUDA kernels. The first kernel computes the color conversion and the downsampling, and the second kernel is responsible for the Discrete Cosine Transformation (DCT) and the quantization of the DCT coefficients. This partitioning is the best compromise between maximizing the number of threads per block and minimizing expensive read and write operations from and to global memory. The maximum number of threads per block is determined by the number of independent operations on a certain amount of data. It is obvious that color conversion can be done independently for every pixel, thus it is best to use the maximum number of threads per block. The CUDA Programming Guide [7] recommends 64-256 threads per block as the best value for current hardware (this may change for new revisions of GPUs, since they may have more parallel execution units). To keep to the JPEG standard, a block size of 8x8 pixels was chosen, which results in 64 threads for the color conversion step. The downsampling step uses 4 pixels to compute their mean Cb and Cr values, thus it would be best to also use 64 threads per block and assign each thread to 4 pixels. However, it is important to reduce global memory access in CUDA kernels since they consume 200-300 clock cycles in contrast to 4 cycles for a memory access to shared memory. After reading and storing 64 pixels in shared memory for color conversion, it is better to reuse the results for downsampling with just 32 threads, instead of writing the results to global memory and starting a new kernel which reads them in again. Downsampling is performed for both the Cb and the Cr components, therefore 32 threads can compute that step in parallel. The following enumeration briefly describes the computation steps of the first kernel: 1. At first the 64 RGB pixel values are loaded into shared memory in parallel by 64 threads. The RGB values are stored in a one-dimensional array. To be able to compute the downsampling, two-dimensional 4x4 pixel blocks are needed and DCT and quantization require 8x8 pixel blocks. Thus, the pixels are stored in 8x8 blocks in shared memory to allow the subsequent computations on the same data structure.
2. The color conversion is implemented through three equations which compute the YCbCr values from any given RGB pixel. The equations are: Y = 0.29900 * R + 0.58700 * G + 0.11400 * B - 128; Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B; Cr = 0.50000 * R + 0.41869 * G - 0.08131 * B. The constants used in these equations are those used in the libjpeg implementation. All three equations are computed sequentially by one thread for each pixel. 3. The downsampling of the Cb and Cr components is done by 32 threads in parallel. 16 threads compute the mean of a 2x2 pixel block for the Cb values and the other 16 threads work on the same pixel block for the Cr values. The mean is computed by adding all 4 values and dividing the result by 4. The division is replaced by a shifting operation to optimize performance. 4. The last step is to store the downsampled YCbCr values in global memory. Since DCT and quantization are done independently on each component, they are already stored separately. The Y values are stored in consecutive 8x8 pixel blocks which are mapped to a one-dimensional array. The same holds for the Cb and the Cr components. However, the resulting 4x4 blocks of the downsampling need to be grouped into 8x8 blocks to prepare the data for DCT. After kernel 1 has finished for all blocks, the second kernel is started. Each component (Y, Cb and Cr) is processed separately by one call to kernel 2. It has 8 threads per block and does the following: 1. Loading the data is straightforward since it is already stored in the required 8x8 pixel blocks. Each of the 8 threads loads one row of a block into shared memory. 2. To avoid the complex and sequential computation of a 2D DCT, it can be split into the computation of several 1D DCTs. The computation of the DCT for each row is independent and can be done by 8 threads in parallel. Thereafter, those 8 threads compute the DCTs of the columns on the results of the preceding step. The actual computation of the 1D coefficients is described by the following equations:
S_i = (C_i / 2) · Σ_{x=0..7} s_x · cos((2x + 1) i π / 16), for 0 ≤ i ≤ 7, with C_i = 1/√2 for i = 0 and C_i = 1 for i > 0, where s_x denotes the x-th input sample of the row (or column) being transformed.
The computation would cost 896 additions and 1024 multiplications per 8x8 pixel block, but this direct evaluation can be replaced by the algorithm developed by Arai et al. and described in [9]. The coefficients computed by this algorithm need to be scaled by a constant factor to obtain the final coefficients. Since this factor is constant, it can easily be integrated into the constant quantization tables in the next step. By using this optimized 1D DCT algorithm, the 2D DCT of an 8x8 pixel block can be computed with 464 additions and 80 multiplications.
3. Quantization is a simple division by a constant for every coefficient in the 8x8 block. Quantization tables are similar for all blocks of a component. Before they are applied in kernel 2, the tables are multiplied by the scaling values of the DCT step as described before. Additionally, the selected compression quality influences the values of the quantization tables. The scaled and adapted quantization tables are applied to the coefficients by 8 threads in parallel. This could also be done by 64 threads, but in order to avoid copying data back and forth to global memory, the 8 threads of the DCT steps are reused. Again, each thread computes one line of 8 DCT coefficients and multiplies them by the inverse of the corresponding value in the quantization table. 4. Finally, the quantized DCT coefficients are written back to global memory in 8x8 blocks for each component. The last step of the JPEG compression is the Huffman encoding, which makes use of the many zeros resulting from DCT and quantization. This step cannot be computed efficiently in parallel (there are approaches to parallelizing Huffman encoding, but they work only under assumptions that contradict the JPEG standard; see, for example, [10] and [11], pp. 263f). Thus, the optimized sequential version of the libjpeg implementation is used to compute this step. As a side effect, it is possible to use the libjpeg data structures to automatically generate a JPEG-conformant header. After initializing the library and allocating a jpeg_compress_struct, the function jpeg_write_coefficients is used to pass the precomputed coefficients to the Huffman encoding facility of the libjpeg. The result after calling jpeg_finish_compress is a standards-conformant JPEG image in a specific memory location. A CUDA-based decompression is also available and mainly consists of the inverse steps of the encoding (i.e., Huffman decoding, inverse DCT and color conversion from YCbCr to RGB). However, the main focus of this implementation was to show the feasibility of a fast parallel JPEG encoding to achieve high frame rates for RV. Decoding is quite fast in the sequential case already, since Huffman decoding is faster and the quantization step is omitted; thus the parallel decompression does not achieve a significant performance increase.
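To make the description of the first kernel more concrete, the following CUDA sketch shows how its color conversion and 2x2 chroma downsampling could look. It is our own simplified illustration, not the Invire code: it assumes frame dimensions that are multiples of 8, an input layout with (x, y, z) = (R, G, B), and per-tile output arrays Y, Cb, Cr chosen for brevity.

```cuda
// Sketch of "kernel 1": RGB -> YCbCr plus 2x2 chroma averaging for one
// 8x8 tile per thread block (64 threads), following the steps in the text.
__global__ void colorConvertDownsample(const uchar3* rgb, int width,
                                       float* Y, float* Cb, float* Cr)
{
    __shared__ float y[8][8], cb[8][8], cr[8][8];

    int tx = threadIdx.x, ty = threadIdx.y;                 // 8x8 = 64 threads
    int gx = blockIdx.x * 8 + tx, gy = blockIdx.y * 8 + ty;
    uchar3 p = rgb[gy * width + gx];

    // libjpeg-style conversion constants quoted in the text.
    y [ty][tx] =  0.29900f * p.x + 0.58700f * p.y + 0.11400f * p.z - 128.0f;
    cb[ty][tx] = -0.16874f * p.x - 0.33126f * p.y + 0.50000f * p.z;
    cr[ty][tx] =  0.50000f * p.x + 0.41869f * p.y - 0.08131f * p.z;
    __syncthreads();

    int tile = blockIdx.y * (width / 8) + blockIdx.x;       // tile index
    int lane = ty * 8 + tx;                                 // 0..63

    // 2x2 averaging: 16 threads handle Cb, 16 handle Cr, the rest are idle.
    if (lane < 32) {
        int i = (lane % 16) / 4, j = (lane % 16) % 4;       // 4x4 output position
        const float* src = (lane < 16) ? &cb[0][0] : &cr[0][0];
        float mean = 0.25f * (src[(2*i) * 8 + 2*j]     + src[(2*i) * 8 + 2*j + 1] +
                              src[(2*i+1) * 8 + 2*j]   + src[(2*i+1) * 8 + 2*j + 1]);
        ((lane < 16) ? Cb : Cr)[tile * 16 + i * 4 + j] = mean;
    }

    // Luma is kept at full resolution, one value per thread.
    Y[tile * 64 + lane] = y[ty][tx];
}
// Launch: colorConvertDownsample<<<dim3(w / 8, h / 8), dim3(8, 8)>>>(...);
```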
3  Benchmarking and Quality Assessment
This section deals with comparative benchmarks of the three major JPEG compression implementations (libjpeg, the proposed CUDA-based JPEG and turboJPEG). In addition to the pure performance benchmarking, a quality assessment for the JPEG-based compression methods is conducted, using the SSIM (Structural SIMilarity) index, which was introduced by Wang et al. in [12]. All benchmarks were carried out on two different sample applications and with varying parameters (e.g., frame resolution, JPEG quality) that influence the performance and the quality of the remotely rendered frames. The hardware that was used to perform the benchmarks, as well as the sample applications, which represent two classes of applications, are described in the following section.
3.1  The Benchmarked System and the Sample Applications
In order to prove the theoretical concept and to compare the described compression algorithms, we prototypically implemented the Invire system and tested it in a sandbox environment. The server and the client parts run on computers equipped with a CUDA-ready GeForce 8800 GTS (G92) graphics card by NVIDIA. It has 12 multiprocessors and a warp size (the number of threads that are executed in parallel on one multiprocessor) of 32. That leads to k = 12 x 32 = 384 threads that can run concurrently. Both computers are equipped with an Intel Core 2 Duo E4400 CPU running at 2.00 GHz and 2 GB of RAM. The network connection is a 100 Mbit Ethernet connected through a switch. The peak nominal bandwidth that could be achieved over this network was roughly 11.8 MB/s. Two sample applications were chosen to represent groups of applications. The following criteria played a role in the selection:
The mean heterogeneity of two consecutive frames: This parameter determines if an application is highly dynamic or rather static. In the field of IS/VR there exist both kinds of applications, for example rather static CAD visualizations or highly dynamic virtual worlds with rich visual effects. Thus, it was important to choose one application of each class. The heterogeneity could be measured with the help of the SSIM index.
Multicolored or greyscale frames: Again, there are applications in IS/VR that have either the one or the other attribute, e.g., greyscale medical imaging visualizations vs. high dynamic range VR environments.
Large uniformly colored areas vs. very heterogeneous multicolored areas: as, for example, in CAD applications vs. highly dynamic virtual worlds with rich visual effects.
Following these criteria, two sample applications were selected:
A rotating teapot (see figure 1a)) as a representative of the area of rather static object visualization applications. It simulates a steady rotation of a greyscale 3D model, with one fixed light source. The background is uniformly black and not lit. This sample application has quite a low heterogeneity between two consecutive frames (mean SSIM index between two consecutive frames 89.35%), greyscale frames and at least one large uniformly colored area (the black background). These attributes describe a rather typical CAD or 3D object visualization application.
The Virtual Night Drive Simulator (VND) (see figure 1b)) as a representative of the area of highly dynamic virtual worlds. The VND specializes in simulating automotive headlights at night and uses the shaders of the graphics card to calculate the luminance intensity per pixel. It also features a daylight view of the scene which is very detailed and feature-rich and
Fig. 1. Test cases: (a) simple teapot, (b) Virtual Night Drive without headlight simulation
offers dynamic lighting features. This sample application has quite a high heterogeneity between two consecutive frames (mean SSIM index between two consecutive frames 53.89%), multicolored frames and very heterogeneous multicolored areas because of the extensive use of textures. These attributes describe a rather typical VR application. By selecting those two representatives of different application areas, the benchmarking results can give information about how well each of the compression algorithms and the overall system might perform depending on the selected application. Possibly there are algorithms that are more suited for one or the other area. There certainly are more groups of applications in IS/VR, but the ones selected represent the biggest groups with the most obvious differences. The specifications of most of the other application groups lie in between these two extremes.
3.2  Benchmarking Three Different JPEG Implementations
To get deeper insight into the raw performance of the different JPEG implementations of Invire, the following benchmarks compare the pure compression times of the libjpeg [4], CUDA-based JPEG (see section 2) and turboJPEG [5] implementations. All three implementations were tested in two different resolutions with the two sample applications (VND and rotating teapot) described above. For each measurement 1000 samples were taken, and the mean of these samples as well as the minimum and maximum of 98% of all measurements are displayed in the diagrams. For the CUDA-based method the overall mean time is composed of the mean CPU and GPU compression times. All times are measured in milliseconds. Figure 2 depicts the comparison of the three JPEG implementations for both applications in 640 x 480 pixels. The libjpeg implementation takes the most time for compression (about 14 ms for teapot and 18 ms for VND), the CUDA-based version is about 60% faster (9 ms teapot and 10 ms VND), and turboJPEG outperforms CUDA-based JPEG by a factor of 2 (4 ms teapot and 5 ms VND). For CUDA-based compression an interesting shift between GPU and CPU times for the applications can be observed: the greyscale teapot application produces DCT coefficients that are much easier to compress since most
(Bar charts: mean compression times in milliseconds for libjpeg, CUDA-based JPEG and TurboJPEG; the CUDA-based bars are split into their CPU and GPU shares.)
Fig. 2. Comparison of the JPEG implementations for Teapot (a) and Virtual Night Drive (b) applications in 640 x 480 Pixels
(Bar charts: mean compression times in milliseconds for libjpeg, CUDA-based JPEG and TurboJPEG at the higher resolution; the CUDA-based bars are split into their CPU and GPU shares.)
Fig. 3. Comparison of the JPEG implementations for Teapot (a) and Virtual Night Drive (b) applications in 1680 x 1050 Pixels
of the color values are 0; thus the Huffman coding, which is performed on the CPU, executes faster for this application than for the more elaborate compression of the coefficients of the VND application. Therefore, the time needed for GPU calculation is nearly constant for both applications, but the time spent on the CPU varies heavily. The time for GPU computation is still higher than the total time for turboJPEG compression. Theoretically achievable frame rates are about 70/56 fps (Teapot/VND) for libjpeg, 115/95 fps for CUDA-based JPEG and 280/221 fps for turboJPEG. Figure 3 depicts the comparison of the three JPEG implementations for both applications in 1680 x 1050 pixels. For this high resolution, a similar performance ratio as for the 640x480 resolution (1:2:4) can be observed. However, libjpeg has very high variations in the VND application. This might be caused by memory or cache limitations for the high amount of colored pixels (1.76 MPixels) to process. Theoretically achievable frame rates are about 14/9 fps (Teapot/VND) for libjpeg, 25/21 fps for CUDA-based JPEG and 57/42 fps for turboJPEG. Thus, the libjpeg implementation already is the limiting factor for reaching more than 20 fps at this resolution.
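These theoretical frame rates are simply 1000 ms divided by the measured per-frame compression time; for example, the CUDA-based teapot case at 640 x 480 needs about 3.9 ms on the CPU plus 4.8 ms on the GPU, i.e., roughly 8.7 ms per frame and thus 1000/8.7 ≈ 115 frames per second.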
3.3  Quality Assessment of the Lossy Compression Methods
The following section describes quality assessments of the three lossy (JPEG) compression methods, each conducted with the two sample applications. All JPEG implementations were benchmarked with different quality settings q (q = 25, q = 75 and q = 100), and the SSIM index as well as the compression ratio in bytes per pixel were recorded. The quality setting directly influences compression quality and ratio by altering the quantization table and thereby determining the number of DCT coefficients that are eliminated: q = 0 gives the highest compression with strongly visible artifacts, while q = 100 gives the lowest compression, the best quality, and the worst compression ratio.
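For reference, the SSIM index of [12] compares two image patches x and y through their means, variances, and covariance (C_1 and C_2 are small stabilizing constants):
SSIM(x, y) = ((2 μ_x μ_y + C_1)(2 σ_xy + C_2)) / ((μ_x^2 + μ_y^2 + C_1)(σ_x^2 + σ_y^2 + C_2)).
Values close to 1 (100%) indicate that the compressed frame is structurally almost indistinguishable from the original.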
(Bar charts: SSIM index in % and compression ratio in bytes per pixel for the teapot application, for the three implementations at q = 25, 75 and 100.)
Fig. 4. SSIM index (a) and compression ratio (b) of the three JPEG implementations for the Teapot application. The bars represent the selected JPEG quality settings for each implementation: left q = 25, middle q = 75 and right q = 100.
Figure 4 shows those results in two diagrams for the teapot application. First of all, it is visible that all three implementations produce nearly identical SSIM indices and compression ratios for equal JPEG quality settings. This is expected, since all three versions implement the same basic algorithms (variations only in parallelization and hardware acceleration). Small deviations may result from error corrections or rounding differences between the utilized APIs/compilers (GNU/IPP/CUDA) and hardware (CPU/GPU). For the teapot application it is observable that even for very low JPEG quality settings (q = 25) the SSIM index is around 95% while achieving a great compression ratio (0.03 bytes per pixel in contrast to 4 bytes per pixel of the raw image). Higher quality settings result in even higher SSIM indices; the best SSIM index / compression ratio trade-off is achieved for q = 75, as shown in the charts. Figure 5 shows the quality and compression ratio results for the VND application. They are analogous to those for the teapot application. However, it is observable that the lowest JPEG quality setting (q = 25) produces frames with an SSIM index below 90%. That means that there are visible distortions in the frames that might disturb remote users of the application. For higher values of q the SSIM index improves and reaches about 92% for q = 75. However, the compression ratio also rises by a factor of 4 between q = 25 and q = 75. Setting q = 100 further improves the SSIM index a bit, but this improvement is hardly noticeable to the human eye.
(Bar charts: SSIM index in % and compression ratio in bytes per pixel for the VND application, for the three implementations at q = 25, 75 and 100.)
Fig. 5. SSIM index (a) and compression ratio (b) of the three JPEG implementations for the VND application
4  Conclusion
This paper introduced a novel approach to JPEG image compression for remote visualization systems. These systems need to provide fast compression with ideally no visible quality loss in order to offer a remote interactive interface for IS/VR applications. Therefore, we proposed to utilize the powerful GPU to compute the compression. By parallelizing a well-known image compression algorithm (JPEG) and adapting it to the special GPU hardware, substantial performance increases and a significant relief of the CPU load could be achieved. Additionally, by outsourcing the image compression to the GPU and thereby avoiding the need to read back large uncompressed frames through the limited host interface, a further source of latency could be eliminated. All these improvements help to achieve the goal of providing a seamless graphical remote interface without restrictions in latency or quality. The second part of the paper describes the benchmarking and the quality assessment of three different JPEG implementations (libjpeg, the proposed CUDA-based JPEG and turboJPEG). The performance comparison of the three JPEG variants shows that the highly optimized turboJPEG implementation is currently the fastest way to compress JPEG-conformant images. It uses the Intel Performance Primitives library and thus can fully exploit the power of the Intel dual-core processor built into the testing machine. The CUDA-based JPEG implementation introduced in this paper, however, partially relies on the unoptimized code of the libjpeg implementation. Especially the sequential Huffman encoding used by this implementation seems to limit the performance for high resolutions and multi-colored frames. The time spent on the GPU for the compute-intensive steps (color conversion, downsampling, DCT and quantization) is less than the overall time used by the fast turboJPEG implementation for resolutions of 1024x768 and higher. This is a promising starting point for further optimizing the CUDA-based implementation and designing a CUDA-supported Huffman encoding in order to outperform the highly optimized turboJPEG implementation. In addition, the GPU-based methods strongly relieve the CPU and leave it available for other tasks. The libjpeg implementation is by far the slowest, but the most feature-rich and universal, implementation. However, it is not well suited for the
application in remote visualization of IS/VR, since it limits the overall system performance for high resolutions by its slow compression times. The main finding of the quality measurement is that all three JPEG implementations produce very high quality compressed images, even with low JPEG quality settings. However, to ensure that the resulting frames consistently have SSIM indices above 90%, q needs to be set to 75 or higher. It is planned to extend the Invire system with an automatic quality assessment component, which dynamically measures the quality of the compressed frames and adapts q to the results of this measurement. This either helps to save bandwidth or to provide the users with optimal quality.
References
1. Lietsch, S., Zabel, H., Berssenbruegge, J.: Computational Steering of Interactive and Distributed Virtual Reality Applications. In: ASME CIE 2007: Proceedings of the 27th ASME Computers and Information in Engineering Conference, ASME (2007)
2. Lietsch, S., Marquardt, O.: A CUDA-Supported Approach to Remote Rendering. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Paragios, N., Tanveer, S.-M., Ju, T., Liu, Z., Coquillart, S., Cruz-Neira, C., Müller, T., Malzbender, T. (eds.) ISVC 2007, Part I. LNCS, vol. 4841, pp. 724–733. Springer, Heidelberg (2007)
3. VirtualGL: The VirtualGL Project (2007), http://www.virtualgl.org/
4. Independent JPEG Group: libjpeg (Open Source JPEG library) (2008), http://www.ijg.org
5. VirtualGL: TurboJPEG 1.10 - Intel IPP accelerated JPEG compression (2008), http://sourceforge.net/project/showfiles.php?group_id=117509&package_id=166100
6. Intel: Intel Integrated Performance Primitives 5.3 (2008), http://www.intel.com/cd/software/products/asmo-na/eng/302910.htm
7. NVIDIA: NVIDIA CUDA - Compute Unified Device Architecture (2008), http://www.nvidia.com/object/cuda_home.html
8. Pennebaker, W.B., Mitchell, J.L.: JPEG Still Image Data Compression Standard. Kluwer Academic Publishers, Norwell (1992)
9. Arai, Y., Agui, T., Nakajima, M.: A Fast DCT-SQ Scheme for Images. Transactions of IEICE E71, 1095–1097 (1988)
10. Howard, P.G., Vitter, J.S.: Parallel lossless image compression using Huffman and arithmetic coding. Information Processing Letters 59, 65–73 (1996)
11. Crochemore, M., Rytter, W.: Jewels of stringology. World Scientific Publishing Co. Inc., River Edge (2003)
12. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing 13, 600–612 (2004)
Linear Time Constant-Working Space Algorithm for Computing the Genus of a Digital Object Valentin E. Brimkov1 and Reneta Barneva2 1
Mathematics Department, SUNY Buffalo State College, Buffalo, NY 14222, USA
[email protected] 2 Department of Computer Science, SUNY Fredonia, NY 14063, USA
[email protected]
Abstract. In recent years the design of space-efficient algorithms that work within a limited amount of memory is becoming a hot topic of research. This is particularly crucial for intelligent peripherals used in image analysis and processing, such as digital cameras, scanners, or printers, which are equipped with considerably less memory than usual computers. In the present paper we propose a constant-working space algorithm for determining the genus of a binary digital object. More precisely, given an m × n binary array representing the image, we show how one can count the number of holes of the array with an optimal number of O(mn) integer arithmetic operations and optimal O(1) working space. Our considerations cover the two basic possibilities for object and hole types determined by the adjacency relation adopted for the object and for the background. The algorithm is based on a combinatorial relation between certain characteristics of a digital picture. Keywords: Digital geometry, digital picture, hole, genus, connected component.
1  Introduction
In recent years developing specialized algorithms and software for intelligent peripherals is becoming increasingly important. This is crucial for peripherals used in image analysis and processing—such as digital cameras, scanners, or printers—where the size of the problem input may be huge. At the same time, such kinds of devices are equipped with considerably less working space than usual computers. Therefore, a hot topic of research is the design of space-efficient algorithms, that is, ones working within a limited amount of memory. Of special interest are algorithms whose working space size is limited by a constant (preferably, not too large). Some general terminology and theoretical foundations have already been introduced in relation to the work on several specific problems (see, for example, [1,2,3,4,5,6,7,11,12,15,16,17,20], among others). A number of diverse computation models have been considered. For a short discussion on these and related matters the reader is referred to [1,2]. For example, in the so-called in-place algorithms, the input data are given by a constant number of arrays that can be used (under some restrictions) as working space for
the algorithm. Another (more restrictive) model assumes that the input data are given as a read-only array whose values cannot be changed during the algorithm's execution, and the algorithm can use working space of constant size, i.e., size that does not depend on the input size. (Strictly speaking, in complexity theory, algorithms with constant working space correspond to the class DSPACE(1), which is very limited; therefore, in-place algorithms also include algorithms from the complexity class L, which consists of the problems requiring O(log n) additional space, where n is the size of the problem input (see [14]). Note that on a real computer storing an integer k requires only a small fixed amount of space, while theoretically O(log k) bits are required to store the integer k.) In this paper we conform to this last model of actual constant-working space algorithms. We propose a constant-working space algorithm for determining the genus of a binary digital object. More precisely, given an m × n binary array representing the image, we show how one can count the number of holes of the array with an optimal number of O(mn) integer arithmetic operations and optimal O(1) working space. Our considerations cover the two basic possibilities for object and hole types determined by the adjacency relation adopted for the object and for the background, i.e.: (a) 0-connected object with 1-connected holes; (b) 1-connected object with 0-connected holes (0/1-connectedness in the grid-cell model corresponds to 8/4-connectedness in the grid-point model; see [18]). The genus of an object is a basic topological invariant providing important information about the object topology (e.g., the degree of its connectedness). Therefore, important applications are expected in image analysis, computer vision, computer graphics, as well as the design of built-in software. Investigations on related problems (for example, about Euler number computation) have already been carried out (see, e.g., [13]). To the best of our knowledge, however, the present article provides for the first time a rigorous mathematical analysis of the space and time complexity of the computation. The paper is organized as follows. In the next section we introduce some basic notions and results to be used in the sequel, and with their help we formally state the considered problem about computing the genus of a digital object. In Section 3 we prove a subsidiary fact that is instrumental in our computation. In Section 4 we present our main results. We conclude with some remarks in Section 5.
2  Preliminaries
In this section we introduce some basic notions of digital geometry to be used in the sequel. We conform to terminology used in [18] (see also [19,21]).
2.1  Some Basic Notions of Digital Topology
All considerations take place in the grid cell model that consists of the grid cells of Z^2, together with the related topology. In the grid cell model we represent 2-cells as squares, called pixels. Their edges and vertices are 1-cells and 0-cells,
respectively. For every i = 0, 1, 2, the set of all cells of dimension i (or i-cells) is denoted by C_2^(i). Further, we define the space C_2 = C_2^(0) ∪ C_2^(1) ∪ C_2^(2). We say that two 2-cells e, e′ are k-adjacent for k = 0 or k = 1 if they share a k-cell. Two 2-cells are strictly k-adjacent if they are k-adjacent but not (k + 1)-adjacent. A k-adjacency relation is usually denoted by A_k. A digital object S ⊂ C_2 is a finite set of 2-cells. A k-path (where k = 0 or k = 1) in S is a sequence of pixels from S such that every two consecutive pixels on the path are k-adjacent. Two pixels of a digital object S are k-connected (in S) iff there is a k-path in S between them. A subset G of S is k-connected iff there is a k-path connecting any two pixels of G. The maximal (by inclusion) k-connected subsets of a digital object S are called k-(connected) components of S. Components are nonempty, and distinct k-components are disjoint. Clearly, an object may be 0-connected but not 1-connected.
2.2  Holes and Genus. Problem Statement
Let S be a finite digital object in C_2 and S̄ its complement to the whole space C_2 (S̄ is sometimes called the background for S). It is clear (and well-known) that S̄ has exactly one infinite connected component with respect to an adjacency relation A_k (k = 0 or 1) and, possibly, a number of finite components. The latter are called k-holes of S, or also the connectivity of S. The number of holes of an object A ⊆ C_2 (or A ⊆ R^2) is equivalent to the genus of A, which is the minimal number of "cuts" that makes the set simply connected (i.e., homeomorphic to the unit disc). Holes and genus are also defined for any set A ⊆ R^n as well as for digital objects in any dimension. For a detailed account of the matter the reader is referred to [18]. Note that different adjacencies may be used in defining the connectedness of an object and its background (e.g., A_0 for S and A_1 for S̄, or vice versa). In fact, this is often the preferred approach, since 0- and 1-adjacencies form a good pair of adjacency relations. Basically, good pairs characterize separation of the object holes from the infinite background. For details and a discussion about the usefulness of this notion we refer to [8,9,18,19]. Here we mention it only to justify the consideration of the following two meaningful cases:
(A) Given a 0-connected digital object S, one looks for the number of all 1-holes of S (i.e., the number of the finite 1-connected components of S̄).
(B) Given a 1-connected digital object S, one looks for the number of all 0-holes of S (i.e., the number of the finite 0-connected components of S̄).
See Figure 1 for an illustration. In Section 4 we show how one can solve the above problems in linear time and constant working space. Before that, in the next section we prove some subsidiary technical results.
Fig. 1. Left: 0-connected object with seven 1-connected holes. Right: 1-connected object with three 0-connected holes.
3  Subsidiary Technical Results
Let S be a digital object consisting of p pixels (2-cells). Denote by P_S the rectilinear polygon obtained as a union of all 2-cells of S. Let ∂(P_S) be the boundary of P_S. Following [10], a vertex (0-cell) or an edge (1-cell) of S is called free iff it belongs to ∂(P_S). Otherwise it is called non-free. (See Figure 2, left.) Let v and e be the number of vertices and edges of S, and let v = v* + v′, e = e* + e′, where v*, v′, e*, and e′ denote the number of the free vertices, non-free vertices, free edges, and non-free edges of S, respectively. Denote by B and b the number of 2 × 2 and 2 × 1 blocks in S, respectively (see Figure 2, middle). It is easy to see that we have the following equalities:
v′ = B, e′ = b.  (1)
We say that S has a gap located at a vertex (0-cell) x ∈ S ∩ C_2^(0) if there are exactly two strictly 0-adjacent pixels p1, p2 ∈ S with p1 ∩ p2 = x (see Figure 2, right).
Fig. 2. Left: A digital object whose free vertices and edges are in bold. Middle: 2 × 1 and 2 × 2 blocks. Right: A gap pointed by an arrow.
Denote by g the number of gaps in S. The following fact was proved in [10]. Proposition 1. Let S be a digital object. Then g = e∗ − v ∗ ,
(2)
where e∗ and v ∗ denote the number of free edges and free vertices of S, respectively. In [10] the above theorem is proved by induction. For this, all possible 3 × 3 different configurations of pixels have been examined, which made the proof too long. Here we provide a much shorter proof based on graph-theoretic approach. Note that both proofs apply to objects that have arbitrarily many components. Proof of Proposition 1 Every boundary vertex of S is incident to either two or four boundary edges, the latter being the case if the vertex locates a gap; otherwise, the former case holds. Since each boundary vertex is incident to an even number of edges, there must be an Eulerian cycle consisting of free edges belonging to ∂(PS ), as each such edge is used in the cycle exactly once (Figure 3). Along that cycle every edge and every vertex that does not expose a gap are counted once, while free vertices that expose gaps are counted twice. Since edges and vertices alternate on the cycle, we obtain e∗ = v ∗ + g, which completes the proof.
Fig. 3. Illustration to the proof of Proposition 1. Arrows trace an Eulerian cycle.
Proposition 1 implies the following corollary. Corollary 1. Let S be a digital object. Then the following combinatorial relation holds: h − c + p − b + B − g = 0, (3)
where h is the number of holes, c is the connectivity, p is the number of pixels, b is the number of 2 × 1 blocks, B is the number of 2 × 2 blocks, and g is the number of gaps of S. Proof: Consider the planar graph G_S(V, E) defined as follows: (i) the elements of the set V of vertices of G_S are labeled by the vertices (0-cells) of S, and (ii) the edges of G_S are the edges (1-cells) of S. Applying to G_S the well-known Euler formula, we obtain v − e + f = 1 + c, where f is the number of faces of G_S. Now we observe that, in fact, f counts the pixels of S and its holes, plus one more unit for the infinite background of S, i.e., we have v − e + p + h + 1 = 1 + c. Since e = e* + e′ and v = v* + v′, we obtain v* + v′ − e* − e′ + p + h = c. Then equality (3) follows by substitution from (1) and (2).
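For instance, if S is a single 2 × 2 block of four pixels, then p = 4, b = 4, B = 1, g = 0, c = 1 and h = 0, and indeed h − c + p − b + B − g = 0 − 1 + 4 − 4 + 1 − 0 = 0.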
We will use the above corollary in the computation of the number of holes in a digital object.
4  Counting Holes
Combinatorial relation (3) suggests that in order to compute the number of holes of a connected digital object S, it suffices to know the number of pixels in S and to compute the number b of 2 × 1 blocks, the number B of 2 × 2 blocks, and the number g of gaps. This can be done as follows. Given an input binary array A[m, n], one scans it row by row and counts the parameters of interest. This can be achieved, e.g., by the following procedure.

procedure Count();
  p := 0; b := 0; B := 0; g := 0;
  for j := 1 to n + 1 do A[m + 1, j] := 0;   { zero padding below the array }
  for i := 1 to m do A[i, n + 1] := 0;       { zero padding to the right of the array }
  for i := 1 to m do
    for j := 1 to n do
      if A[i, j] = 1 then
        begin
          p := p + 1; c := 0;
          if A[i + 1, j] = 1 then begin b := b + 1; c := c + 1 end;
          if A[i, j + 1] = 1 then begin b := b + 1; c := c + 1 end;
          if c = 2 and A[i + 1, j + 1] = 1 then B := B + 1;
          if c = 0 and A[i + 1, j + 1] = 1 then g := g + 1
        end
      else if A[i + 1, j] = 1 and A[i, j + 1] = 1 and A[i + 1, j + 1] = 0 then
        g := g + 1;   { gap with the two pixels of S on the anti-diagonal }
end
Consider first version (A) of the problem, where S is 0-connected and we are looking for the number h of 1-holes of S. Having p, b, B, and g computed, we find the number of holes by a direct application of formula (3) with c = 1:
h = 1 − p + b − B + g.  (4)
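For example, if S is the ring of eight pixels surrounding a single missing pixel in a 3 × 3 square, then p = 8, b = 8, B = 0 and g = 0, and (4) yields h = 1 − 8 + 8 − 0 + 0 = 1, i.e., exactly the one 1-hole formed by the missing central pixel.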
Now consider version (B) of the problem, where S is 1-connected and we are looking for the number h of 0-holes of S. With the help of Figure 4, one can easily realize that:
– If q is a 0-connected hole of S and q features m gaps, then it consists of m + 1 1-connected components (see Figure 4, right);
– If q is a 0-connected set of cells of S̄ with m gaps that is 0-connected to the infinite background, then clearly q is not a 0-hole of S (see Figure 4, left).
Hence, the number of 0-holes of S is found as
h = (1 − p + b − B + g) − g = 1 − p + b − B.  (5)
The procedure described above is linear in the number of pixels of S. The working space consists of the fields in which one stores the current values of the few counters used in formulas (4) and (5). As a last remark we would like to mention a recent work by Asano and Buzer [4], that provides an O(n^2 log n) time constant-working space algorithm for computing the 1-connected components of a digital object of size n × n. The algorithm
Fig. 4. Left: 0-connected set of background pixels (in grey). It does not constitute a 0-hole since it is 0-connected to the rest of the infinite background. It consists of four 1-components and contains four gaps (one of them at the point where the set is connected with the rest of the infinite background). Right: 0-connected set of background pixels (in grey) that appears to be a 0-hole. It consists of four 1-components and contains three gaps.
works under the same model adopted in the present paper. Note: that algorithm may require Ω(n^2 log n) operations even in the case of a single component. Combining the above-mentioned result from [4] and those of the present paper, one obtains an O(n^2 log n) time constant-working space algorithm for computing the number of holes of a not necessarily connected binary object.
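For illustration, the whole computation of this section can be rendered as the following compact C sketch (our own code, not the authors'; the image is accessed through a bounds-checked helper instead of the zero padding used in procedure Count(), so the input array itself stays read-only):

```c
/* Returns 0 outside the m x n array; replaces the zero padding of Count(). */
static int pix(const unsigned char *A, int m, int n, int i, int j)
{
    return (i >= 0 && i < m && j >= 0 && j < n) ? A[i * n + j] : 0;
}

/* Hole count for a connected object S stored row-major in A:
   formula (4) for case (A), formula (5) for case (B).
   Only a few counters are used, i.e., constant working space. */
int count_holes(const unsigned char *A, int m, int n, int caseA)
{
    long p = 0, b = 0, B = 0, g = 0;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            int c00 = pix(A, m, n, i, j),     c01 = pix(A, m, n, i, j + 1);
            int c10 = pix(A, m, n, i + 1, j), c11 = pix(A, m, n, i + 1, j + 1);
            if (c00) {
                p++;
                b += c10 + c01;                     /* 2x1 blocks        */
                if (c10 && c01 && c11) B++;         /* 2x2 blocks        */
                if (!c10 && !c01 && c11) g++;       /* diagonal gap      */
            } else if (c10 && c01 && !c11) {
                g++;                                /* anti-diagonal gap */
            }
        }
    return caseA ? (int)(1 - p + b - B + g)         /* formula (4) */
                 : (int)(1 - p + b - B);            /* formula (5) */
}
```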
5  Concluding Remarks
In the present paper we have proposed a linear time constant-working space algorithm for determining the genus of a connected binary digital object. The computation is based on a combinatorial relation for digital pictures that may also be of independent interest. The algorithm is also applicable to the case of digital objects with more than one connected component, provided that the number of components is known in advance. A challenging task is to construct an equally time- and space-efficient algorithm for the case of objects with unknown connectivity.
Acknowledgements. The authors are indebted to the three anonymous referees for a number of useful comments. The first author thanks Tetsuo Asano for a useful discussion on the usefulness of constant-working space algorithms.
References 1. Asano, T.: Constant-working space algorithm for image processing. In: Proc. of the First AAAC Annual Meeting, Hong Kong (April 2008) (to appear) 2. Asano, T.: Constant-working space image scan with a given angle. In: Proc. of the 24th European Workshop on Computational Geometry, Nancy, March 2008, pp. 165–168 (2008) 3. Asano, T., Biotu, S., Motoki, M., Usui, N.: In-place algorithm for image rotation. In: Tokuyama, T. (ed.) ISAAC 2007. LNCS, vol. 4835, pp. 704–715. Springer, Heidelberg (2007) 4. Asano, T., Buzer, L.: Constant-working space algorithm for connected components counting with extension, personal communication (to appear, 2008) 5. Asano, T., Tanaka, H.: Constant-working space algorithm for connected components labeling, IEICE Technical Report, Special Interest Group on Computation, Japan, IEICE-COMP2008-1, vol. 108(1), pp. 1–8 (2008) 6. Asano, T., Tanaka, H.: Constant-working space algorithm for Euclidean distance transform, IEICE Technical Report, Special Interest Group on Computation, Japan, IEICE-COMP2008-2, vol. 108(1), pp. 9–14 (2008) 7. Blunck, H., Vahrenhold, J.: In-place algorithms for computing (layers of) maxima. In: Arge, L., Freivalds, R. (eds.) SWAT 2006. LNCS, vol. 4059, pp. 363–374. Springer, Heidelberg (2006) 8. Brimkov, V.E., Klette, R.: Curves, hypersurfaces, and good pairs of adjacency ˇ c, J. (eds.) IWCIA 2004. LNCS, vol. 3322, pp. 276– relations. In: Klette, R., Zuni´ 290. Springer, Heidelberg (2004)
9. Brimkov, V.E., Klette, R.: Border and surface tracing - theoretical foundations. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(4), 577–590 (2008) 10. Brimkov, V.E., Maimone, A., Nordo, G.: Counting gaps in binary pictures. In: Reulke, R., Eckardt, U., Flach, B., Knauer, U., Polthier, K. (eds.) IWCIA 2006. LNCS, vol. 4040, pp. 16–24. Springer, Heidelberg (2006) 11. Br¨ onniman, H., Chan, T.M.: Space-efficient algorithms for computing the convex hull of a simple polygonal line in a linear time. In: Farach-Colton, M. (ed.) LATIN 2004. LNCS, vol. 2976, pp. 162–171. Springer, Heidelberg (2004) 12. Br¨ onniman, H., Iacono, J., Katajainen, J., Morin, P., Morrison, J., Toussaint, G.: In-place planar convex hull algorithms. In: Rajsbaum, S. (ed.) LATIN 2002. LNCS, vol. 2286, pp. 197–205. Springer, Heidelberg (2002) 13. Chen, M.-H., Yan, P.-F.: A fast algorithm to calculate the Euler number for binary image. Pattern Recognition Letters 8(5), 295–297 (1988) 14. Garey, M.S., Johnson, D.S.: Computers and Intractability: a Guide to the Theory of NP-Completeness. Freeman & Co., San Francisco (1979) 15. Geffert, V., Katajainen, J., Pasanen, T.: Asymptotically efficient in-place merging. Theoretical Computer Science 237(1-2), 159–181 (2000) 16. Katajainen, J., Pasanen, T.: In-place sorting with fewer moves. Information Processing Letters 70(1), 31–37 (1999) 17. Katajainen, J., Pasanen, T., Titan, G.: Sorting multisets stably in minimum space. Acta Informatica 31(4), 301–313 (1994) 18. Klette, R., Rosenfeld, A.: Digital Geometry - Geometric Methods for Digital Picture Analysis. Morgan Kaufmann, San Francisco (2004) 19. Kong, T.Y.: Digital topology. In: Davis, L.S. (ed.) Foundations of Image Understanding, pp. 33–71. Kluwer, Boston (2001) 20. Pasanen, T.: In-place algorithms for sorting problems. ACM SIGACT News 30(2), 61 (1999) 21. Rosenfeld, A.: Connectivity in digital pictures. Journal of the ACM 17(3), 146–160 (1970)
Offset Approach to Defining 3D Digital Lines Valentin E. Brimkov1, Reneta P. Barneva2, Boris Brimkov3, and François de Vieilleville1
Mathematics Department, SUNY Buffalo State College, Buffalo, NY 14222, USA 2 Department of Computer Science, SUNY Fredonia, NY 14063, USA 3 Gifted Math Program, University at Buffalo, Buffalo, NY 14260-1000, USA
Abstract. In this paper we investigate an approach of constructing a 3D digital line by taking the integer points within an offset of a certain radius of the line. Alternatively, we also investigate digital lines obtained through a "pseudo-offset" defined by a parallelepiped enclosing the integer points around the line. We show that if the offset radius (resp. side of the parallelepiped section) is greater than √3 (resp. 2√3), then the digital line is at least 1-connected. Extensive experiments show that the lines obtained feature satisfactory appearance. Keywords: Digital geometry, digital line, connectivity of digital object, line offset.
1  Introduction
Digital lines appear to be basic primitives in digital geometry. They have been used in volume modeling as well as in certain algorithms for image analysis, in particular in designing discrete multigrid convergent estimators [5,10,19]. While in two dimensions fully satisfactory definitions of digital lines are available (see, e.g., [18] and the bibliography therein), this is not exactly the case in three dimensions. For example, about some early definitions (based, e.g., on the so-called supercover), Andres points out the possible existence of "bubbles," which can be viewed as a sort of line defect [1]. Structural definitions and algorithms to digitize a 3D line in R^3 are proposed, e.g., in [6,14]. However, the digital line models proposed there do not admit an analytic description, as in conventional analytic geometry. Over the last fifteen years, a number of authors arrived at essentially equivalent definitions, based on the approach of projections [8,11]. Within that approach, one must first find the projections of the line onto two (appropriately chosen) coordinate planes, then define the corresponding 2D digital lines over the integer grids in those planes, and finally reconstruct the 3D digital line from these projections. In fact, the possibility to analytically define the two 2D digital lines immediately provides an analytical definition of the 3D digital line as well. The so-obtained digital line possesses certain desirable properties, such as 0-connectivity, minimality (meaning that the removal of a voxel from the line makes it disconnected), and closest approximation to the continuous line [19]. As already
mentioned, however, this definition is not as perfect as its two-dimensional counterpart. For example, unlike in 2D, the 3D definition does not admit easy extension assuring 1- or 2-connectedness. Although 2-connected lines can be defined using the so-called standard model [1], this approach does not work for assuring strict 1-connectedness. Knowledge about different types of connectivity for a given digital object may be useful for certain applications (e.g., design of multigrid convergent estimators) or for theoretical research within certain framework (e.g., 2-connectivity when using Khalimsky spaces). Note also that the projection of a 3D digital line on the third coordinate plane is not necessarily a 2D digital line. Let us remark at this point that the basic approach to modeling curves and surfaces in computer aided geometric design is the one based on computing the offset of a curve/surface Γ – the locus of points at a given distance d from Γ (see, e.g., [3,4,13] and the bibliography therein). As a rule, such sort of considerations resort to results from algebraic geometry, such as Grobner basis computation (see, e.g., [7,2]). Note that usually, when looking for a curve offset defined by a distance d, one does not really care about the properties of the set of integer points enclosed by the offset. Exceptions (in 2D) are provided, e.g., by some recent works that define digital conics [9,12]. In view of the above discussion, it would indeed be quite natural to define a digital straight line by the set of integer points within the offset. In fact, this is the idea of the definition of a 2D digital line, which consists of the integer points that belong to a strip determined by two parallel lines at an appropriate distance from each other. The authors assume that this idea has already been exploited by other researchers. We suppose that the reason for the lack of publications about possible results of such an approach may be rooted in the fact that these researchers have faced difficulties to obtain satisfactory results. While obtaining the offset for a straight line is much easier than for, e.g., a quadratic curve, it might not be so trivial to analyze the properties of the digital set defined by the offset. Once such an analysis is performed, it may still be not so easy to decide how useful such a digital line may be for practical applications. Anyway, in our opinion, it would be good to know more about the pros and cons of a 3D digital line obtained by a line offset. To some extent this research can also be seen as a first attempt to generalize the blurred segments introduced by Debled-Rennesson [10] which provide an approach to analyzing noisy digital objects with arithmetic methods. In this paper we present some results of our investigation on this problem. We show that 3D lines obtained through an offset may be of interest regarding certain applications in visual computing. In particular, by choosing an appropriate radius of the offset, one can guarantee, e.g., 1-connectivity of the digital line. The paper is organized as follows. In the next section we recall some basic definitions of digital geometry. In Section 3, we investigate the 3D digital line obtained by a cylindrical offset of the continuous line. In Section 4, we consider a “pseudo-offset,” which is a special type of a parallelepiped. We support our
discussion with illustrations and experimental results obtained by a computer program and an original visualization system, which are briefly described in Section 4. We conclude with some final remarks in Section 5.
2  Preliminaries
In this section, we introduce some basic notions of digital geometry to be used in the sequel. We conform to terminology used in [15] (see also [16,17]). Our considerations take place in the grid cell model which consists of the grid cells of Z^3, together with the related topology. In this model, one represents 3-cells as cubes, called voxels. A voxel can be identified with its center. The facets, edges, and vertices of a voxel are also called 2-cells, 1-cells, and 0-cells, respectively. We say that two voxels v, v′ are k-adjacent for k = 0, 1, or 2, if they share a k-cell. (Equivalently, instead of the terms 0-, 1-, and 2-adjacency, one can use the terms 26-, 18-, and 6-adjacency.) A 3D digital object S is a finite set of voxels. A k-path (where k = 0 or k = 1) in S is a sequence of voxels from S such that every two consecutive voxels on the path are k-adjacent. Two voxels of a digital object S are k-connected (in S) iff there is a k-path in S between them. A subset G of S is k-connected iff there is a k-path connecting any two voxels of G. If G is not k-connected, we will say that it is k-disconnected. A maximal (by inclusion) k-connected subset of a digital object S is called a k-(connected) component of S. Components are non-empty and distinct k-components are disjoint.
3  Cylindrical Offset
Consider a line L in R^3. Without loss of generality, assume that L goes through the origin, i.e., it is defined by
x = at, y = bt, z = ct, t ∈ R.  (1)
So, (a, b, c) ∈ R^3 is a vector collinear with L. We will assume that the vector (a, b, c) is rational. Then, without loss of generality, we can suppose that (a, b, c) is integer. We can also assume that
0 ≤ a ≤ b ≤ c.  (2)
Clearly, the r-offset of L is a cylinder C(L, r) with axis L and radius r; see Figure 1. Using standard calculus, it is not hard to obtain that C(L, r) is defined by the formula

(x − at)² + (y − bt)² + (z − ct)² = r²,    (3)

where t = (ax + by + cz)/(a² + b² + c²). This formula simplifies to

(b² + c²)x² + (a² + c²)y² + (a² + b²)z² − 2abxy − 2bcyz − 2acxz − (a² + b² + c²)r² = 0.    (4)
Fig. 1. A cylindrical offset of a straight line in R³
The r-offset (4) of line L provides the following definition of a digital line D(L) of thickness r:

D(L) = {(x, y, z) ∈ Z³ : (b² + c²)x² + (a² + c²)y² + (a² + b²)z² − 2abxy − 2bcyz − 2acxz − (a² + b² + c²)r² ≤ 0}.

An important question is how large a radius has to be in order to guarantee connectivity of a digital line D(L). Non-trivial theoretical tasks are to determine the minimal values of r, as a function of the coefficients a, b, and c, for which D(L) is always 0-, 1-, or 2-connected. This will be the subject of future work. At this point we are mostly interested in assuring at least 1-connectivity of the line and in experimentally testing the quality of its appearance. As theoretical background we can state the following proposition.

Proposition 1. Let D(L) be a digital line defined by a cylindrical offset of radius r. If r ≥ √3, then D(L) is at least 1-connected.

The proof of the above proposition is based on the following plain facts.

Fact 1. A closed ball B with radius √3 contains a grid cell (i.e., a unit cube with 8 integer vertices).

Proof. This follows from the well-known fact that the diameter of a unit cube equals √3 (that is, the length of its longest diagonal). So, any point of such a cube is at a distance of at most √3 from any of its vertices. Therefore, a ball B of radius √3 centered at that point will contain a grid cell.

Fact 2. Let M be the set of integer points that belong to a closed ball B with radius √3. Then M is at least 1-connected.
Proof. Assume that M has at least two 1-components. By Fact 1, there is a 1-component M1 of M that contains a grid cube C as a subset (proper or non-proper). Let M2 be another 1-component of M. Then, any element of M2 is at a distance of at least √3 from any element of M1. Let w be an element of M2 at minimal distance from M1. Consider the limit case when that distance is √3, which is reached for a point u ∈ C and the point w ∈ M2. This may happen when, for some vertex x of C, u is the midpoint of the segment with end-points x and w (see Figure 2, left). In this case there is a ball B̄ centered at u that contains eight grid cubes forming a (2 × 2 × 2)-cube C̄ (see Figure 2, right). Obviously, the vertices of these cubes are the only integer points within B̄, and they are all 2-connected.

Fig. 2. Illustration to the proof of Proposition 1

Assume now that there is no element of M2 at a distance √3 from u. Then the next smallest possible distance is equal to two, which is reached for a configuration such as the one shown in Figure 2, left. With reference to that figure, one can conclude that in such a case vertex y will belong to the ball B̄. (This follows from the folklore fact that the polytope obtained as the convex hull of a set of points contained in a convex set A is contained in A.) Thus, vertex v will be 2-connected to vertex u. It is not hard to deduce that any vertex that is not 0-adjacent to u and that is not at distance two from u (as v above) will be at a distance greater than √3 from u, so it will not belong to B̄. This completes the proof.

Proof of Proposition 1. Moving a ball of radius √3 from point (0, 0, 0) towards point (a, b, c) along the line L, Fact 2 implies that at any point one encounters an at least 1-connected set of integer points within the moving ball. This implies that the set of integer points contained in the obtained r-offset is 1-connected as well.

One can reasonably expect that r = √3, which guarantees connectivity of D(L), is not the smallest radius with that property. We have studied this point by performing extensive experiments. To this end, we tested all possible samples (a, b, c)
for 1 ≤ a, b, c ≤ 100, provided condition (2). As suggested by Proposition 1, for r = √3 the line D(L) is always 1-connected (and therefore also 0-connected). Note that it turned out that D(L) is 2-connected as well. Moreover, the experiments showed that if r is chosen to be equal to √2 or to 1 rather than to √3, then D(L) is still always 2-connected (and thus also 1- and 0-connected).
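To make this kind of experiment concrete, the following Python sketch (an illustration, not the authors' test program) collects the voxels of D(L) from inequality (4) for a finite piece of the line and checks k-connectivity by breadth-first search. The bound t_max and the brute-force enumeration over a bounding box are arbitrary implementation choices.

```python
import numpy as np
from collections import deque

def offset_line_voxels(a, b, c, r, t_max=3.0):
    """Integer points of D(L) whose orthogonal projection onto L falls in
    [0, t_max], obtained by brute force from the offset inequality (4)."""
    n2 = a * a + b * b + c * c
    m = int(np.ceil(t_max * max(a, b, c) + r)) + 1
    voxels = []
    for x in range(-m, m + 1):
        for y in range(-m, m + 1):
            for z in range(-m, m + 1):
                q = ((b*b + c*c)*x*x + (a*a + c*c)*y*y + (a*a + b*b)*z*z
                     - 2*a*b*x*y - 2*b*c*y*z - 2*a*c*x*z - n2*r*r)
                t = (a*x + b*y + c*z) / n2      # foot of the perpendicular
                if q <= 0 and 0.0 <= t <= t_max:
                    voxels.append((x, y, z))
    return voxels

def is_k_connected(voxels, k):
    """Breadth-first search test of k-connectivity (k = 0, 1, 2)."""
    cells = set(voxels)
    if not cells:
        return True
    max_nonzero = {2: 1, 1: 2, 0: 3}[k]          # coords allowed to differ
    start = next(iter(cells))
    seen, queue = {start}, deque([start])
    while queue:
        x, y, z = queue.popleft()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nz = (dx != 0) + (dy != 0) + (dz != 0)
                    nb = (x + dx, y + dy, z + dz)
                    if 1 <= nz <= max_nonzero and nb in cells and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
    return len(seen) == len(cells)

# By Proposition 1, e.g. is_k_connected(offset_line_voxels(7, 11, 3, 3**0.5), 1)
# is expected to return True.
```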
4 Parallelepiped Pseudo-offset
In this section we consider a 3D digital line that consists of the integer points within a parallelepiped centered about the line; we call it a parallelepiped pseudo-offset of L. As before, let L be a straight line in R³ defined by (1). The parallelepiped P(L, s) is such that: (a) the sections of P(L, s) that are perpendicular to L are squares with side s; (b) a pair of parallel walls of P(L, s) are perpendicular to the coordinate plane Oxy. See Figure 3.
Fig. 3. A parallelepiped pseudo-offset of a straight line in R³
One can see that, in general, the equations of the two vertical border planes that determine P(L, s) are of the form −bx + ay = −e and −bx + ay = e for a certain e. They determine a vertical space-strip defined by

−e < −bx + ay ≤ e,  z ∈ R.    (5)

The specific value of e depends on the length s of the side of the square section of P.
The other pair of (non-vertical, mutually parallel) planes have the form ax + by − ((a² + b²)/c)z − f = 0 and ax + by − ((a² + b²)/c)z + f = 0 for a certain f, which depends on the value of s. These two planes define the space strip

ax + by − ((a² + b²)/c)z − f ≤ 0,   ax + by − ((a² + b²)/c)z + f ≥ 0.    (6)

Proposition 1 implies that if s = 2√3, then the digital line D(L) obtained is always at least 1-connected. Using standard calculus, one obtains that such a value of s occurs if the parameters e and f in (5) and (6) are chosen as follows:

e = √3 · √(a² + b²),   f = √3 · √(a² + b² + ((a² + b²)/c)²).
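For completeness, a membership test for P(L, s) can be written directly from the two strips (5) and (6). In the sketch below (illustrative only, assuming c ≠ 0), the widths e and f are passed as parameters, since their values depend on the chosen s as discussed above.

```python
def in_pseudo_offset(p, a, b, c, e, f):
    """Test whether an integer point p = (x, y, z) lies in the parallelepiped
    pseudo-offset P(L, s), via the vertical strip (5) and the slanted strip (6).
    e and f are the strip half-widths corresponding to the chosen side s."""
    x, y, z = p
    vertical = -e < (-b * x + a * y) <= e                 # strip (5)
    w = a * x + b * y - ((a * a + b * b) / c) * z         # strip (6)
    slanted = (w - f <= 0) and (w + f >= 0)
    return vertical and slanted
```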
As before, we also tested the connectivity of D(L) for other values of s. Since by construction C(L, r) ⊂ P(L, s) for s = 2r, connectivity of a line obtained by a cylindrical offset implies connectivity of the corresponding parallelepiped pseudo-offset (e.g., for s = 2√2 and s = 2). Another test was performed for s = 2b/√(a² + b²) (which is the case for e = b and f = b · √(a² + b² + ((a² + b²)/c)²)/√(a² + b²)). Then the projection of the parallelepiped P on the xy-coordinate plane is a strip such that the distance between the two parallel border lines along the y-axis equals 2. The experiment showed that in this case the obtained digital line D(L) is always 6-connected.
5 Appearance of Offset Lines
As already discussed, connectivity of the digital lines introduced in the present paper was tested experimentally. A basic objective was also to test how the above-defined digital lines appear on the computer screen, with regard to possible practical applications. For this purpose, we have developed a computer system which visualizes 3D digital lines obtained through various available definitions, together with the adjacency relations between the line voxels. The system allows one to easily vary the different parameters defining the lines. Our software uses Qt for the GUI, OpenGL primitives to draw the lines, as well as some other tools, such as Qhull for convex hull computation. For each investigated class of lines (defined by a certain offset radius/pseudo-offset side), 10 000 samples have been tested, for line coefficients ranging from 1 to 100. The results concerning connectivity issues have already been commented on in the previous sections. As far as line appearance is concerned, it turned out to be quite satisfactory, as demonstrated in Figures 4 and 5. In particular, this is the case when √3- or √2-offsets are used. Specifically, these lines are comparatively thick; therefore, in practice, bubbles are not apparent as, e.g., in the case of lines obtained through the supercover approach [1]. Note that if the radius/side of a cylindrical/parallelepiped pseudo-offset is increased, the enclosed set of integer points tends to approximate the shape of
Fig. 4. A 3D digital line obtained by a cylindrical offset for a = 7, b = 11, c = 3. The dots mark the centers of the voxels involved, while the connecting segments visualize the existing 2-adjacencies among voxels. Left: digital line obtained by an offset of radius r = √2. Right: digital line obtained by an offset of radius r = 1.
Fig. 5. A 3D digital line obtained by a parallelepiped pseudo-offset for a = 7, b = 11, c = 3. The dots mark the centers of the voxels involved, while the connecting segments visualize the existing 2-adjacencies among voxels. Left: digital line obtained by a pseudo-offset of side s = 2√2. Right: digital line obtained by a pseudo-offset of side s = 2.
a cylinder/parallelepiped. By rotating the configuration of points in order to see it from different points of view, one can observe that, for larger radii (such as r = √2 or r = √3), the lines obtained by a cylindrical offset are preferable.
6 Concluding Remarks
In the present paper we investigated an approach to constructing a 3D digital line by taking the integer points within an offset of a certain radius. Alternatively, we investigated digital lines obtained through a pseudo-offset defined by a parallelepiped enclosing the integer points around the line. We showed that if the radius (resp. the side of the square section of the parallelepiped) is at least √3 (resp. 2√3), then the digital line is at least 1-connected. In general, the digital lines obtained are “thick.” Thus, from a theoretical point of view, their usefulness should be evaluated differently from lines defined
earlier (e.g., in [5,10,19]). The latter feature a structure that makes them interesting mostly from a digital geometry point of view, which is particularly concerned with optimality issues. Instead, in the present work, our main goal was to investigate the theoretical and practical worth of carrying over to three dimensions an approach that worked perfectly in 2D. Based on extensive experiments, we feel entitled to conclude that the digital lines obtained have quite satisfactory appearance. Provided that a sufficiently large value of the offset radius (e.g., greater than √3) is used, the digital lines are at least 1-connected. Another advantage of the proposed definition is that one can always be aware of how far the integer points are from the continuous line. Note that it is not always easy to draw such a conclusion when using other definitions. Further research will be aimed at answering some theoretical questions. For example, it would be interesting to know the minimal values of the offset radius r (as functions of the continuous line coefficients a, b, and c) for which the digital line D(L) is always 0-, 1-, or 2-connected. An analogous question can be posed about the digital line obtained through a parallelepiped pseudo-offset.
Acknowledgements

The authors are indebted to the three anonymous referees for some useful comments.
References 1. Andres, E.: Discrete linear objects in dimension n: the standard model. Graphical Models 65(1-3), 92–111 (2003) 2. Anton, F.: Voronoi diagrams of semi-algebraic sets, Ph.D. thesis, The University of British Columbia, Vancouver, British Columbia, Canada (January 2004) 3. Anton, F., Emiris, I., Mourrain, B., Teillaud, M.: The offset to an algebraic curve and an application to conics. In: Gervasi, O., Gavrilova, M.L., Kumar, V., Lagan´ a, A., Lee, H.P., Mun, Y., Taniar, D., Tan, C.J.K., et al. (eds.) ICCSA 2005. LNCS, vol. 3480, pp. 683–696. Springer, Heidelberg (2005) 4. Arrondo, E., Sendra, J., Sendra, J.R.: Genus formula for generalized offset curves. J. Pure and Applied Algebra 136(3), 199–209 (1999) 5. Coeurjolly, D., Debled-Rennesson, I., Teytaud, O.: Segmentation and length estimation of 3D discrete curves. In: Bertrand, G., Imiya, A., Klette, R. (eds.) Digital and Image Geometry. LNCS, vol. 2243, pp. 295–313. Springer, Heidelberg (2002) 6. Cohen-Or, D., Kaufman, A.: 3D Line voxelization and connectivity control. IEEE Computer Grpahics & Applications 17(6), 80–87 (1997) 7. Cox, D., Little, J., O’Shea, D.: Using Algebraic Geometry. Springer, New York (1998) 8. Debled-Rennesson, I.: Etude et reconnaissance des droites et plans discrets, Ph.D. thesis, Universit´e Louis Pasteur, Strasbourg, France (1995)
9. Debled-Rennesson, I., Domenjoud, E., Jamet, D.: Arithmetic Discrete Parabolas. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Meenakshisundaram, G., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4292, pp. 480–489. Springer, Heidelberg (2006) 10. Debled-Rennesson, I., R´emy, J.-L., Rouyer-Degli, J.: Linear segmentation of discrete curves into blurred segments. Discrete Applied Mathematics 151(1-3), 122– 137 (2005) 11. Figueiredo, O., Reveill`es, J.-P.: New results about 3D digital lines. In: Proc. Internat. Conference Vision Geometry V, SPIE, vol. 2826, pp. 98–108 (1996) 12. Fiorio, C., Jamet, D., Toutant, J.-L.: Discrete Circles: an Arithmetical Approach Based On Norms. In: Proc. Internat. Conference Vision-Geometry XIV, SPIE, vol. 6066, p. 60660C (2006) 13. Hoffmann, C.M., Vermeer, P.J.: Eliminating extraneous solutions for the sparse resultant and the mixed volume. J. Symbolic Geom. Appl. 1(1), 47–66 (1991) 14. Kim, C.E.: Three dimensional digital line segments. IEEE Transactions on Pattern Analysis and Machine Intellignece 5(2), 231–234 (1983) 15. Klette, R., Rosenfeld, A.: Digital Geometry – Geometric Methods for Digital Picture Analysis. Morgan Kaufmann, San Francisco (2004) 16. Kong, T.Y.: Digital topology. In: Davis, L.S. (ed.) Foundations of Image Understanding, pp. 33–71. Kluwer, Boston (2001) 17. Rosenfeld, A.: Connectivity in digital pictures. Journal of the ACM 17(3), 146–160 (1970) 18. Rosenfeld, A., Klette, R.: Digital starightness – a review. Discrete Applied Mathematics 139(1-3), 197–230 (2004) 19. Toutant, J.-L.: Characterization of the closest discrete approximation of a line in the 3-dimensional space. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Meenakshisundaram, G., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4291, pp. 618–627. Springer, Heidelberg (2006)
Curvature and Torsion Estimators for 3D Curves

Thanh Phuong Nguyen and Isabelle Debled-Rennesson

LORIA, Campus Scientifique - BP 239, 54506 Vandoeuvre-lès-Nancy Cedex, France
{nguyentp,debled}@loria.fr
Abstract. We propose a new torsion estimator for spatial curves, based on results of discrete geometry, that works in O(n log² n) time. We also present a curvature estimator for spatial curves. Our methods use the 3D extension of the 2D blurred segment notion [1]. These estimators can naturally work with disconnected curves.
1 Introduction
Geometric properties of curves are important characteristics to be exploited in geometric processing. They directly lead to applications in machine vision [2] and computer graphics [3]. In the planar case, many applications are based on the curvature property in domains such as curve approximation [4], geometry compression [3], and particularly in corner detection after the pioneering paper of Attneave [5]. In 3D space, torsion and curvature are the most important properties that describe how a spatial curve bends. Several methods have been proposed for torsion estimation. Mokhtarian [6] used Gaussian smoothing to estimate it directly from the torsion formula. Similarly, Kehtarnavaz et al. [7] used a B-spline smoothing technique; Lewiner et al. [3] utilized weighted least-squares fitting techniques. Medina et al. [8] proposed two methods to estimate torsion and curvature values at each point of the curve: the first one utilizes the Fourier transform, the second one is based on least-squares fitting. These methods are applied to the description of arteries in medical imaging. We propose in this paper a novel method for the estimation of local geometric parameters of a spatial curve. It uses a geometric approach and relies on results of discrete geometry on the decomposition of a curve into maximal blurred segments [9,1,10]. This paper presents an extension to 3D of these results. The 3D curvature estimator given in [11] is extended here with the notion of blurred segment, and it permits the study of possibly noisy or disconnected curves. We also propose a new approach to discrete torsion estimation.
This work is supported by ANR in the framework of the GEODIB project, BLAN 06 − 2 134999.
We recall, in Section 2, the 2D definitions and results of [10] that we use. Section 3 presents how to extend these ideas to 3D space. Sections 4 and 5 respectively propose a curvature and a torsion estimator. The last sections give experiments and conclusions.
2 Maximal 2D Blurred Segment of Width ν
The notion of blurred segments relies on the arithmetical definition of discrete lines [12], where a line, with slope a/b, is the set of integer points (x, y) verifying μ ≤ ax − by < μ + ω (a, b, μ and ω being integers and gcd(a, b) = 1). Such a line is denoted by D(a, b, μ, ω). The notion of 2D blurred segment extends the notion of segment of a discrete line and permits more flexibility in operations such as recognition and segmentation of discrete curves. Let us recall the definitions [1,10] that we use in this paper (see also Fig. 1).

Definition 1. Let Sb be a sequence of digital points. A discrete line D(a, b, μ, ω) is said to be optimal for Sb if each point of Sb belongs to D and if its vertical width, (ω − 1)/max(|a|, |b|), is equal to the vertical width of the convex hull of Sb (see Fig. 1.a). Sb is a blurred segment of width ν iff there exists an optimal discrete line D(a, b, μ, ω) of Sb such that (ω − 1)/max(|a|, |b|) ≤ ν.

Let C be a discrete curve and Ci,j a sequence of points of C indexed from i to j. Suppose that the predicate "Ci,j is a blurred segment of width ν" is denoted by BS(i, j, ν).

Definition 2. Ci,j is called a maximal blurred segment of width ν, denoted MBS(i, j, ν), iff BS(i, j, ν), ¬BS(i, j + 1, ν) and ¬BS(i − 1, j, ν) (see Fig. 1.b).

In [10] an algorithm is proposed to decompose a planar curve into maximal blurred segments for a given width, and the theorem below is proved. The algorithm relies on operations of insertion (or deletion) of a point to (or from) the convex hull of the currently studied segment. We have proven that the decomposition of a planar curve with n points into maximal blurred segments of width ν can be done in time O(n log² n).
Fig. 1. From left to right: a. D(5, 8, −8, 11) (blue and grey points) is the optimal discrete line of the sequence of grey points, b. the set of black points is a maximal blurred segment (MBS) of width 2
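As a small illustration of Definition 1 (and not a replacement for the recognition algorithm of [1,10]), the following Python sketch checks whether a given candidate line D(a, b, μ, ω) certifies a point sequence as a blurred segment of width ν.

```python
def is_blurred_segment_for_line(points, a, b, mu, omega, nu):
    """Check that every point of the sequence lies in D(a, b, mu, omega)
    and that the vertical width (omega - 1) / max(|a|, |b|) does not exceed
    nu. This verifies Definition 1 for a *given* candidate line only;
    finding the optimal line requires the incremental convex-hull
    machinery of [1,10], which is not reproduced here."""
    width = (omega - 1) / max(abs(a), abs(b))
    inside = all(mu <= a * x - b * y < mu + omega for (x, y) in points)
    return inside and width <= nu
```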
3 Maximal 3D Blurred Segment of Width ν

3.1 3D Blurred Segment of Width ν
The notion of 3D discrete line (see the references [13,14]) is defined as follows:

Definition 3. A 3D discrete line, denoted D3D(a, b, c, μ, μ′, e, e′), with a main vector (a, b, c) such that (a, b, c) ∈ Z³ and a ≥ b ≥ c, is defined as the set of points (x, y, z) from Z³ verifying

μ ≤ cx − az < μ + e,    (1)
μ′ ≤ bx − ay < μ′ + e′,    (2)

with μ, μ′, e, e′ ∈ Z. e and e′ are called the arithmetical widths of D.

According to the definition, it is obvious that a 3D discrete line is bijectively projected onto two projection planes as two 2D arithmetical discrete lines. Thanks to that property, we naturally define the notion of 3D blurred segment by using the notion of 2D blurred segment and by considering the projections of the sequence of studied points in the coordinate planes (see Fig. 2.a).

Definition 4. Let Sf3D be a sequence of points of Z³. Sf3D is a 3D blurred segment of width ν with a main vector (a, b, c), such that (a, b, c) ∈ Z³ and a ≥ b ≥ c, if it possesses an optimal discrete line D3D(a, b, c, μ, μ′, e, e′) such that
– D(a, b, μ′, e′) is optimal for the sequence of projections of the points of Sf3D in the plane (O, x, y) and (e′ − 1)/max(|a|, |b|) ≤ ν,
– D(a, c, μ, e) is optimal for the sequence of projections of the points of Sf3D in the plane (O, x, z) and (e − 1)/max(|a|, |c|) ≤ ν.

A linear algorithm of 3D blurred segment recognition may be deduced from that definition. Indeed, we only need to use an algorithm of 2D blurred segment recognition in each projection plane.
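Definition 3 reduces membership in a 3D discrete line to two 2D arithmetic inequalities, which can be checked directly. The following sketch is illustrative, with mu2 and e2 standing for μ′ and e′.

```python
def in_3d_discrete_line(p, a, b, c, mu, mu2, e, e2):
    """Membership test for D3D(a, b, c, mu, mu', e, e'): the point belongs
    to the 3D discrete line iff its two projections satisfy the 2D
    arithmetic line inequalities (1) and (2)."""
    x, y, z = p
    in_xz = mu <= c * x - a * z < mu + e       # inequality (1), plane (O, x, z)
    in_xy = mu2 <= b * x - a * y < mu2 + e2    # inequality (2), plane (O, x, y)
    return in_xz and in_xy
```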
3.2 Maximal 3D Blurred Segment of Width ν
In this section, we present an algorithm to obtain the sequence of 3D maximal blurred segments of width ν in time O(n log² n) for any noisy 3D discrete curve C. This sequence is denoted MBS_ν(C) = {MBS_i(B_i, E_i, ν)}_{i=0..m−1}, with B_i (resp. E_i) the index of the first (resp. last) point of the i-th maximal blurred segment MBS_i of C. This algorithm uses an algorithm that determines the 2D maximal blurred segments (see [10]) of the projections of the points of the studied curve in the coordinate planes.
Algorithm 1. Segmentation of a curve C into maximal 3D blurred segments of width ν
Data: C - discrete curve with n points, ν - width of the segmentation
Result: MBS_ν - the sequence of maximal blurred segments of width ν of C
begin
  k := 0; Sb := {C0}; MBS_ν := ∅; a := 0; b := 1; ω := b; μ := 0;
  while the widths of the 2 blurred segments obtained by projecting the points of Sb in the coordinate planes are ≤ ν do
    k++; Sb := Sb ∪ {Ck};
    Determine D3D(a, b, c, μ, μ′, e, e′), the optimal discrete line of Sb; (*)
  end
  bSegment := 0; eSegment := k − 1;
  MBS_ν := MBS_ν ∪ MBS(bSegment, eSegment, ν);
  while k < n − 1 do
    while the widths of the 2 blurred segments obtained by projecting the points of Sb in the coordinate planes are > ν do
      Sb := Sb \ {C_bSegment}; bSegment++;
      Determine D3D(a, b, c, μ, μ′, e, e′), the optimal discrete line of Sb; (*)
    end
    while the widths of the 2 blurred segments obtained by projecting the points of Sb in the coordinate planes are ≤ ν do
      k++; Sb := Sb ∪ {Ck};
      Determine D3D(a, b, c, μ, μ′, e, e′), the optimal discrete line of Sb; (*)
    end
    eSegment := k − 1;
    MBS_ν := MBS_ν ∪ MBS(bSegment, eSegment, ν);
  end
end
(*) To determine the optimal discrete line of the current 3D blurred segment Sb, we consider the characteristics of the two 2D blurred segments obtained in the planes of projection and combine them to obtain the characteristics of the optimal 3D discrete line of Sb. As the whole process is done in dimension 2, this algorithm has the same complexity as the one in dimension 2. So, we have the following result: the decomposition of a 3D curve into maximal blurred segments of width ν can be done in time O(n log² n).
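The overall sliding-window structure of Algorithm 1 can be sketched as follows. The predicate widths_ok stands for the 2D blurred-segment recognition of [10] applied to the two projections with the chosen ν, and is assumed to hold for any single point; it is not reproduced here.

```python
def maximal_blurred_segments(curve, widths_ok):
    """Skeleton of Algorithm 1: slide a window over the curve, extending it
    to the right while the projected widths stay within nu (as reported by
    widths_ok) and shrinking it from the left otherwise. Returns the list
    of (begin, end) index pairs of the maximal blurred segments."""
    n = len(curve)
    if n < 2:
        return [(0, n - 1)] if n else []
    segments = []
    begin, end = 0, 1
    while end < n:
        # extend to the right while the current window is still valid
        while end < n and widths_ok(curve[begin:end + 1]):
            end += 1
        segments.append((begin, end - 1))
        if end == n:
            break
        # shrink from the left until the enlarged window becomes valid again
        while not widths_ok(curve[begin:end + 1]):
            begin += 1
    return segments
```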
4 3D Discrete Curvature of Width ν
In this section, we are interested in curvature estimation based on the osculating circle. Following [10], we present below the notion of discrete curvature of width ν at each point of a 3D curve (see Fig. 2.b). This method extends the one proposed in [11] to blurred segments and is adapted to noisy curves thanks to the width parameter.
Let C be a 3D discrete curve and Ck a point of the curve. Let us consider the points Cl and Cr of C such that l < k < r, BS(l, k, ν) and ¬BS(l − 1, k, ν), BS(k, r, ν) and ¬BS(k, r + 1, ν). The points Cl and Cr for a given point Ck of C are deduced from the sequence of maximal blurred segments of width ν of C. The estimation of the 3D curvature of width ν at the point Ck is determined thanks to the radius of the circle passing through the points Cl, Ck and Cr. To determine the radius Rν(Ck) of the circumcircle of the triangle [Cl, Ck, Cr], we use the formula given in [15] as follows: let s1 = ||CkCr||, s2 = ||CkCl|| and s3 = ||ClCr||; then

Rν(Ck) = (s1 s2 s3) / √((s1 + s2 + s3)(s1 − s2 + s3)(s1 + s2 − s3)(s2 + s3 − s1)).

Then, the curvature of width ν at the point Ck is Cν(Ck) = s / Rν(Ck), with s = sign(det(CkCr, CkCl)) (it indicates concavities and convexities of the curve). Thanks to the sequence of maximal blurred segments of width ν (MBS_ν) of a 3D curve C, an algorithm for curvature estimation at each point of a 3D curve can be directly deduced from [10].
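The circumradius formula can be implemented directly. The following sketch returns only the magnitude 1/Rν(Ck) and leaves out the sign convention.

```python
import numpy as np

def curvature_width_nu(c_l, c_k, c_r):
    """Curvature of width nu at C_k from its neighbours C_l and C_r
    (obtained from the maximal blurred segments), via the circumradius
    formula quoted from [15]. Only the magnitude is returned."""
    c_l, c_k, c_r = map(np.asarray, (c_l, c_k, c_r))
    s1 = np.linalg.norm(c_r - c_k)
    s2 = np.linalg.norm(c_l - c_k)
    s3 = np.linalg.norm(c_r - c_l)
    denom = np.sqrt((s1 + s2 + s3) * (s1 - s2 + s3) *
                    (s1 + s2 - s3) * (s2 + s3 - s1))
    if denom == 0.0:          # collinear points: infinite radius, zero curvature
        return 0.0
    radius = s1 * s2 * s3 / denom
    return 1.0 / radius
```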
Algorithm 2. Width ν curvature estimation at each point of C
Data: C - 3D discrete curve of n points, ν - width of the segmentation
Result: {Cν(Ck)}k=0..n−1 - curvature of width ν at each point of C
begin
  Build MBS_ν = {MBS_i(B_i, E_i, ν)}_{i=0..m−1} (see Algorithm 1);
  m := |MBS_ν|; E_{−1} := −1; B_m := n;
  for i = 0 to m − 1 do
    for k = E_{i−1} + 1 to E_i do L(k) := B_i;
    for k = B_i to B_{i+1} − 1 do R(k) := E_i;
  end
  for i = 0 (*) to n − 1 (*) do
    Rν(Ci) := radius of the circumcircle of [C_{L(i)}, C_i, C_{R(i)}];
    Cν(Ci) := s / Rν(Ci);
  end
end

(*) The bounds mentioned in Algorithm 2 are correct for a closed curve. In the case of an open curve, the instruction becomes: for i = l to n − 1 − l, with l fixed to a constant value.
5 Discrete Torsion of Width ν

5.1 Preliminary
The 3D curvature is not sufficient to characterize the local property of a 3D curve. This parameter only measures how rapidly the direction of the curve changes. In case of a planar curve, the osculating plane does not change. For 3D
Fig. 2. From left to right: a. D3D(45, 27, 20, −45, −81, 90, 90), the optimal discrete line of the grey points; b. the curvature at the red point is defined as the inverse of the radius of the circumcircle passing through the two blue points and the red point
curves, torsion is a parameter that measures how rapidly the osculating plane changes. To clarify this notion, we recall below some definitions and results from differential geometry (see [16] for more details).

Definition 5. Let r : I → R³ be a regular unit speed curve parameterized by t.
i. T(t) (resp. N(t)) is a unit vector in the direction of r′(t) (resp. r″(t)). So, N(t) is a normal vector to T(t). T(t) (resp. N(t)) is called the unit tangent vector (resp. normal vector) at t.
ii. |T′(t)| = k(t) is called the curvature of r at t.
iii. The plane determined by the unit tangent and normal vectors T(t) and N(t) is called the osculating plane at t. The unit vector B(t) = T(t) ∧ N(t) is normal to the osculating plane and is called the binormal vector at t.
iv. |B′(t)| = τ(t) is called the torsion of the curve at t.

Theorem 1. Let r : I → R³ be a spatial curve parameterized by t.
i. The curvature of r at t ∈ I: k(t) = |r′ ∧ r″| / |r′|³.
ii. The torsion of r at t ∈ I: τ(t) = ((r′ ∧ r″) · r‴) / |r′ ∧ r″|².

Thanks to Theorem 1, the torsion value at a point is 0 if the curvature value at this point is 0.
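For reference, the formulas of Theorem 1 can be evaluated numerically for a parametric curve with finite differences. This is a generic illustration of the continuous notions, not the discrete estimators proposed in this paper; the step h is an arbitrary choice.

```python
import numpy as np

def curvature_torsion(r, t, h=1e-4):
    """Evaluate k(t) and tau(t) of Theorem 1 for a parametric curve
    r(t) -> R^3, using central finite differences for r', r'', r'''."""
    r1 = (r(t + h) - r(t - h)) / (2 * h)
    r2 = (r(t + h) - 2 * r(t) + r(t - h)) / h**2
    r3 = (r(t + 2*h) - 2*r(t + h) + 2*r(t - h) - r(t - 2*h)) / (2 * h**3)
    cross = np.cross(r1, r2)
    k = np.linalg.norm(cross) / np.linalg.norm(r1)**3
    tau = np.dot(cross, r3) / np.linalg.norm(cross)**2
    return k, tau

# Example: circular helix r(t) = (cos t, sin t, t), for which k = tau = 1/2.
helix = lambda t: np.array([np.cos(t), np.sin(t), t])
```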
5.2 Discrete Torsion
Discrete torsion was studied in [3,6,8,7]. In this section, we propose a new geometric approach to the problem of torsion estimation that uses the definitions and results presented in the previous sections.

Definitions. Let ζ be a 3D discrete curve and Ck the k-th point of the curve. Let us consider the points Cl and Cr of ζ such that l < k < r, BS(l, k, ν) & ¬BS(l − 1, k, ν) and BS(k, r, ν) & ¬BS(k, r + 1, ν). Let us recall that the curvature of width ν (see Section 4) is estimated by the circumcircle of the triangle Cl Ck Cr. If CkCl and CkCr
are collinear, the curvature value at Ck is 0, and therefore the torsion value at Ck is 0. So, without loss of generality, we suppose that ClCk and CkCr are not collinear. In addition, the plane defined by ClCk and CkCr is denoted (Cl, Ck, Cr), and we propose the definition below.

Definition 6. The osculating plane of width ν at Ck is the plane (Cl, Ck, Cr).

The osculating plane (Cl, Ck, Cr) has two unit tangent vectors: t1 = ClCk / ||ClCk|| and t2 = CkCr / ||CkCr||. Therefore, we have the binormal vector at the k-th point: bk = t1 ∧ t2 = (bx, by, bz). So, we propose the following definition of the discrete torsion of width ν.

Definition 7. The discrete torsion of width ν at Ck is the derivative of bk.

Torsion estimator. Our proposed method for torsion estimation is based on Definition 7. Let us remark that the set {bk}_{k=0..n−1} can be constructed from the set of maximal blurred segments in O(n log² n) time. So, we can obtain the torsion value by calculating the derivative at each position of {bk}_{k=0..n−1}. The traditional method for estimating the derivative of a discrete sequence uses a Gaussian kernel [17]. We propose below a geometric approach to this problem. Let us consider the curve ζ1 = {Pi}_{i=0..n} that is constructed by the rule Pi Pi+1 = bi, i = 0, .., n − 1 (see Fig. 3).

Fig. 3. The curve ζ1 is constructed from the sequence of binormal vectors

Proposition 1. The tangent vector at each point Pi of the curve ζ1 is bi (i = 0, .., n − 1).

Proof. In differential geometry, the tangent vector of a curve r(t) at the point Pt0 = r(t0) is defined as t(t0) = r′(t0) = lim_{h→0} (r(t0 + h) − r(t0))/h = lim_{h→0} Pt0 Pt0+h / h. Therefore, in discrete space, the tangent vector at the point Pi = r(i) can be estimated as t(i) = (r(i + 1) − r(i))/1 = Pi Pi+1 = bi.

Proposition 2. The torsion value at each point of the curve ζ corresponds to the curvature value of the curve ζ1.

Proof. Thanks to Definition 7, the discrete torsion at Ck of the curve ζ is the derivative of bk. In addition, bk is the tangent vector at the k-th point of the curve ζ1. So, this value is also the curvature value at the k-th point of the curve ζ1.
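The construction behind Definitions 6-7 and Propositions 1-2 can be sketched as follows. The sketch assumes index maps left(k) = L(k) and right(k) = R(k), with left(k) < k < right(k), obtained from the maximal blurred segments; it reuses the circumradius-based curvature sketch from Section 4 and, for brevity, measures the curvature of ζ1 from immediate neighbours rather than with width-ν blurred segments as in Algorithm 3.

```python
import numpy as np

def torsion_width_nu(curve, left, right):
    """Build the binormal vectors b_k from C_L(k), C_k, C_R(k), accumulate
    them into the auxiliary curve zeta_1 (P_0 = 0, P_{k+1} = P_k + b_k),
    and return, at each interior point, the curvature of zeta_1 as the
    torsion estimate. Requires curvature_width_nu from the earlier sketch."""
    curve = np.asarray(curve, dtype=float)
    n = len(curve)
    binormals = []
    for k in range(n):
        t1 = curve[k] - curve[left(k)]
        t2 = curve[right(k)] - curve[k]
        t1 = t1 / np.linalg.norm(t1)
        t2 = t2 / np.linalg.norm(t2)
        binormals.append(np.cross(t1, t2))
    zeta1 = np.vstack([np.zeros(3), np.cumsum(binormals, axis=0)])
    torsion = np.zeros(n)
    for k in range(1, n - 1):
        torsion[k] = curvature_width_nu(zeta1[k - 1], zeta1[k], zeta1[k + 1])
    return torsion
```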
Therefore, by using these two propositions, we can estimate the torsion value at each point of the curve ζ by determining the curvature value at the corresponding point of the curve ζ1. Our proposed method is presented in Algorithm 3; it uses the curvature estimator presented in Section 4.

Algorithm 3. Width ν torsion estimation at each point of ζ
Data: ζ - 3D discrete curve of n points, ν - width of the segmentation
Result: {Tν(Ck)}k=0..n−1 - torsion of width ν at each point of ζ
begin
  Build MBS_ν = {MBS_i(B_i, E_i, ν)}_{i=0..m−1};
  m := |MBS_ν|; E_{−1} := −1; B_m := n;
  for i = 0 to m − 1 do
    for k = E_{i−1} + 1 to E_i do L(k) := B_i;
    for k = B_i to B_{i+1} − 1 do R(k) := E_i;
  end
  for i = 0 (*) to n − 1 (*) do
    t1 := Ci C_{L(i)} / ||Ci C_{L(i)}||;  t2 := Ci C_{R(i)} / ||Ci C_{R(i)}||;  bi := t1 ∧ t2;
  end
  Construct ζ1 = {Pk}_{k=0..n}, with Pk Pk+1 = bk;
  Estimate the curvature value of width ν at each point of the curve ζ1 as the torsion value of the corresponding point of the curve ζ (see Algorithm 2);
end

(*) Same remark as for Algorithm 2.
6 Experiments
We introduce some experiments with our method on some ideal 3D curves: helix, Viviani, spheric, horopter and hyper helix curves. The tests are done after a process of discretization of these 3D curves (see Fig. 6). We have tested our methods on the following computer configuration: Pentium 4 CPU running at 3.2 GHz, 1 GB of RAM, Linux kernel 2.6.22-14 operating system. We introduce three criteria for measuring error: mean relative error (meanRE), max relative error (maxRE) and quadratic relative error (QRE). Let us consider two sequences, {IR_i}_{i=1..n} (resp. {RR_i}_{i=1..n}), the ideal result (resp. estimated result) at each position. So, we have:

meanRE = (1/n) Σ_{i=1..n} |RR_i − IR_i| / IR_i,   maxRE = max_{i=1..n} |RR_i − IR_i| / IR_i,

and

QRE = ( (1/n) Σ_{i=1..n} (|RR_i − IR_i| / IR_i)² )^{1/2}.
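These three criteria are straightforward to compute. The sketch below takes the absolute value of the ideal torsion in the denominator to keep the ratios positive; that is an implementation choice, not part of the definitions.

```python
import numpy as np

def relative_errors(ideal, estimated):
    """Return (meanRE, maxRE, QRE) between ideal values IR_i and estimated
    values RR_i. Zero ideal values would make the relative error undefined,
    which is exactly the issue discussed below for the hyper helix curve."""
    ideal = np.asarray(ideal, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    rel = np.abs(estimated - ideal) / np.abs(ideal)
    return rel.mean(), rel.max(), np.sqrt((rel ** 2).mean())
```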
Because the estimated result is not correct for the beginning and the end of the open curve (see the bounds mentioned in the algorithms 2 and 3), during the phase of error estimation we use the border parameter to eliminate this influence.
Fig. 4. Hyper helix curve: (a) the curve, (b) ideal torsion, (c) estimated torsion. Most of the hyper helix curve has a torsion value close to 0, so in this case the obtained result is the worst.
In most cases of the studied curves (see Table 1 and Fig. 6), the mean relative errors do not exceed 0.15, and the quadratic relative errors do not exceed 0.015. If the ideal torsion of the input curve has a value close to 0 at some positions, the obtained result is not very good. Consider the case of Viviani's curve in Table 1: the maximal relative error is high (15.6036); in spite of that, the mean relative error is acceptable (0.628899). In particular cases, if most of the input curve has a torsion value close to 0, the obtained result is the worst (see Fig. 4). Consider the case of the hyper helix curve (see Fig. 4). The problem is that the torsion approximation is not good at nearly-0 values. In spite of that, the approximation value is also close to 0, but the relative rate between the approximation value and the ideal value is very high. In Fig. 5, we show the relation between the approximated torsion and the ideal torsion of a hyper helix curve from index 15 to index 250. In this index interval, the ideal torsion is close to 0.

Fig. 5. Relative error

Table 1. Error estimation on the estimated torsion result

Curves       No of points  Border  meanRE     maxRE     QRE        Time (ms)
Spheric      255           30      0.13174    0.34281   0.164156   280
Horopter     239           30      0.0827392  0.186243  0.0979267  300
Helix        760           30      0.0481056  0.514517  0.0813778  920
Viviani      274           30      0.628899   15.6036   1.62278    290
Hyper helix  740           30      8551.24    154378    24704.2    720
Fig. 6. Experiments with width ν = 2: (a) helix curve, (b) ideal torsion, (c) estimated torsion of the helix; (d) Viviani's curve, (e) ideal torsion, (f) estimated torsion; (g) spheric curve, (h) ideal torsion, (i) estimated torsion; (j) horopter curve, (k) ideal torsion, (l) estimated torsion
Table 2. Error estimation on the estimated torsion result, threshold = 0.0005
Curves       No points  No considered points  Border  meanRE     maxRE     QRE
Spheric      255        255                   30      0.13174    0.34281   0.164156
Horopter     239        239                   30      0.0827392  0.186243  0.0979267
Helix        760        760                   30      0.0481056  0.514517  0.0813778
Viviani      274        270                   30      0.456954   3.51278   0.615561
Hyper helix  740        196                   30      1.15266    3.80191   1.62102
So, the relative error between the approximated torsion and the ideal torsion is very high, even though the approximation value does not exceed 0.006. We therefore propose to consider only the points whose ideal torsion value is greater than a threshold. In the error computation, n is then replaced by the number of points whose ideal torsion value is greater than the threshold. Table 2 shows the resulting errors for a threshold equal to 0.0005.
7 Conclusions
We have presented in this paper two methods to estimate the curvature and torsion of a 3D curve. These methods benefit from the improvement of the curvature estimator in the planar case [10], so they are efficient. These estimators permit the study of local properties of a spatial curve. We hope to identify and classify 3D objects by using these estimators. In the future, we will compare our methods with other existing methods [6,11,2,3] for curvature and torsion estimation. Moreover, we will work with real discrete data from biology or the medical area. We intend to present these works in a journal version.
References 1. Debled-Rennesson, I., Feschet, F., Rouyer-Degli, J.: Optimal blurred segments decomposition of noisy shapes in linear time. Computers & Graphics 30 (2006) ´ 2. Poyato, A.C., Garc´ıa, N.L.F., Carnicer, R.M., Madrid-Cuevas, F.J.: A method for dominant points detection and matching 2d object identification. In: Campilho, A.C., Kamel, M.S. (eds.) ICIAR 2004. LNCS, vol. 3211, pp. 424–431. Springer, Heidelberg (2004) 3. Lewiner, T., Gomes Jr., J.D., Lopes, H., Craizer, M.: Curvature and torsion estimators based on parametric curve fitting. Computers & Graphics 29, 641–655 (2005) 4. Salmon, J.P., Debled-Rennesson, I., Wendling, L.: A new method to detect arcs and segments from curvature profiles. In: ICPR (3), pp. 387–390 (2006) 5. Attneave, E.: Some informational aspects of visual perception. Psychol. Rev. 61 (1954) 6. Mokhtarian, F.: A theory of multiscale, torsion-based shape representation for space curves. Computer Vision and Image Understanding 68, 1–17 (1997) 7. Kehtarnavaz, N.D., de Figueiredo, R.J.P.: A 3-d contour segmentation scheme based on curvature and torsion. IEEE Trans. Pattern Anal. Mach. Intell. 10, 707– 713 (1988)
8. Medina, R., Wahle, A., Olszewski, M.E., Sonka, M.: Curvature and torsion estimation for coronary-artery motion analysis. In: SPIE Medical Imaging, vol. 5369, pp. 504–515 (2004) 9. Feschet, F., Tougne, L.: Optimal time computation of the tangent of a discrete curve: Application to the curvature. In: Bertrand, G., Couprie, M., Perroton, L. (eds.) DGCI 1999. LNCS, vol. 1568, pp. 31–40. Springer, Heidelberg (1999) 10. Nguyen, T.P., Debled-Rennesson, I.: Curvature estimation in noisy curves. In: Kropatsch, W.G., Kampel, M., Hanbury, A. (eds.) CAIP 2007. LNCS, vol. 4673, pp. 474–481. Springer, Heidelberg (2007) 11. Coeurjolly, D., Svensson, S.: Estimation of curvature along curves with application to fibres in 3d images of paper. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 247–254. Springer, Heidelberg (2003) 12. Reveill`es, J.P.: G´eom´etrie discr`ete, calculs en nombre entiers et algorithmique, Th`ese d’´etat. Universit´e Louis Pasteur, Strasbourg (1991) 13. Debled-Rennesson, I.: Reconnaissance des droites et plans discrets. PhD thesis, Louis Pasteur University (1995) 14. Coeurjolly, D., Debled-Rennesson, I., Teytaud, O.: Segmentation and length estimation of 3d discrete curves. In: Bertrand, G., Imiya, A., Klette, R. (eds.) Digital and Image Geometry. LNCS, vol. 2243, pp. 299–317. Springer, Heidelberg (2002) 15. Harris, J., Stocker, H.: Handbook of mathematics and computational science. Springer, Heidelberg (1998) 16. Oprea, J.: Differential geometry and its applications (2007) 17. Worring, M., Smeulders, A.W.M.: Digital curvature estimation. Computer Vision Graphics Image Processing: CVIU 58, 366–382 (1993)
Threshold Selection for Segmentation of Dense Objects in Tomograms

W. van Aarle, K.J. Batenburg, and J. Sijbers

IBBT - Vision Lab, University of Antwerp, Belgium
[email protected],
[email protected],
[email protected]
Abstract. Tomographic reconstructions are often segmented to extract valuable quantitative information. In this paper, we consider the problem of segmenting a dense object of constant density within a continuous tomogram, by means of global thresholding. Selecting the proper threshold is a nontrivial problem, for which hardly any automatic procedures exist. We propose a new method that exploits the available projection data to accurately determine the optimal global threshold. Results from simulation experiments show that our algorithm is capable of finding a threshold that is close to the optimal threshold value.
1 Introduction
Tomography is a technique for obtaining images (known as tomograms) of the interior of an object from projection data, acquired along a range of angles. Quantitative information about objects, such as their shape or volume, is often extracted as a post-processing step. To obtain such information from the greylevel tomogram, segmentation has to be performed. If the scanned object consists of a small set of materials, each corresponding to an approximately constant grey level in the reconstruction, it is sometimes possible to combine reconstruction and segmentation in a single step, using techniques from discrete tomography [1]. In some cases, this can lead to a dramatic reduction in the number of projections required for an accurate reconstruction. In many tomography applications, one is only interested in a partial segmentation, separating an object of interest from the remaining part of the tomogram. In this paper, we focus on the problem of segmenting one or more homogeneous objects that have a single density (i.e., grey level in the tomogram), within a surrounding of varying density. In addition, we assume that the density of these objects is higher than that of the remaining materials. Metal implants (e.g., steel, aluminium, titanium) for example, are quite common in medical tomograms. Separating these implants from the surrounding tissue is a nontrivial task. Segmentation of dense, homogeneous objects is not only important in medical imaging applications, but is often required in the field of material science
This work was financially supported by the IBBT, Flanders.
as well. Indeed, in that field, deformation of porous materials as a function of stress or heat, for example, is studied using dense marker particles. This requires accurate segmentation of the particles in the reconstructed tomogram. Global thresholding is a common choice for segmenting tomograms [2]. Typically, thresholds are selected based on the histogram of the tomogram [3]. If only a few materials are present and each of these corresponds to a distinct grey level peak in the histogram, it is possible to accurately determine appropriate thresholds, for example by analyzing the concavity points on the convex hull of the histogram [4] or by modeling the histogram as a mixture of a series of Gaussian distributions [5]. The most popular global threshold selection method, however, is the clustering method of Otsu [6]. It minimizes the weighted sum of intraclass variances of the foreground and the background. The problem with histogram-based methods in the context of segmenting a homogeneous object in a continuous grey level image is that there are no guaranteed histogram peaks representing the continuous background. Histogram-based methods are particularly inadequate if the object of interest is only slightly more dense than the surrounding materials. A different approach to segmentation of dense objects is provided by region-based algorithms such as region growing [7] and watershed segmentation [8]. Region growing requires the user input of one or more seed points, after which all their neighbouring pixels are evaluated and added to the segmentation if allowed by a certain inclusion criterion, such as a difference threshold, the choice of which is often subjective. The watershed method does not require additional user input, but is susceptible to false minima (parts of the continuous background that have the largest density in their local neighbourhood, but not in the entire image). The previously mentioned segmentation algorithms are all solely based on the tomogram, which is prone to reconstruction errors and artifacts. Recently, a new method was proposed for global [9] and local [10] threshold selection in tomograms, called PDM (projection distance minimization). Accurate segmentations can be achieved by exploiting the available projection data in addition to the tomogram. However, this approach requires that the scanned object only contains a few different densities. Even though the PDM method cannot be used directly for segmenting dense objects in a continuous surrounding, similar ideas can be applied to this segmentation problem. In this paper, we will introduce a new global thresholding method for dense object segmentation that also exploits the projection data. For each candidate segmentation, the projections of the segmented object are subtracted from the measured projection data, after which the remaining part of the image is reconstructed and checked for consistency with the residual projections. The threshold for which maximal consistency is obtained is selected for the segmentation. The paper is structured as follows. In Section 2, the tomography setting is introduced and the dense object segmentation problem is formally stated. Section 3 describes our threshold selection algorithm in detail. Experimental results are presented in Section 4. Section 5 concludes the paper.
2 Notation and Concepts
The unknown physical object from which projection data has been acquired is represented by a grey value image f : R² → R. We denote the set of all such functions f by F, also called the set of images. Projections are measured along lines l_{θ,t} = {(x, y) ∈ R² : x cos θ + y sin θ = t}, where θ represents the angle between the line and the y-axis and t represents the coordinate along the projection axis; see Fig. 1. Denote the set of all functions s : R × [0, 2π) → R by S. The Radon transform F → S is defined by

R(f)(t, θ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) δ(x cos θ + y sin θ − t) dx dy,

with δ(.) denoting the Dirac delta function.

Fig. 1. Basic setting of transmission tomography

The function R(f) ∈ S is also called the sinogram of f. In practice, Radon transform data is only available for a finite set of projection angles and for a finite set of lines. In our approach, the image f ∈ F is also discretized. Similar to algebraic reconstruction methods (i.e., ART, SART, see [11]), the image is represented on a rectangular grid of width w and height h. Put n = wh. We assume that the image is zero outside this rectangle. Let m denote the total number of measured detector values (for all angles) and let p ∈ R^m denote the measured data. The Radon transform can be modeled as a linear operator W that maps the image v ∈ R^n (representing the object) to the vector p of measured data, called the projection operator:

W v = p.    (1)

We represent W by an m × n matrix W = (w1 · · · wn), where wi denotes the i-th column vector of W. The vector p is called the forward projection or sinogram of v. Not all vectors in R^m are the sinogram of an image v, i.e., the projection operator is not surjective. For the continuous case, the set of all sinograms has been characterized by Ludwig and Helgason [12,13]. They describe a necessary and sufficient set of conditions for a function s ∈ S to be a sinogram. An important role in our proposed algorithm is played by sets of sinograms that correspond to images that are zero on a subset A ⊂ {1, . . . , n} of pixels. A sinogram p ∈ R^m is called A-consistent if p ∈ span{wa : a ∉ A}. The Euclidean distance between p and the nearest A-consistent vector is called the A-inconsistency of p. The notation introduced above allows us to define the problem concerning the segmentation of a dense homogeneous object within a continuous surrounding:

Problem 1. Let v ∈ R^n be an unknown image. Put ρ = max_i(v_i). Put A = {a : v_a = ρ}. Suppose that p = W v is given. Reconstruct A.
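The A-inconsistency can be computed as the least-squares residual of p with respect to the columns of W outside A. The following sketch uses a dense matrix as a toy stand-in for the projection operator; practical implementations use sparse or matrix-free operators.

```python
import numpy as np

def a_inconsistency(W, p, A):
    """Euclidean distance from the sinogram p to the nearest A-consistent
    vector, i.e. to span{w_a : a not in A}. W is a dense (m x n) projection
    matrix, p the measured data, A an index set of pixels."""
    mask = np.ones(W.shape[1], dtype=bool)
    mask[list(A)] = False                  # keep only columns outside A
    W_B = W[:, mask]
    coeffs, *_ = np.linalg.lstsq(W_B, p, rcond=None)
    residual = p - W_B @ coeffs
    return np.linalg.norm(residual)
```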
The continuous image v contains a discrete object of grey level ρ, corresponding to the set of pixels A, which have the highest grey value among all pixels of v. A reconstruction of A corresponds directly to a segmentation of the dense object. Note that we assume that the density ρ is not known in advance, which is common for practical tomogram segmentation problems. Problem 1 may not always have a unique solution and in its general form, as stated here, it is computationally hard. In fact, if v is a black-and-white image (i.e., vi ∈ {0, 1} for all i), it leads to a discrete tomography problem, which is NP-hard for certain projection matrices W [14]. We therefore consider a simpler segmentation problem, where an image v ˜ for which W v ˜ ≈ p is to be segmented by global thresholding. The image v ˜ may be obtained by applying a reconstruction algorithm, such as Filtered Backprojection or SART, to the given projection data. The reconstructed set A˜ is then formed by all pixels in v ˜ that have a value above the selected threshold τ ∈ R. This leads to the following, somewhat informal threshold estimation problem: Problem 2. Let v, A, ρ be as in Problem 1. Suppose that v ˜ ∈ Rn is given such that W v ˜ ≈ p. Find τ ∈ R such that A˜ := {a ∈ {1, . . . , n} : v˜a ≥ τ } is a good approximation of A. The concept of “good approximation of A” can be made precise by taking the symmetric difference between A and A˜ as a quantitative measure of their correspondence.
3 Algorithm
In this section, we introduce the Segmentation Consistency Maximization (SCM) algorithm for solving Problem 2. Before giving a concise description of the SCM algorithm, we will first provide a global overview. Our approach for estimating the optimal threshold τ is based on two main ideas: measuring whether a selected threshold is either too low (overestimation of A) or too high (underestimation of A), respectively.

Let τ be a candidate threshold and let Ã := {a : ṽa ≥ τ} denote the reconstructed set of pixels for the dense object. Reorder the pixels {1, . . . , n} and the corresponding columns of W, such that W = (W_Ã W_B), where W_Ã contains the columns of W corresponding to the pixels in Ã. According to the reconstructed set Ã, the unknown image v should satisfy va = ρ for a ∈ Ã and vb < ρ for b ∉ Ã. As W v = p, this means that for a perfect reconstruction Ã = A, the residual sinogram p_B = W_B v_B = p − Σ_{a∈Ã} ρ wa should be Ã-consistent, i.e., the measured sinogram p consists of (1) the forward projection of the dense object and (2) the forward projection of the continuous surrounding, which is zero in the interior of the dense object.

Now suppose that the reconstructed set Ã is an overestimation of A, i.e., A ⊂ Ã. Although there may be many images that correspond to the sinogram p, fixing all pixels in Ã to the density ρ will generally lead to an inconsistent residual sinogram for the surrounding pixels. By computing the Ã-inconsistency, we obtain a measure for the amount of overestimation. Instead of computing the Ã-inconsistency of the residual sinogram directly, we resort to an iterative reconstruction algorithm for the sake of computational efficiency:

– First, a least-squares solution of the system W_B ṽ_B = p_B is computed, using the iterative SIRT algorithm as described in [15].
– Next, the Ã-inconsistency is computed as ||W_B ṽ_B − p_B||₂.

Now suppose that the reconstructed set Ã is an underestimation of A, i.e., Ã ⊂ A. In that case, the residual sinogram p_B will be Ã-consistent. Here, we can make effective use of the experimental convergence properties of the SIRT algorithm. As the segmented dense object becomes smaller (i.e., the threshold τ is increased), the remaining set of pixels B = {1, . . . , n}\Ã becomes larger, resulting in slower convergence for the iterative SIRT algorithm. If we terminate after a fixed number of iterations, the Ã-inconsistency ||W_B ṽ_B − p_B||₂ will therefore generally increase along with the threshold τ.

Fig. 2. The relation between the number of fixed pixels and the Ã-inconsistency after 10 iterations of the SIRT algorithm [15]

Fig. 2 shows an experimental confirmation of this algorithm property. For a Modified Shepp-Logan phantom image, the Ã-inconsistency after 10 iterations of SIRT was computed, for increasingly small random subsets of Ã (indicated as a percentage of the dense object). Each experiment was repeated 5 times. The figure shows a strictly increasing relation between the size of the set of remaining pixels (not belonging to the dense object) and the computed Ã-inconsistency.

The SCM algorithm incorporates the properties described above, by using the Ã-inconsistency found after a fixed number of SIRT iterations as a quantitative measure for the quality of the selected threshold. Starting at a low threshold, i.e., an overestimation of A, the threshold is gradually increased while keeping track of the Ã-inconsistency of the corresponding segmentation Ã. As long as A ⊂ Ã, the Ã-inconsistency will typically decrease. It will start to increase again once Ã ⊂ A. Of course, there may be a threshold interval where neither A ⊂ Ã nor Ã ⊂ A. Our experimental results in the next section suggest that the Ã-inconsistency can be used as an effective measure for the segmentation quality within this range as well.

A flow chart of the SCM algorithm is shown in Fig. 3. In Fig. 4, a pseudo code description of the algorithm is given. Starting from an initial tomogram ṽ for which W ṽ ≈ p and a low threshold for the dense object, the algorithm searches for the minimum of the Ã-inconsistency.
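The score evaluated inside this search can be sketched as follows, mirroring the flow chart of Fig. 3 and the pseudo code of Fig. 4. The dense matrix, the basic SIRT update and the parameter names are illustrative simplifications, not the authors' implementation.

```python
import numpy as np

def scm_score(W, p, v_rec, tau, n_iter=10):
    """SCM score for a candidate global threshold tau: pixels of the
    reconstruction v_rec at or above tau are fixed at their mean grey value
    rho, their forward projection is subtracted from p, the remaining pixels
    are reconstructed by a few SIRT iterations, and the score is the norm of
    the mismatch with the measured sinogram. Assumes nonnegative W entries
    and at least one pixel above tau."""
    dense = v_rec >= tau
    rho = v_rec[dense].mean()
    s = np.where(dense, rho, 0.0)
    p_B = p - W @ s                                # residual sinogram
    W_B = W[:, ~dense]
    col = np.maximum(W_B.sum(axis=0), 1e-12)       # SIRT column sums
    row = np.maximum(W_B.sum(axis=1), 1e-12)       # SIRT row sums
    v_B = np.zeros(W_B.shape[1])
    for _ in range(n_iter):
        v_B += (W_B.T @ ((p_B - W_B @ v_B) / row)) / col
        v_B = np.clip(v_B, 0.0, tau)               # keep the background below tau
    return np.linalg.norm(W_B @ v_B + W @ s - p)
```

The threshold search itself then follows Fig. 4: increase τ by a step as long as the score decreases, and halve the step once it starts to increase.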
4 Results
Simulation experiments have been performed, based on two phantom images of size 256×256, one representing a mouse femur with a metallic cylinder (Fig. 5(a))
Fig. 3. Scheme of the Segmentation Consistency Maximization (SCM) Algorithm
Input: reconstructed image ṽ such that W ṽ ≈ p;
τ0 := initial threshold (specified by user);
stepsize := initial stepsize (specified by user);
i := 0;
repeat
  Put Ã := {a : ṽa ≥ τi};
  ρi := average grey level value of {ṽj : j ∈ Ã};
  for j = 1, . . . , n: put s̃j := ρi if j ∈ Ã; put s̃j := 0 if j ∉ Ã;
  pÃ := W s̃;
  pB := p − pÃ;
  Compute a least-squares solution ṽB of W ṽB = pB, while keeping (ṽB)j = 0 for all j ∈ Ã and restricting the remaining entries to the range [0, τi];
  pB := W ṽB;
  scoreτi := ||pB + pÃ − p||2;
  if scoreτi < scoreτopt then
    τopt := τi; τi+1 := τi + stepsize;
  else
    stepsize := stepsize ∗ 0.5; τi+1 := τi−1 + stepsize;
  i := i + 1;
until abs(stepsize) < minimum stepsize (specified by user);
Output: Aopt := {a : ṽa ≥ τopt};
Fig. 4. Pseudo code of the Segmentation Consistency Maximization (SCM) algorithm. The initial stepsize should be a positive number iff τ0 is chosen smaller than τopt .
Fig. 5. (a,d) 256×256 phantom images (femur and foam); (b,e) tomograms ṽ; (c,f) the optimal segmentations s̃opt found by the SCM algorithm
and one representing a foam containing a number of dense marker particles (Fig. 5(d)). Both phantom images contain one or more areas of constant grey level, as well as a continuously varying background. For both phantoms, a parallel-beam sinogram was simulated using 100 equally spaced projection angles between 0 and 180°. For realism, we added Poisson noise to these sinograms. Grey level reconstructions were then computed using 100 iterations of the SIRT algorithm described in [15], shown in Fig. 5(b) and Fig. 5(e). Note that in the SIRT reconstructions, the grey level of the dense objects is no longer constant. In particular, it varies gradually at the boundary of the objects, which illustrates the importance of the selection of a proper threshold for the segmentation. In the first experiment, we compared the threshold chosen by the SCM algorithm with the optimal threshold (fewest pixel differences with the phantom image). The number of misclassified pixels and the SCM scores were calculated for an entire range of global thresholds. These are plotted in Fig. 6(a) and Fig. 6(c) for the femur tomogram and in Fig. 6(b) and Fig. 6(d) for the foam tomogram. Ideally, the distance between the minima of these two curves should be as small as possible. The optimal threshold for the femur tomogram lies at 204.4 for a total of 1104 misclassified pixels. The SCM algorithm suggests a threshold of 200.8, resulting in a segmentation (Fig. 5(c)) with 1126 misclassified pixels, only 22 more than the best achievable using global thresholding and the tomogram of
Fig. 6. (a,b) number of misclassified pixels for a range of global threshold values ((a) femur pixel error, (b) foam pixel error); (c,d) quantitative score calculated by the SCM algorithm for the same range of global thresholds ((c) femur SCM score, (d) foam SCM score). The minima (circle) of these plots should lie close together.
For the foam tomogram, SCM found that the global threshold should be 172.1 (Fig. 6(d)), whereas the optimal threshold is 173.3 (Fig. 6(b)), thus misclassifying only 7 pixels more than the absolute minimum (144 vs. 137). In the second experiment, we compare the SCM algorithm to two other segmentation techniques: the Otsu clustering method and watershed segmentation, where morphological filters are first used to remove the noise and to find foreground markers. We performed this experiment with a range of different contrasts between the foreground and the background. Again, we started from the phantoms shown in Fig. 5(a) and 5(d). The background of these images was multiplied such that the ratio between the foreground and the maximum value of the background is 0.65, 0.70, ..., 0.95. We then created sinograms and tomograms in the same way as in the first experiment. From these, we generated the optimal segmentations according to the Otsu, watershed and SCM algorithms. For the thresholding methods, Otsu and SCM, we examined the suggested thresholds and compared them to the optimal threshold, where the pixel error is smallest. The results of this are shown in Fig. 7(a) and 7(b). It is clear that the thresholds chosen by the SCM algorithm are a good approximation of the optimal thresholds, independent of the contrast. The SCM algorithm also clearly outperforms the Otsu clustering method. We then focused on the number of misclassified pixels. This is shown in Fig. 7(c) and 7(d). The solid line represents the error of the optimal threshold and we cannot expect the SCM algorithm (dashed line) to go lower than this line.
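The Otsu baseline used in this comparison can be reproduced with off-the-shelf tools; the snippet below is only an illustrative sketch (the variable `tomogram` stands for a 2-D grey level reconstruction and is not part of the original experimental code).

```python
import numpy as np
from skimage.filters import threshold_otsu

def otsu_segmentation(tomogram):
    """Global threshold chosen from the grey level histogram alone,
    ignoring the projection data that SCM exploits."""
    t = threshold_otsu(np.asarray(tomogram))
    return tomogram >= t, t
```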
Fig. 7. (a,b) Optimal thresholds for different levels of contrast (optimal, Otsu and SCM); (c,d) number of misclassified pixels for different levels of contrast (optimal threshold, Otsu, watershed segmentation and SCM)
Yet, it approximates the optimal line very well. Generally, the SCM algorithm also outperforms the watershed algorithm (dash-dotted line). Only in Fig. 7(c), where the contrast is very low, does the watershed method perform better. This is because these tomograms have a very poor accuracy, meaning that any global thresholding algorithm will suffer. We expect that this problem can be overcome by using local thresholding with the SCM technique. A disadvantage of the SCM algorithm is its computational requirement. This can be attributed for the most part to the SIRT reconstruction that is performed every time a segmentation is scored. On our test PC, running at 2.6 GHz, the running time of each segmentation measurement is around 65 s. This can be improved by using GPU computing. Using modern GPU cards, we were able to gain a speedup of 40 on iterative reconstruction techniques such as SIRT. We expect that the same speedup can be reached for the SCM algorithm.
5
Conclusions
To obtain quantitative information about an object from its tomographic projection data, it is often necessary to compute a segmented reconstruction. In this paper, we have considered a variant of this segmentation problem, where a single dense object of constant density must be separated from a continuously varying background. Global thresholding is a well-known and common segmentation approach for this problem. However, the automatic selection of the optimal
threshold value is a difficult problem if the background contains a large number of distinct grey level values. We proposed the Segmentation Consistency Maximization (SCM) algorithm for threshold selection, which exploits the available projection data. The SCM algorithm is based on the concept of A-inconsistency, which measures the consistency of the sinogram that remains after the dense object has been subtracted from the measured projections. Experimental results demonstrate that the proposed SCM algorithm can be used effectively for determining accurate thresholds.
References
1. Herman, G.T., Kuba, A.: Discrete Tomography: Foundations, Algorithms and Applications. Birkhäuser (1999)
2. Eichler, M.J., Kim, C.H., Müller, R., Guo, X.E.: Impact of thresholding techniques on micro-CT image based computational models of trabecular bone. In: ASME Advances in Bioengineering, vol. 48, pp. 215–216 (September 2000)
3. Glasbey, C.A.: An analysis of histogram-based thresholding algorithms. Graphical Models and Image Processing 55(6), 532–537 (1993)
4. Rosenfeld, A., Torre, P.: Histogram concavity analysis as an aid in threshold selection. IEEE Trans. Syst., Man, Cybern. 13, 231–235 (1983)
5. Ridler, T.W., Calvard, S.: Picture thresholding using an iterative selection method. IEEE Trans. Syst., Man, Cybern. 8, 630–632 (1978)
6. Otsu, N.: A threshold selection method from gray level histograms. IEEE Trans. Syst., Man, Cybern. 9, 62–66 (1979)
7. Yu, X., YlaJaaski, J.: A new algorithm for image segmentation based in region growing and edge detection. In: Proc. Int. Symp. Circuits and Systems, vol. 1, pp. 516–519 (1991)
8. Beucher, S., Meyer, F.: The morphological approach to segmentation: the watershed transformation. In: Dougherty, E.R. (ed.) Mathematical Morphology in Image Processing, pp. 433–481. Marcel Dekker, New York (1993)
9. Batenburg, K.J., Sijbers, J.: Optimal threshold selection for tomogram segmentation by reprojection of the reconstructed image. In: Kropatsch, W.G., Kampel, M., Hanbury, A. (eds.) CAIP 2007. LNCS, vol. 4673, pp. 563–570. Springer, Heidelberg (2007)
10. Batenburg, K.J., Sijbers, J.: Selection of local thresholds for tomogram segmentation by projection distance minimization. In: Coeurjolly, D., Sivignon, I., Tougne, L., Dupont, F. (eds.) DGCI 2008. LNCS, vol. 4992, pp. 380–391. Springer, Heidelberg (2008)
11. Kak, A.C., Slaney, M.: Principles of Computerized Tomographic Imaging. Chapter: Algorithms for reconstruction with non-diffracting sources, pp. 49–112. IEEE Press, New York (1988)
12. Ludwig, D.: The Radon transform on Euclidean space. Commun. Pure Appl. Math. 19, 49–81 (1966)
13. Helgason, S.: The Radon Transform. Birkhäuser, Boston (1980)
14. Gardner, R.J., Gritzmann, P.: Uniqueness and complexity in discrete tomography. In: Discrete Tomography: Foundations, Algorithms and Applications, pp. 85–111. Birkhäuser (1999)
15. Gregor, J., Benson, T.: Computational analysis and improvement of SIRT. IEEE Trans. Medical Imaging 27(7), 918–924 (2008)
Comparison of Discrete Curvature Estimators and Application to Corner Detection
B. Kerautret¹, J.-O. Lachaud², and B. Naegel¹
¹ LORIA, Nancy-University - IUT de Saint Dié des Vosges, 54506 Vandœuvre-lès-Nancy Cedex, {kerautre,naegelbe}@loria.fr
² LAMA, University of Savoie, 73376 Le Bourget du Lac
[email protected]
Abstract. Several curvature estimators along digital contours were proposed in recent works [1,2,3]. These estimators are adapted to non-perfect digitization processes and can process noisy contours. In this paper, we compare and analyse the performance of these estimators on several types of contours, and we measure execution times on both perfect and noisy shapes. In a second part, we evaluate these estimators in the context of corner detection. Finally, to evaluate the performance of a non-curvature-based approach, we compare the results with a morphological corner detector [4].
1
Introduction
Extracting geometric features like perimeter, area or curvature plays an important role in the field of pattern recognition. The applications are varied, from pattern matching to pattern analysis, for example the discrimination of similar handwritten numerals [5]. Recently, three new methods were proposed to compute curvature on noisy discrete data. First, Nguyen et al. extended the estimator of osculating circles proposed by Coeurjolly et al. [6] by using blurred segments [1]. In this way, the estimation of osculating circles behaves better on noisy contours and is also meaningful for a non-connected (but ordered) set of points. Then a second approach, proposed by Kerautret and Lachaud [2], minimizes the curvature while satisfying geometric constraints derived from tangent directions computed on the discrete contour. The robustness to noise of this approach is also obtained with blurred segments, but used in a different manner. The third curvature estimator, proposed at the same time by Malgouyres et al. [3], uses discrete binomial convolutions to obtain a convergent estimator adapted to noisy data.
This work was partially funded by the ANR project GeoDIB, no. ANR-06-BLAN0225. Bertrand Kerautret was partially funded by a BQR project of Nancy University.
In this work, we propose to evaluate these three estimators experimentally by using different test contours and by measuring the precision of the estimation for several grid sizes. After comparing the curvature obtained at low and high resolutions, we measure the robustness of the estimators on noisy contours. A second objective of this paper is to apply the most stable and robust of these estimators to a concrete application of corner detection (see the previous figure). Moreover, we compare the obtained results with a recent morphological corner detector [4]. This paper is organized as follows. In the following section, after briefly introducing some notions of digital blurred segments, we give a short description of the three estimators. Then the comparison of these estimators is given in Section 3. Section 4 shows the application to corner detection and describes the comparative evaluation.
2
Discrete Curvature Estimators
Before describing the curvature estimators previously mentioned, we address the notion of digital blurred segments used by the first two estimators. Blurred segments were introduced by Debled et al. [7]. Note that a comparable algorithm was proposed by Buzer [8]. The notion of blurred segments is based on the arithmetic definition of digital straight lines and uses the computation of the convex hull to determine the vertical geometric width ν. From this definition, the authors proposed an algorithm for the recognition of blurred segments of width ν. Note that the two following curvature estimators use the version of the algorithm which is not restricted to the hypothesis that points are added with increasing x coordinate (or y coordinate). The next floating figure illustrates two blurred segments of width 5 computed from the central black pixel.
2.1 Estimator Based on Osculating Circles (CC and NDC)
The curvature estimator proposed by Nguyen and Debled-Rennesson [7] follows the same concept as the estimator proposed by Coeurjolly et al. [6] (called the CC estimator). The latter is based on the estimation of osculating circles. More precisely, denote by C a discrete curve, by C_{i,j} the sequence of points going from C_i to C_j, and by BS_ν(i,j) the predicate "C_{i,j} is a blurred segment of width ν". We consider the points C_r and C_l defined such that: BS_ν(l,k) ∧ ¬BS_ν(l−1,k) ∧ BS_ν(k,r) ∧ ¬BS_ν(k,r+1). With this definition, they determine the curvature of width ν at the point C_k from the radius of the circle passing through C_k, C_l and C_r. By noting s_1 = ||C_kC_r||, s_2 = ||C_kC_l|| and s_3 = ||C_lC_r||, the authors give the following expression for the radius R_ν of the circle associated with the point C_k:

R_ν(C_k) = s_1 s_2 s_3 / √((s_1 + s_2 + s_3)(s_1 − s_2 + s_3)(s_1 + s_2 − s_3)(s_2 + s_3 − s_1))
If the vectors C_kC_r and C_kC_l are not collinear, then the curvature of width ν can be determined as sign(det(C_kC_r, C_kC_l)) / R_ν(C_k). Otherwise the curvature is set to 0.
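The radius and sign computation above translates directly into code. The sketch below assumes that the three points C_k, C_l and C_r have already been located by the blurred segment recognition, which is not reproduced here.

```python
import numpy as np

def osculating_curvature(ck, cl, cr):
    """Signed curvature at ck from the circle through ck, cl and cr (2D points).
    Returns 0 when the three points are (nearly) collinear."""
    ck, cl, cr = (np.asarray(q, dtype=float) for q in (ck, cl, cr))
    s1 = np.linalg.norm(cr - ck)          # |Ck Cr|
    s2 = np.linalg.norm(cl - ck)          # |Ck Cl|
    s3 = np.linalg.norm(cr - cl)          # |Cl Cr|
    prod = (s1 + s2 + s3) * (s1 - s2 + s3) * (s1 + s2 - s3) * (s2 + s3 - s1)
    if prod <= 1e-12:                     # collinear: infinite radius
        return 0.0
    radius = (s1 * s2 * s3) / np.sqrt(prod)
    # Orientation of the pair (Ck Cr, Ck Cl) gives the sign of the curvature.
    det = (cr[0] - ck[0]) * (cl[1] - ck[1]) - (cr[1] - ck[1]) * (cl[0] - ck[0])
    return np.sign(det) / radius
```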
2.2 Global Minimization Curvature Estimator (GMC)
The estimator proposed by Kerautret and Lachaud is based on two main ideas [2]. The first one is to take into account all the possible shapes that have the same digitization as the initial contour and to select the most probable contour. This selection is done with a global optimization that minimizes the squared curvature. From this idea, we can expect a precise result even with a low contour resolution. The second idea is to obtain an estimator that is not sensitive to noise or to a non-perfect digitization process, by using blurred segments. For the purpose of minimizing curvature, the tangential cover [9] is computed on the discrete contour and the minimal and maximal possible tangent values are defined for each pixel. Fig. 1 illustrates both the tangential cover (a) and the bounds on the tangent directions (b). Minimizing the curvature consists in moving each point (x_i, y_i) along the y axis, between the bounds defined by y_min and y_max, in order to minimize the slope of the line joining (x_i, y_i) to (x_{i+1}, y_{i+1}). The global minimization is applied with a relaxation process. For robustness to noise and to a non-perfect digitization process, the discrete maximal segments were replaced by the discrete maximal blurred segments previously described. Note that the definition of the minimal and maximal slopes of the blurred segments is notably different from the non-blurred case (see [2] for more details).
Fig. 1. Illustration of bounds defined from maximal segments (a) (extracted from the previous floating figure). (b) shows the constraints defined on each variable.
2.3 Binomial Convolution Curvature Estimator (BCC)
Malgouyres et al. proposed an algorithm to estimate derivatives with binomial convolutions [3] which is claimed to be multigrid convergent. They define the
operator Ψ_K which modifies the function F : Z → Z by the convolution with a kernel K : Z → Z. For instance, the backward finite difference is given by Ψ_δ F, where the kernel δ is defined by:

δ(a) = 1 if a = 0;  −1 if a = 1;  0 otherwise.

Then the authors give the smoothing kernel defined as:

H_n(a) = C(n, a + n/2) if n is even and a ∈ {−n/2, ..., n/2};
         C(n, a + (n+1)/2) if n is odd and a ∈ {−(n+1)/2, ..., (n−1)/2};
         0 otherwise,

where C(n, k) denotes the binomial coefficient "n choose k". Finally, the derivative kernel D_n is given by D_n = δ ∗ H_n and the derivative estimator is (1/2^n) Ψ_{D_n} F. An interesting point of the method is that higher order derivatives have similar expressions:

D_n² = δ ∗ δ ∗ D_n    (1)

And for the curvature we compute:

(D_n²(x) ∗ D_n(y) − D_n²(y) ∗ D_n(x)) / (D_n(x)² + D_n(y)²)    (2)

The value of n is defined by n = h^(2(α−3)/3), where h represents the grid size and α ∈ ]0, 1] is an additional parameter which can be associated with the amount of noise. More precisely, a value close to 0 allows more noise than a value close to 1.
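A compact way to experiment with the BCC estimator is sketched below. It assumes an even mask size n and a closed (periodic) contour given by its x and y coordinate sequences; the second derivative is obtained here simply by applying the derivative kernel twice, and the denominator uses the usual 3/2 normalisation of the continuous curvature formula.

```python
import numpy as np
from scipy.ndimage import convolve1d
from scipy.special import comb

def binomial_kernel(n):
    """Smoothing kernel H_n for even n: binomial coefficients C(n, a + n/2)."""
    a = np.arange(-n // 2, n // 2 + 1)
    return comb(n, a + n // 2)

def bcc_derivative(f, n):
    """Derivative estimate of a closed digital sequence f with D_n = delta * H_n."""
    h = binomial_kernel(n)
    smoothed = convolve1d(np.asarray(f, float), h, mode="wrap") / h.sum()
    return smoothed - np.roll(smoothed, 1)        # backward difference (delta kernel)

def bcc_curvature(x, y, n):
    """Curvature from first and second derivative estimates, in the spirit of Eq. (2)."""
    dx, dy = bcc_derivative(x, n), bcc_derivative(y, n)
    ddx, ddy = bcc_derivative(dx, n), bcc_derivative(dy, n)
    return (ddx * dy - ddy * dx) / np.maximum(dx**2 + dy**2, 1e-12) ** 1.5
```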
3
Experimental Comparisons
The objective is to measure the precision obtained with the previously described estimators and to give execution times. For this purpose, several data sets were generated with different grid sizes (h), for coarse (h = 1), medium (h = 0.1) and fine resolutions (h = 0.01). Fig. 2 illustrates the two shapes defined with a grid step equal to 1 (images (a) and (b)). The choice of these test shapes is justified by the fact that it allows analysing the performance of the estimators both on smooth shapes with quick turns and on polygonal shapes with corners. We have applied the three estimators on the previous data. For the BCC estimator, the parameter α was set to 1 and the value of n was thus set to h^(−4/3). For the other estimators, no parameters were used. From the resulting curvature graphs of Fig. 2, it can be seen that for both the flower and the polygon, the BCC estimator shows numerous oscillations with coarse grid sizes (h = 1 and h = 0.1). However, the finer the grid size is, the more stable the curvature values of the BCC estimator become. It appears that this is not the case with the CC estimator.
Table 1. Mean squared curvature error (upper part of the table) and execution times (lower part of the table) obtained with the estimators applied on the shapes of Fig. 2 and on a circle of radius 20

shape    |         Flower          |          Circle           |         Polygon
h        |   1      0.1     0.01   |   1       0.1      0.01   |   1       0.1      0.01
CC       | 0.0945  0.0225  0.0079  | 0.0005   0.0009   0.0013  | 0.0004   0.0003   8.1e-05
GMC      | 0.0966  0.0346  0.0049  | 2.5e-07  3.2e-10  4.3e-08 | 0.0113   0.3089   3.428
BCC      | 0.0855  0.0185  0.0081  | 0.0178   0.0012   0.0001  | 0.0232   0.0261   0.0510
CC (ms)  |   6       84     891    |   6        55      637    |   8        82      870
GMC (ms) |   0       75    2593    |   2       363     2673    |   0         4       67
BCC (ms) |   0       18    4514    |   0        14     3275    |   0        18     4501
Indeed, the oscillations appear to be more important at fine grid sizes. For the GMC estimator, we can see that it is very stable, since no oscillations are visible. From the squared curvature errors, there is no single superior estimator for all test shapes: each estimator has a preferred type of shape. Execution times were measured for these curvature estimations (Tab. 1). On average, we can see that the CC estimator is faster than the GMC or BCC estimator. An exception can be seen for the results obtained on the polygon, where GMC is faster than CC and BCC. This is simply due to the fact that the optimisation process is performed only on points on the frontier of constant tangential cover areas (see Fig. 1), and for a polygon this number of variables is much smaller than for a circular shape. In order to measure the stability of the curvature estimators on noisy shapes, noise was added according to the model proposed by Kanungo [10]. Fig. 3 shows such noise addition (images (a) and (b)). From these results, we can see that the GMC estimator is very stable compared to the NDC estimator. Even with width 2 the results are stable, except for a small part with negative curvature (image (f)). In the same way, the results of the BCC estimator appear satisfying, even if a large size n was needed to obtain stable values. The absence of oscillations constitutes a real advantage for designing a simple and efficient curvature-based algorithm. In the following, we therefore exploit the stability of the GMC estimator in order to implement a robust corner detector.
4
Application to Corner Detection
We introduce a simple algorithm to detect corner points. The main steps of the algorithm for convex corner points are:
1. for all contour points (p_i)_{i∈I}, compute the curvature κ(p_i) with the GMC estimator with width ν.
2. detect all the maximal curvature regions defined by sets of consecutive points:
   R_k = {(p_i)_{i∈[a,b]} | ∀i, κ(p_i) = κ(p_a) ∧ κ(p_{a−1}) < κ(p_a) ∧ κ(p_{b+1}) < κ(p_b) ∧ κ(p_a) > 0}
Fig. 2. Comparisons of the CC, GMC and BCC estimators on the flower (a,d,f,h) and on the polygon (b,e,g,i), for grid sizes 1 (d,e), 0.1 (f,g) and 0.01 (h,i). (c) shows the real curvature associated to the flower shape.
Fig. 3. Results obtained on the noisy version of the shapes of figure 2: (c,d) NDC estimator, (e,f) GMC estimator, (g,h) BCC estimator. Each row shows the result obtained for each estimator used with different parameters associated with the amount of noise (width ν for GMC and NDC, and mask size for the BCC estimator).
3. for each region R_k, mark the point p_{(a+b)/2} as a corner.
4. (optional) select only corner points with curvature bigger than a threshold value κ_min.
Note that a quantification of the curvature field was applied in order to simplify the detection of the local minima/maxima (set to 1e−3). In the rest of this paper we have not applied the optional step (4), since it adds a new parameter and it could be fixed for each particular application. The concave corner detection is deduced by replacing the maximal by the minimal curvature regions. The previous algorithm was applied on the standard image collection available on-line [11] (see Fig. 4). All the corner detections were performed with the GMC estimator with the standard width ν = 2, which is a usual value for real data presenting a weak amount of noise. We can see in Fig. 4 that some parts which are only local minima are considered as corners. Note that these points could easily be removed by applying a small threshold on the curvature values of the local maxima. For this standard dataset our results are comparable to previous works, but we can note that we have only one parameter, which can remain unchanged for all data considered as non-noisy. This is not the case, for example, for the corner detector proposed in [4], as described below. The images of Fig. 5 were generated from the previous test images and with the same noise model as mentioned in the previous section. The value of the width ν used for the curvature estimation was set to 4, according to the quantity of noise. Note that we are currently working on an automated determination of this parameter. Despite the important amount of noise, the corners were well detected. The only defect can be seen on the left wing of the plane (h), but it disappears when using a width equal to 5. Even with a lot of noise, corners can be detected, as illustrated in the introductory figure on page 1 (ν = 13). Note that the total execution time was approx. 3 s for the test images of Fig. 4 and Fig. 5.
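A minimal sketch of the plateau detection behind steps 2 and 3 is given below; it assumes the curvature values along a closed contour have already been computed (e.g. with the GMC estimator) and ignores the special case of a plateau wrapping around the start index.

```python
import numpy as np

def detect_convex_corners(curvatures, quantization=1e-3):
    """Return contour indices marked as convex corners (midpoints of
    maximal constant-curvature plateaus with positive curvature)."""
    k = np.round(np.asarray(curvatures, float) / quantization) * quantization
    n, corners, i = len(k), [], 0
    while i < n:
        j = i
        while j + 1 < n and k[j + 1] == k[i]:
            j += 1                              # extend the plateau [i, j]
        left, right = k[(i - 1) % n], k[(j + 1) % n]
        if k[i] > 0 and left < k[i] and right < k[i]:
            corners.append((i + j) // 2)        # midpoint of the plateau
        i = j + 1
    return corners
```

Concave corners are obtained symmetrically by looking for minimal plateaus with negative curvature.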
Fig. 4. Results of corner detection obtained with the GMC estimator (with width ν = 2). Convex and concave corners are respectively represented by crosses and disks.
Fig. 5. Results obtained on the noisy versions of the original test images. The width ν of the GMC estimator was set to 4 for all the images.
Fig. 6. Results obtained with the BAP algorithm on the noisy versions of the original test images. The crosses and circles represent corner points detected with parameters (λ, L, H) = (2, 2, 1) and (λ, L, H) = (3, 1, 1), respectively.
We have compared our new corner detector with a recent morphological method identified as BAP (Base Angle Points) [4], which shows good performance in comparison with other techniques of the literature. This method is based on the analysis of the residuals of an opening of the original image, considered as a set I, with a disk structuring element of size λ (this operator is called the white top-hat [12]): C(I) = I \ γ_{B_λ}(I), where γ_B(X) = ∪_i {B_i | B_i ⊆ X}. These components contain potential corner points. Some filtering can be performed to suppress irrelevant components (using a size criterion, for example). For each residual component, an isosceles triangle is computed, composed of two base points B, C and of the corner point A. The length L of the base segment |BC| and the height H of the triangle can be used to construct a filter which discriminates the true corners from noisy parts. We have experimented with this technique using three parameters: the radius λ of the disk structuring element, and the minimal values L and H for each residual component. Fig. 6 shows the results obtained with the BAP method on the noisy shapes.
For each experiment, despite manually tuning these parameters to obtain the best possible results, the results on noisy data are not convincing compared to our proposed method.
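For reference, the residue computation at the heart of BAP can be sketched with standard morphology routines; the triangle fitting and the (L, H) filtering are omitted, and the component centroid is used only as a rough stand-in for the detected corner point.

```python
import numpy as np
from skimage.morphology import disk, white_tophat
from skimage.measure import label, regionprops

def bap_candidate_corners(binary_shape, lam=2, min_area=1):
    """Connected components of the white top-hat residue (the image minus its
    opening by a disk of radius lam); each component contains a potential corner."""
    img = np.asarray(binary_shape, dtype=np.uint8)
    residue = white_tophat(img, disk(lam)) > 0
    regions = regionprops(label(residue))
    return [tuple(np.round(r.centroid).astype(int))
            for r in regions if r.area >= min_area]
```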
5
Conclusion
We have compared several recent discrete curvature estimators adapted to noisy discrete data. The mean error obtained on three sorts of shapes does not single out a particular estimator, since for each shape a specific estimator gives the smallest mean error. Regarding execution time, the CC estimator is faster than the others. From the point of view of stability, the GMC estimator shows few oscillations, even on noisy data. After the evaluation of these estimators, we proposed to apply the most stable estimator to corner detection. The obtained results show a very strong robustness to noise and outperform other recent corner detectors. In future work we will perform a global comparison of our corner detector with other existing works.
References
1. Nguyen, T., Debled-Rennesson, I.: Curvature estimation in noisy curves. In: Kropatsch, W.G., Kampel, M., Hanbury, A. (eds.) CAIP 2007. LNCS, vol. 4673, pp. 474–481. Springer, Heidelberg (2007)
2. Kerautret, B., Lachaud, J.: Robust estimation of curvature along digital contours with global optimization. In: Coeurjolly, D., Sivignon, I., Tougne, L., Dupont, F. (eds.) DGCI 2008. LNCS, vol. 4992, pp. 334–345. Springer, Heidelberg (2008)
3. Malgouyres, R., Brunet, F., Fourey, S.: Binomial convolutions and derivatives estimations from noisy discretizations. In: Coeurjolly, D., Sivignon, I., Tougne, L., Dupont, F. (eds.) DGCI 2008. LNCS, vol. 4992, pp. 370–379. Springer, Heidelberg (2008)
4. Chang, X., Gao, L., Li, Y.: Corner detection based on morphological disk element. In: Proceedings of the 2007 American Control Conference, pp. 1994–1999. IEEE, Los Alamitos (2007)
5. Yang, L., Suen, C.Y., Bui, T.D., Zhang, P.: Discrimination of similar handwritten numerals based on invariant curvature features. Pat. Rec. 38, 947–963 (2005)
6. Coeurjolly, D., Miguet, S., Tougne, L.: Discrete curvature based on osculating circle estimation. In: Arcelli, C., Cordella, L.P., Sanniti di Baja, G. (eds.) IWVF 2001. LNCS, vol. 2059, pp. 302–303. Springer, Heidelberg (2001)
7. Debled-Rennesson, I., Feschet, F., Rouyer-Degli, J.: Optimal blurred segments decomposition of noisy shapes in linear times. Computers and Graphics (2006)
8. Buzer, L.: An elementary algorithm for digital line recognition in the general case. In: Andrès, E., Damiand, G., Lienhardt, P. (eds.) DGCI 2005. LNCS, vol. 3429, pp. 299–310. Springer, Heidelberg (2005)
9. Feschet, F., Tougne, L.: Optimal time computation of the tangent of a discrete curve: Application to the curvature. In: Bertrand, G., Couprie, M., Perroton, L. (eds.) DGCI 1999. LNCS, vol. 1568, pp. 31–40. Springer, Heidelberg (1999)
10. Kanungo, T.: Document Degradation Models and a Methodology for Degradation Model Validation. PhD thesis, University of Washington (1996)
11. Chetverikov, D., Szabo, Z.: (1999), http://visual.ipan.sztaki.hu/corner/corner click.html
12. Soille, P.: Morphological Image Analysis: Principles and Applications. Springer, Berlin (2003)
Computing and Visualizing Constant-Curvature Metrics on Hyperbolic 3-Manifolds with Boundaries
Xiaotian Yin¹, Miao Jin², Feng Luo³, and Xianfeng David Gu¹
¹ Computer Science Department, State University of New York at Stony Brook, {xyin,gu}@cs.sunysb.edu
² Center for Advanced Computer Studies, University of Louisiana at Lafayette, [email protected]
³ Department of Mathematics, Rutgers University, [email protected]
Abstract. Almost all three dimensional manifolds admit canonical metrics with constant sectional curvature. In this paper we propose a new algorithm pipeline to compute such canonical metrics for hyperbolic 3-manifolds with high genus boundary surfaces. The computation is based on the discrete curvature flow for 3-manifolds, where the metric is deformed in an angle-preserving fashion until the curvature becomes uniform inside the volume and vanishes on the boundary. We also propose algorithms to visualize the canonical metric by realizing the volume in the hyperbolic space H³, both in a single period and in multiple periods. The proposed methods could not only facilitate the theoretical study of 3-manifold topology and geometry using computers, but also have great potential in volumetric parameterization, 3D shape comparison, volumetric biomedical image analysis, etc.
1
Introduction
Many practical problems in computer graphics and geometric modeling can be reduced to computing special metrics (i.e. edge lengths) on a given geometric object represented as a mesh. One of the most widely considered problems is parameterization, which maps a given surface or volume mesh to a parameter domain with a regular and simple shape; in other words, it assigns a new set of edge lengths to the original mesh. Furthermore, many applications need a metric with special curvature properties. Intuitively, curvature measures how the space is locally distorted with respect to the Euclidean space. In many applications it is natural to ask for a metric with uniformly distributed curvature. We call it a constant curvature metric. Actually, the existence of such canonical metrics on surfaces is justified
by the Uniformization Theorem [1]. Similarly, most 3-manifolds (like those we are studying) admit a constant curvature metric. In the surface parameterization literature, there are many works on computing metrics with special curvature distributions; the readers are referred to the parameterization survey papers [2] and [3] for general information. Among all those works, there are several which are able to compute the constant curvature metric for general surfaces, such as [4] and [5]. For volumes (or 3-manifolds), on the other hand, there is less work on computing particular metrics in the computer graphics literature. Most of such work is based on parameterization. To name a few, Wang et al. [6] parameterize brain volumes with a solid ball via 3D harmonic mapping. Li et al. [7] compute 3D harmonic deformations between volumes with the same topology. Martin et al. [8] use harmonic functions to parameterize femur volumes, which are topological balls with special shapes, by extending the boundary surface parameterization into the inside of the volume. All these methods assign a metric to the volume in some sense, but none of them targets uniform curvature distributions. In this work, we investigate the computation of constant curvature metrics for 3-manifolds. In particular, we focus our attention on a special class of volumes, called hyperbolic 3-manifolds, whose boundaries are surfaces of genus greater than one, such as Thurston's knotted Y shape in figure 4. These volumes can be used to model some human soft tissues or biochemical structures. The computation is based on discrete 3D curvature flows, where the metric evolves driven by the curvature. Luo [9] laid down the theoretical foundations of discrete curvature flows for hyperbolic 3-manifolds with boundaries. Our computational method is directly based on that work. The contribution of this paper can be briefly outlined as follows.
1. We propose a discrete computational method based on curvature flow for hyperbolic 3-manifolds with boundaries. The resulting canonical metric induces constant curvature inside the volume and zero curvature on the boundary (i.e. geodesic boundary).
2. We propose an algorithm to realize and visualize such canonical metrics in the hyperbolic space H³. Both the single period representation (i.e. fundamental domain) and the multiple period representation (i.e. universal covering space) can be computed.
3. We analyze the convergence and stability of the proposed algorithms through experiments, and pinpoint several potential applications in both science and engineering fields.
In the rest of the paper, we introduce some underlying concepts and theories in section 2. Then the algorithm pipeline is presented in section 3, together with a detailed discussion of each step. In section 4 we show the convergence and performance of the proposed methods, and pinpoint several potential applications. Finally we conclude the paper in section 5.
2
Concepts and Theories
In this section, we provide some background knowledge that is necessary to understand the algorithm pipeline. Only the most related concepts and theories are presented here; for further details, the readers are referred to textbooks like [10], [1] and [11]. Hyperbolic Tetrahedron and Truncated Hyperbolic Tetrahedron. A 3-manifold can be triangulated using tetrahedra. If one assigns a hyperbolic metric to a tetrahedron, it is called a hyperbolic tetrahedron, such as the one [v1 v2 v3 v4] shown in figure 1a, where each face f_i is a hyperbolic plane, and each edge e_{ij} is a hyperbolic line segment. If a 3-manifold has boundaries, it can also be tessellated using truncated tetrahedra. In this case, the 3-manifold is called hyperideally triangulated. Again, a truncated tetrahedron assigned with a hyperbolic metric is called a truncated hyperbolic tetrahedron (figure 1b). It can be constructed from a hyperbolic tetrahedron by cutting off each vertex v_i with a hyperbolic plane perpendicular to the edges e_{ij}, e_{ik}, e_{il}. Each truncated hyperbolic tetrahedron has four right-angled hyperbolic hexagon faces (figure 1d), which can be glued to other hexagon faces and therefore become interior faces in a truncated tetrahedron mesh, and also four hyperbolic triangle faces (figure 1c), which compose the boundary surface of the truncated tetrahedron mesh.
Fig. 1. Hyperbolic tetrahedron (a) and truncated hyperbolic tetrahedron (b) with hyperbolic triangle faces (c) and right-angled hexagons (d)
The geometry (or the edge lengths) of the truncated tetrahedron can be determined by the six dihedral angles {θ_1, θ_2, ..., θ_6} (figure 1b). For a hyperbolic triangle (figure 1c) with inner angles {θ_i, θ_j, θ_k}, the edge length x_i (the one against θ_i) can be determined by the hyperbolic cosine law: cosh x_i = (cos θ_i + cos θ_j cos θ_k) / (sin θ_j sin θ_k). For a right-angled hyperbolic hexagon (figure 1d) with three cutting edges {x_i, x_j, x_k}, its internal edge y_i (which is against x_i) can be computed using another hyperbolic cosine law: cosh y_i = (cosh x_i + cosh x_j cosh x_k) / (sinh x_j sinh x_k). By using these hyperbolic cosine laws, all the edge lengths can be computed from the set of dihedral angles. On the other hand, the dihedral angles can also be uniquely determined by the edge lengths using the inverse cosine laws.
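These two cosine laws are straightforward to implement; the sketch below is a direct transcription and assumes that the input angles (respectively lengths) describe a valid hyperbolic triangle (respectively right-angled hexagon), so that the argument of arccosh is at least 1.

```python
import numpy as np

def triangle_edge_length(theta_i, theta_j, theta_k):
    """Hyperbolic cosine law: length of the edge x_i opposite the angle theta_i."""
    c = (np.cos(theta_i) + np.cos(theta_j) * np.cos(theta_k)) / (
        np.sin(theta_j) * np.sin(theta_k))
    return np.arccosh(c)

def hexagon_edge_length(x_i, x_j, x_k):
    """Right-angled hexagon cosine law: internal edge y_i opposite the cutting edge x_i."""
    c = (np.cosh(x_i) + np.cosh(x_j) * np.cosh(x_k)) / (
        np.sinh(x_j) * np.sinh(x_k))
    return np.arccosh(c)
```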
Discrete Curvature. For surfaces represented as triangle meshes, the discrete curvature is represented as the angle deficit around vertices. For a vertex v_i, the curvature K(v_i) equals 2π − Σ_{jk} α_i^{jk} for an internal vertex or π − Σ_{jk} α_i^{jk} for a boundary vertex, where {α_i^{jk}} are the surrounding corner angles (figure 2a). For 3-manifolds represented as tetrahedron meshes, there are two types of discrete curvature representations: vertex curvature and edge curvature. For a tetrahedron [v_i, v_j, v_k, v_l], let {α_i^{jkl}, α_j^{kli}, α_k^{lij}, α_l^{ijk}} denote the solid angles at the vertices (figure 2b), and let β_{ij}^{kl} be the dihedral angle on edge e_{ij} (figure 2c). The vertex curvature K(v_i) is also defined as an angle deficit: 4π − Σ_{jkl} α_i^{jkl} for an interior vertex or 2π − Σ_{jkl} α_i^{jkl} for a boundary vertex. The edge curvature K(e_{ij}) is defined as 2π − Σ_{kl} β_{ij}^{kl} if e_{ij} is an interior edge, or π − Σ_{kl} β_{ij}^{kl} otherwise. The two types of discrete curvatures are closely related. Actually, the vertex curvature can be completely determined by the edge curvature: K(v_i) = Σ_j K(e_{ij}).
Fig. 2. Discrete curvatures: vertex curvature (a) for 2-manifolds, vertex curvature (b) and edge curvature (c) for 3-manifolds
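As an illustration of the edge curvature just defined, the following sketch accumulates dihedral angles around each edge; the mesh data structure is a placeholder assumption, not the representation used by the authors.

```python
import numpy as np

def edge_curvatures(dihedrals_per_edge, boundary_edges):
    """K(e_ij) = 2*pi - sum of dihedral angles around e_ij for interior edges,
    and pi - sum for boundary edges.

    dihedrals_per_edge: dict mapping an edge key (i, j), i < j, to the list of
        dihedral angles beta_ij^kl contributed by the tetrahedra containing it.
    boundary_edges: set of edge keys lying on the boundary surface.
    """
    K = {}
    for edge, angles in dihedrals_per_edge.items():
        target = np.pi if edge in boundary_edges else 2.0 * np.pi
        K[edge] = target - float(np.sum(angles))
    return K
```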
Discrete Curvature Flow. Curvature flow refers to a flow of the metric driven by the curvature. Given a hyperbolic tetrahedron with edge lengths l_{ij} and dihedral angles θ_{ij}, the volume V of the tetrahedron is a function of the dihedral angles, V = V(θ_12, θ_13, θ_14, θ_23, θ_24, θ_34), and the Schlaefli formula can be expressed as ∂V/∂θ_{ij} = −l_{ij}/2, and thus dV = −(1/2) Σ_{ij} l_{ij} dθ_{ij}. It can further be proved that the volume of a hyperbolic truncated tetrahedron is a strictly concave function of the dihedral angles ([9]). From this volume function, Luo ([9]) defines another energy function, which is a strictly convex function of the edge lengths, and the discrete curvature flow is defined as the negative gradient flow of the energy function. The curvature flow for 3-manifolds has some special properties. One of the most prominent properties is the so-called Mostow rigidity [12]. It states that the geometry of a finite volume hyperbolic manifold (for dimension greater than two) is determined by the fundamental group. Intuitively, different 3-manifolds have equivalent constant curvature metrics if they have the same topology. As a consequence, the tessellation will not affect the computational results of the discrete curvature flow on 3-manifolds. Utilizing this special property, we are allowed to reduce the computational complexity of 3-manifold curvature flow by using the
simplest tessellation for a given 3-manifold, as we will see in the algorithm section 3.1. For example, Thurston's knotted Y shape can be represented either as a high resolution tetrahedral mesh (figure 4d) or as a mesh with only 2 truncated tetrahedra (figure 3a), and the resulting canonical metrics are identical. Hyperbolic Space Model. There are several realizations of the 2D and 3D hyperbolic space. In this work, we use the upper half plane model for the 2D hyperbolic space H² = {(x, y) ∈ R² | y > 0}, with the Riemannian metric ds² = (dx² + dy²)/y². In this model, hyperbolic lines are circular arcs and half lines orthogonal to the x-axis. The rigid motions are given by the so-called Möbius transformations, z ↦ (az + b)/(cz + d), where ad − bc = 1 and a, b, c, d ∈ R. Similarly, we use the upper half space model to realize the 3D hyperbolic space H³ = {(x, y, z) ∈ R³ | z > 0}, with Riemannian metric ds² = (dx² + dy² + dz²)/z². In H³, the hyperbolic planes are hemispheres or vertical planes whose equators are on the xy-plane. The xy-plane represents all the infinity points of H³. A rigid motion in H³ is determined by its restriction to the xy-plane, which is a Möbius transformation on the plane, of the form (az + b)/(cz + d), where ad − bc = 1 and a, b, c, d ∈ C.
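A Möbius transformation is a one-liner to apply; the snippet below is a small illustration of the upper half plane model (the specific coefficients are arbitrary example values).

```python
def mobius(z, a, b, c, d):
    """Apply z -> (a*z + b) / (c*z + d). Real coefficients with a*d - b*c = 1
    give an isometry of H^2; complex coefficients describe the action of an
    isometry of H^3 on its boundary plane (z = x + i*y)."""
    return (a * z + b) / (c * z + d)

# Example: a hyperbolic "translation" fixing 0 and infinity.
z = 0.3 + 1.0j
w = mobius(z, a=2.0, b=0.0, c=0.0, d=0.5)   # a*d - b*c = 1
assert w.imag > 0                            # the upper half plane is preserved
```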
3
Algorithms
Given a 3-manifold’s boundary surface, represented as a triangular mesh, our algorithm pipeline will go through the following steps: 1. Tessellate the volume with tetrahedra, and simplify the tessellation to minimum number of truncated tetrahedra; (section 3.1) 2. Compute the canonical hyperbolic metric using discrete 3D curvature flow; (section 3.2) 3. Realize and visualize the volume with the computed canonical metric in hyperbolic space H3 . (section 3.3) 3.1
3.1 Tessellation and Simplification
Given the boundary surfaces of a 3-manifold, we use the volumetric Delaunay triangulation algorithm introduced in [13] to tessellate the interior of the volume with tetrahedra. Then the tessellation is simplified by the following steps. Denote the boundary surface of a 3-manifold M as ∂M = {S_1, S_2, ..., S_n}. First, create a cone vertex v_i for each boundary component S_i; for each triangle face f_j ∈ S_i, create a new tetrahedron T_j^i, whose vertex set consists of v_i and the vertices of f_j. In this way, M is augmented with a set of cone vertices and a set of new tetrahedra. Second, apply the edge collapsing operation (shown in figure 3c-d) iteratively, until all the vertices are removed except for the cone vertices {v_1, v_2, ..., v_n} created in the previous step. Denote the resulting tetrahedral mesh as M̃. Finally, for each tetrahedron T̃_i ∈ M̃, trim off its vertices (which are cone vertices) by the original boundary surface, and thus make it a truncated tetrahedron, denoted as T_i.
Fig. 3. Simplified tessellation of the Y shape with only two truncated tetrahedra (a) glued together according to the pattern (b). The simplification is carried out through edge collapsing, which turns (c) into (d) by identifying v0 and v6 .
As an example, the Y shape can be simplified to two truncated tetrahedra {T_1, T_2} (figure 3a), together with the gluing pattern (figure 3b) among their hexagon faces. For each T_i, let {A_i, B_i, C_i, D_i} be its four hexagon faces and let {a_i, b_i, c_i, d_i} be the truncated vertices. Its gluing pattern is given as follows, where the arrow → means to identify the former with the latter:
A1 → B2: {b1 → c2, d1 → a2, c1 → d2}
B1 → A2: {c1 → b2, d1 → c2, a1 → d2}
C1 → C2: {a1 → a2, d1 → b2, b1 → d2}
D1 → D2: {a1 → a2, b1 → c2, c1 → b2}
Please note that the edge collapsing operation does not change the fundamental group of the volume, and the simplified 3-manifold is topologically equivalent to the original 3-manifold. From the discussion in section 2, this guarantees that the later computation can be carried out on the simplified tessellation instead of the original one.
3.2 Metric Computation via Curvature Flow
Given a hyperideally triangulated 3-manifold, define the discrete metric function as x : E → R⁺, where E is the set of edges in the triangulation. The discrete curvature function can be defined as K : E → R. Given an edge e_{ij} ∈ E, its edge length and edge curvature are denoted as x_{ij} and K_{ij} respectively. From the discussion in section 2, the edge curvature is determined by the dihedral angles, which in turn are functions of the edge lengths. Therefore, the set {K_{ij}} can be calculated from {x_{ij}}. The discrete curvature flow is then defined as

dx_{ij}/dt = K_{ij},    (1)

From this differential equation, the deformation of the metric is driven by the edge curvature, and the whole process is like a heat diffusion. Any numerical
method for solving the discrete heat diffusion problem can be applied to solve the curvature flow equation. In practice, we set the initial edge lengths to x_{ij} = 1. During the flow, the total edge curvature Σ_{ij} K_{ij}² is strictly decreasing. When the flow reaches the equilibrium state, both the edge curvature and the vertex curvature vanish. The boundary surface becomes a hyperbolic geodesic, while all the curvature (which is negative) is uniformly distributed within each truncated hyperbolic tetrahedron. Due to the fact that the total curvature is negative, the resulting metric is a hyperbolic one.
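In code, the flow can be driven by a simple explicit Euler iteration; the routine that recomputes the edge curvatures from the current edge lengths (via the cosine laws and dihedral angles of Section 2) is left as a placeholder.

```python
def run_curvature_flow(edge_lengths, curvature_fn, dt=0.05, tol=1e-8, max_iter=100000):
    """Explicit Euler integration of d x_ij / dt = K_ij.

    edge_lengths: dict edge -> current length x_ij (initialised to 1.0).
    curvature_fn: placeholder callable returning {edge: K_ij} for the current metric.
    Stops when the total squared edge curvature falls below tol.
    """
    for _ in range(max_iter):
        K = curvature_fn(edge_lengths)
        if sum(k * k for k in K.values()) < tol:
            break
        edge_lengths = {e: x + dt * K[e] for e, x in edge_lengths.items()}
    return edge_lengths
```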
3.3 Realization and Visualization
Once the canonical hyperbolic metric is computed, one is ready to realize it in the hyperbolic space H³. There are two ways to realize the metric. The first one is a single period representation (figure 6), which is a union of multiple truncated hyperbolic tetrahedra. The second is a multiple period representation (figure 7), which is composed of multiple copies of the single period representation glued together nicely. As explained in section 2, we will use the upper half space and the upper half plane as the models for H³ and H² respectively. Realization of a Single Truncated Hyperbolic Tetrahedron. Given the edge lengths of a truncated hyperbolic tetrahedron, its dihedral angles are uniquely determined and the truncated tetrahedron can be realized in H³ uniquely up to rigid motions. Actually, its embedding is determined by the positions of its four right-angled hexagon faces f_1, f_2, f_3, f_4 and those of its four triangle faces v_1, v_2, v_3, v_4. Each of these faces is now a hyperbolic plane (i.e. a hemisphere, see figure 5b), separating H³ into two half spaces. By choosing the right half space for each face and taking the intersection of all these half spaces, one gets the realization of the truncated hyperbolic tetrahedron (figure 5c). To compute the position of each hyperbolic plane, let us consider its intersection with the infinity plane z = 0, which is a Euclidean circle (figure 5a). Here we reuse the symbols f_i and v_j to represent the intersection circles of the hyperbolic planes f_i and v_j respectively. As shown in figure 5a, all the circles can be computed explicitly, such that circle f_i and circle f_j intersect at the given dihedral angle θ_k, while circle v_i is orthogonal to circles f_j, f_k, f_l. In order to remove the ambiguity caused by rigid motions, we fix circle f_1 to be the line y = 0, f_2 to be the line y = tan θ_1 x, and normalize the circle f_3 to have radius 1. Once the intersection circles are settled, we can directly construct hemispheres (i.e. hyperbolic planes) whose equators are those circles. By choosing the right half space for each hemisphere and using CSG operations to compute the intersection of these half spaces, we get a visualization of a single truncated hyperbolic tetrahedron as shown in figure 5c. Realization of a Single Period. A single period representation of the whole 3-manifold with the canonical hyperbolic metric is a union of all its constituting truncated hyperbolic tetrahedra. It can be constructed as follows. First,
Fig. 4. An example of hyperbolic 3-manifold, Thurston’s knotted Y shape, constructed from a solid ball with three entangled tunnels removed. (a) and (b) show the boundary surface, (c) and (d) show the internal tessellation with tetrahedra. v4 f2 v3
v1 θ2 θ2
θ1
f4
θ4
f3 θ3
θ6 θ6 θ4 θ5
θ5
f1
θ3 v2
(a)
(b)
(c)
Fig. 5. Realization of a truncated hyperbolic tetrahedron (c) in H³ by taking CSG among hemispheres (b) based on their intersection circles with the infinity plane z = 0 (a)
Fig. 6. Fundamental domain of the Y shape: the single period realization of the canonical metric in H3 from various views
Fig. 7. Universal Covering Space of the Y shape: the multiple period realization of the canonical metric in H3 from various views
realize one truncated hyperbolic tetrahedron T_0 as explained above. Then, pick another not-yet-embedded truncated hyperbolic tetrahedron T_1, which is neighboring T_0 through a hexagon face f_1 ∈ T_1 paired with f_0 ∈ T_0. Compute a Möbius transformation in H³ which rigidly moves T_1 to such a position that f_1 ∈ T_1 can be perfectly glued to f_0 ∈ T_0. Now we get a partially embedded volume. Repeat the process of picking, moving and gluing a neighboring truncated hyperbolic tetrahedron, until the whole volume is embedded. The above algorithm is essentially a breadth-first search (BFS) in the given 3-manifold; it results in a tree spanning all the truncated hyperbolic tetrahedra in the volume. Due to the nature of the constant curvature hyperbolic metric, such gluing (or spanning) operations can be carried out seamlessly, until finally all the truncated tetrahedra are glued together nicely into a simply connected volume, which is a topological ball. Such a single period representation is usually called the fundamental domain of the original volume (see [11]). Figure 6 visualizes the fundamental domain of the Y shape embedded in H³. Realization of Multiple Periods. A multiple period representation of the canonical hyperbolic metric is a union of multiple copies of the fundamental domain, and is usually called the universal covering space (UCS) of the original 3-manifold (see [11]), which is also a simply connected topological ball. Similar to the realization of the fundamental domain, the UCS representation can also be constructed through a sequence of gluing operations; the difference is that the primitive construction block is the embedded fundamental domain rather than the truncated hyperbolic tetrahedron. The gluing operation here can be explained as follows. Recall that in the algorithm for realizing a fundamental domain, any two truncated hyperbolic tetrahedra are glued through a pair of hexagon faces. After the algorithm is complete, some hexagon faces will be left open, without being glued to any neighboring face. This is natural, because otherwise the fundamental domain would not be simply connected. All the open hexagon faces are grouped into several connected components; each component constitutes a gluable face of the whole fundamental domain. All the gluable faces can be coupled nicely; that is, for each gluable face, there exists another unique gluable face in the same fundamental domain such that they can be attached to each other nicely. Actually, the fundamental domain can be viewed as the result of cutting the original volume open through the coupled gluable faces. Two copies of the fundamental domain can be glued together through one pair of the gluable faces. Differently from the construction of one fundamental domain, the gluing operation among fundamental domains can be repeated infinitely many times, getting infinitely many copies involved in the UCS. In practice, we can only afford to realize a finite portion of the UCS, such as the one visualized in figure 7.
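The spanning procedure is an ordinary breadth-first search; the following sketch makes the control flow explicit, with the geometric kernel (initial embedding, neighbor enumeration and the gluing Möbius motion) abstracted away as placeholder callables.

```python
from collections import deque

def realize_fundamental_domain(tet_ids, neighbors, embed_first, glue):
    """Breadth-first embedding of the truncated hyperbolic tetrahedra.

    neighbors(t): yields (t2, face_pair) for tetrahedra sharing a hexagon face with t.
    embed_first(t): realizes one tetrahedron in H^3 (placeholder).
    glue(embedded_t, t2, face_pair): returns the embedding of t2 obtained by the
        Moebius motion that attaches it to the already embedded t (placeholder).
    """
    placed = {tet_ids[0]: embed_first(tet_ids[0])}
    queue = deque([tet_ids[0]])
    while queue:
        t = queue.popleft()
        for t2, face_pair in neighbors(t):
            if t2 not in placed:
                placed[t2] = glue(placed[t], t2, face_pair)
                queue.append(t2)
    return placed
```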
4
Experiments and Applications
We tested our algorithms on about 120 hyperbolic 3-manifolds, including Thurston's Y shape, tessellated with truncated hyperbolic tetrahedra. All the
experiments converge, and the resulting numerical metrics are perfectly consistent with the results computed using algebraic methods, with a difference of less than 1e−8. From the experiments, we also notice that the stability of the algorithms depends on two issues: the initial metric and the tessellation. Due to numerical errors, for certain initial assignments of edge lengths, the angle computed using the hyperbolic cosine law is too close to zero and thus leads to instability. In our experiments, an initial metric with all edge lengths equal to 1.0 always led to the desired solution. On the other hand, the curvature flow may fail on certain tetrahedralizations of the given 3-manifolds; this is probably because the critical point of the volume energy is too close to the boundary of the metric space. It remains an open problem to find the conditions under which the tessellation will guarantee convergence to the desired metric. Regarding applications, the canonical metrics computed on surfaces have enabled a huge number of applications in geometric modeling, computer graphics, computer vision, visualization and biomedical data analysis. We believe that our methods for 3-manifolds will also play fundamental roles in many science and engineering fields in the future. Here we only address some of the potential applications in very brief words, while leaving the detailed discussion of particular applications to future exploration, which is far beyond the scope of this paper. Firstly, the theoretical study of 3-manifold topology and geometry has been a hot area in the mathematical community for many years; but most 3-manifolds cannot be realized in the Euclidean space and are even hard to imagine. We hope that with our work researchers could gain more insight into the structure of 3-manifolds and therefore facilitate their study, at least in a discrete sense. Secondly, the realization of the canonical metric actually gives a canonical domain for the original 3-manifolds. This domain might be used for the purpose of volumetric texture mapping, volume discretization and remeshing, volume spline construction, volume registration and comparison, etc., just like the canonical parameter domains of surfaces help tackle the 2D problems. Thirdly, we noticed that the volumes we studied here could be good models for some volumetric biomedical data, such as some soft tissues in the human body with the penetrating vessels removed, or the complementary space of some protein structures. It will be a very interesting research topic to apply our computational methods to these biomedical applications.
5
Conclusion
In this work we proposed a new computational method to compute constant curvature metrics for hyperbolic 3-manifolds with high genus boundary surfaces. The algorithm is directly based on discrete volumetric curvature flow. Experiments show the convergence and stability of the algorithm. In order to visualize the metric, we proposed algorithms to realize the metric in the upper half space model of the hyperbolic space. The whole pipeline can not only
facilitate the theoretical study of 3-manifold topology and geometry, but also has potential applications in many engineering fields, like volume parameterization and biomedical data analysis. How to apply our methods to those specific applications would be a very interesting future research direction. Meanwhile, it is also a challenging research topic to extend the current framework to other types of 3-manifolds.
References
1. Petersen, P.: Riemannian Geometry. Springer, Heidelberg (1997)
2. Sheffer, A., Praun, E., Rose, K.: Mesh parameterization methods and their applications. Foundations and Trends in Computer Graphics and Vision 2 (2006)
3. Floater, M.S., Hormann, K.: Surface parameterization: a tutorial and survey. In: Advances in Multiresolution for Geometric Modelling, pp. 157–186. Springer, Heidelberg (2005)
4. Jin, M., Kim, J., Luo, F., Gu, X.: Discrete surface Ricci flow. IEEE Transactions on Visualization and Computer Graphics (2008)
5. Springborn, B., Schröder, P., Pinkall, U.: Conformal equivalence of triangle meshes. In: SIGGRAPH 2008 (2008)
6. Wang, Y., Gu, X., Thompson, P.M., Yau, S.T.: 3D harmonic mapping and tetrahedral meshing of brain imaging data. In: Proceeding of Medical Imaging Computing and Computer Assisted Intervention (MICCAI), St. Malo, France (2004)
7. Li, X., Guo, X., Wang, H., He, Y., Gu, X., Qin, H.: Harmonic volumetric mapping for solid modeling applications. In: Proceeding of Symposium on Solid and Physical Modeling, pp. 109–120 (2007)
8. Martin, T., Cohen, E., Kirby, M.: Volumetric parameterization and trivariate B-spline fitting using harmonic functions. In: Proceeding of Symposium on Solid and Physical Modeling (2008)
9. Luo, F.: A combinatorial curvature flow for compact 3-manifolds with boundary. Electron. Res. Announc. Amer. Math. Soc. 11, 12–20 (2005)
10. Guggenheimer, H.W.: Differential Geometry. Dover Publications (1977)
11. Gu, X., Yau, S.T.: Computational Conformal Geometry. Higher Education Press, China (2007)
12. Mostow, G.D.: Quasi-conformal mappings in n-space and the rigidity of the hyperbolic space forms. Publ. Math. IHES 34, 53–104 (1968)
13. Si, H.: TetGen: A quality tetrahedral mesh generator and three-dimensional Delaunay triangulator, http://tetgen.berlios.de/
Iris Recognition: A Method to Segment Visible Wavelength Iris Images Acquired On-the-Move and At-a-Distance
Hugo Proença
Dep. of Computer Science, IT - Networks and Multimedia Group, Universidade da Beira Interior, Covilhã, Portugal
[email protected]
Abstract. The dramatic growth in practical applications for iris biometrics has been accompanied by many important developments in the underlying algorithms and techniques. Among others, one of the most active research areas concerns the development of iris recognition systems less constrained to users, either by increasing the image acquisition distances or by relaxing the required lighting conditions. The main contribution of this paper is a process suitable for the automatic segmentation of iris images captured at the visible wavelength, on-the-move and within a large range of image acquisition distances (between 4 and 8 meters). Our experiments were performed on images of the UBIRIS.v2 database and show the robustness of the proposed method in handling the types of non-ideal images resulting from the aforementioned less constrained image acquisition conditions.
1 Introduction

Being an internal organ that is naturally protected, yet visible from the exterior and supporting contactless data acquisition, the human iris has, together with the face, the potential of being covertly imaged. Several issues remain before deployable covert iris recognition systems can be achieved and, obviously, these types of systems will constitute a trade-off between data acquisition constraints and recognition accuracy. This area has attracted growing interest in the research community and has been the scope of a large number of recent publications. It is expected that far less constrained image acquisition processes increase the heterogeneity of the captured images, owing to varying lighting conditions, subjects' poses, perspectives and movements. In this context, the image segmentation stage plays a major role, as it is the stage that must most directly handle this heterogeneity. Also, as it is one of the earliest stages of the complete recognition process, it acts as the basis for all further stages, and any failure will compromise the success of the whole process. Figure 1 compares images that typically result from constrained image acquisition processes in the near infra-red wavelength (figure 1a) with images acquired under less constrained imaging conditions, at-a-distance and in the visible wavelength (figure 1b). Apart from evident differences in image homogeneity, several types of data obstructing portions of the iris texture can be observed in the visible wavelength image, which increase the challenge of performing accurate biometric recognition.
(a) Near infra-red image, acquired under highly constrained conditions (ICE database [1]). (b) Visible wavelength image, acquired at-a-distance and on-the-move (UBIRIS.v2 database [2]).
Fig. 1. Illustration of the typical differences between close-up iris images acquired under highly constrained conditions in the near infra-red wavelength (figure 1a) and images acquired in the visible wavelength, under less constrained conditions (figure 1b)
Also, less constrained acquisition protocols lead to the appearance of non-frontal, defocused or motion-blurred images. This work focuses on the segmentation of visible wavelength close-up iris images, captured at-a-distance and on-the-move, under varying lighting conditions and with minimal image acquisition constraints. It can be divided into three parts: first, we overview the most significant iris segmentation methods, especially those recently published, and establish some common and distinguishing characteristics between them. Then, we empirically describe some of the reasons that make those methods less suitable for the type of images we aim to deal with. Finally, we give a new iris segmentation method based on the neural pattern recognition paradigm that, as our experiments indicate, is suitable for dealing with the aforementioned type of images. The remainder of this paper is organized as follows: Section 2 briefly summarizes the most cited iris segmentation methods, emphasizing those most recently published. A detailed description of the proposed method is given in Section 3. Section 4 reports our experiments and discusses the results and, finally, Section 5 gives the conclusions and points out some directions for further work.
2 Iris Segmentation

The analysis of the most relevant iris segmentation proposals allowed us to identify two major strategies: using a rigid or deformable iris template, or using the iris boundary. In most cases, the boundary approach is very similar to the early proposal of Wildes [3]: it begins with the construction of an edge map, followed by the application of some geometric form-fitting algorithm. The template-based strategies are in general more specific, although they share the maximization of some energy model that localizes both iris borders, as originally proposed by Daugman [4]. These methods were designed to operate on noise-free close-up iris images captured in the near infra-red wavelength and, specifically Daugman's integro-differential
operator, have proved their effectiveness in multiple deployed systems that operate under constrained imaging conditions. The need to increase segmentation robustness to several types of non-ideal images has motivated a large number of proposals in the last few years. However, the large majority of these methods operate on near infra-red images, which typically have higher contrast between the pupil and the iris regions; this induces the usual option of starting with the segmentation of the pupillary border. In contrast, visible wavelength images usually have less contrast between the pupil and the iris, which explains the inversion in the order of the borders' segmentation. Regarding the base methodologies, various innovations were proposed, such as the use of active contour models, either geodesic (e.g., [5]), based on Fourier series (e.g., [6]) or on the snakes model (e.g., [7]). Here, the previous detection of the eye is a requirement to properly initialize the contours, and the heavy computational requirements can also be regarded as a weak point. Also, modifications to known form-fitting processes were proposed, essentially to deal with off-angle images (e.g., [8] and [9]) and to improve performance (e.g., [10] and [11]). The detection of non-iris data that obstructs the discriminating information has motivated the use of parabolic, elliptical and circular models (e.g., [12] and [11]) and the modal analysis of histograms [6]. In this context, several authors constrain the success of their methods to image orthogonality, to the non-existence of significant iris obstructions, or to the appearance of corneal reflections in specific image regions.
3 Proposed Method

Figure 2 gives the block diagram of the proposed segmentation process, which can be divided into two major stages: sclera detection and iris segmentation. Having found that the sclera region usually remains the most distinguishable region under varying lighting conditions, we propose a feature extraction stage that will provide enough discriminant information to localize the sclera.
Fig. 2. Block diagram of the proposed iris segmentation method
Then, we take advantage of the mandatory adjacency between the sclera and the iris and, together with a new feature set extracted from the original image, perform the detection of the regions that correspond to the noise-free iris pixels. It should be stressed that this process comprises three tasks that are typically treated separately in the specialized literature: iris detection, segmentation and detection of noisy regions. As shown in the experiments section, starting from a relatively small set of manually classified images that constitute the learning set, it is possible to use machine learning methods that robustly and quickly discriminate between the noise-free iris regions (used in the subsequent recognition stages) and the remaining data. In the following, we detail each feature extraction stage and the classification models used.

3.1 Feature Extraction

Regarding feature extraction, we had a primary concern: to evaluate exclusively features that can be computed in a single image scan, which is crucial to enable the application of the method in real-time settings. Previously, Viola and Jones [13] proposed a set of simple features, reminiscent of Haar basis functions, and used an intermediate image representation to compute them in a single image scan. Based on their definition, we propose the extraction of a set of central moments within small image regions, based on the pixel intensities in different color spaces. For a given image I, Viola and Jones defined an integral image II as

II(x, y) = \sum_{x'=1}^{x} \sum_{y'=1}^{y} I(x', y')   (1)

where x and y denote respectively the image column and row. They also proposed a pair of recurrences to compute the integral image in a single image scan:

s(x, y) = s(x, y - 1) + I(x, y)   (2)
II(x, y) = II(x - 1, y) + s(x, y)   (3)

Based on the concept of the integral image, the average intensity μ of the pixels within any rectangular region R_i, delimited by its upper-left (x_1, y_1) and bottom-right (x_2, y_2) corner coordinates, can be obtained by accessing only four array references. Let T_i = (x_2 - x_1) \times (y_2 - y_1) be the total number of pixels of R_i. Then

\mu(R_i) = \frac{1}{T_i} \sum_{x=x_1}^{x_2} \sum_{y=y_1}^{y_2} I(x, y) = \frac{1}{T_i} \left[ II(x_2, y_2) + II(x_1, y_1) - II(x_2, y_1) - II(x_1, y_2) \right]   (4)
(a) Hue component. (b) Chrominance component. (c) Red chroma component.
Fig. 3. Illustration of the discriminating capacity of the color components used in the detection of the sclera regions. It is evident that pixels belonging to the sclera have respectively higher (figures 3a and 3b) and lower (figure 3c) intensity values than the remaining pixels. We also observed that this separability tends to remain stable, even in highly heterogeneous images such as those that constitute the scope of this work.
Similarly, the variance of the intensity of the pixels within R_i, whose square root gives the standard deviation σ, is given by

\sigma^2(R_i) = \frac{1}{T_i} \sum_{x=x_1}^{x_2} \sum_{y=y_1}^{y_2} \left( I(x, y) - \mu \right)^2 = \frac{1}{T_i} \left[ \sum_{x=x_1}^{x_2} \sum_{y=y_1}^{y_2} I(x, y)^2 - 2\,\mu \sum_{x=x_1}^{x_2} \sum_{y=y_1}^{y_2} I(x, y) + T_i\,\mu^2 \right]   (5)

where μ is given by (4). Both sums over R_i are obtained with four array references each: the sum of I(x, y) from the integral image of (3), as in (4), and the sum of I(x, y)^2 from an analogous integral image built from an image in which the intensity values appear squared.
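To illustrate equations (1)-(5), the following sketch shows how an integral image makes these region statistics available in constant time per query. It is a minimal numpy-based example under our own naming conventions, not the author's implementation; the corner handling differs slightly from the index convention printed in (4).

```python
import numpy as np

def integral_images(img):
    """Cumulative sums of intensities and squared intensities (eqs. 1-3)."""
    img = img.astype(np.float64)
    ii = img.cumsum(axis=0).cumsum(axis=1)           # integral image II
    ii2 = (img ** 2).cumsum(axis=0).cumsum(axis=1)   # integral image of I^2
    return ii, ii2

def _region_sum(ii, x1, y1, x2, y2):
    """Sum over the rectangle with inclusive corners (x1, y1)-(x2, y2),
    using four array references (cf. eq. 4)."""
    s = ii[y2, x2]
    if x1 > 0:
        s -= ii[y2, x1 - 1]
    if y1 > 0:
        s -= ii[y1 - 1, x2]
    if x1 > 0 and y1 > 0:
        s += ii[y1 - 1, x1 - 1]
    return s

def region_mean_std(ii, ii2, x1, y1, x2, y2):
    """Mean and standard deviation of a rectangular region (eqs. 4 and 5);
    the variance is computed as E[I^2] - mu^2, algebraically equivalent to (5)."""
    t = (x2 - x1 + 1) * (y2 - y1 + 1)                # number of pixels in the region
    mu = _region_sum(ii, x1, y1, x2, y2) / t
    var = _region_sum(ii2, x1, y1, x2, y2) / t - mu ** 2
    return mu, np.sqrt(max(var, 0.0))
```

With the two integral images precomputed, the μ and σ values at the several radii used below can be gathered in a single pass over the image.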
3.2 Sclera Stage

Based on the described average (μ) and standard deviation (σ) values within image regions, for the detection of the sclera we use a feature set with 11 components. For each image pixel we extract {x, y, h^{μ,σ}_{0,3,7}(x, y), u^{μ,σ}_{0,3,7}(x, y), cr^{μ,σ}_{0,3,7}(x, y)}, where x and y denote the pixel's position, h(., .), u(., .) and cr(., .) denote regions of the hue, chrominance and red chroma image components, and the subscripts denote the radius values of such regions, centered at the respective pixel. The color components used were empirically selected, based on observations of their discriminating capacity between the sclera and the remaining data, as illustrated in figure 3.

3.3 Iris Stage

In the detection of the noise-free iris regions we also used the previously described average (μ) and standard deviation (σ) values, together with the information that came from the previous sclera detection stage, taking advantage of the mandatory adjacency between the iris and the sclera.
(a) Output of the sclera detection stage. (b) sc←(x, y) map. (c) Saturation component. (d) Color chrominance component.
Fig. 4. Features used in the detection of the iris regions
Here, our main concern was to use components of various color spaces that maximize the separability between the sclera and the iris. For each image pixel we extracted {x, y, s^{μ,σ}_{0,3,7}(x, y), u^{μ,σ}_{0,3,7}(x, y), sc_{←,→,↑,↓}(x, y)}, where s(., .) and u(., .) are regions of the saturation and color chrominance image components and the subscripts denote the radius values of such regions, centered at the respective pixel. Again, the color components used were empirically selected, based on observations of their discriminating capacity between the sclera and the iris, as illustrated in figure 4. sc(., .) denotes a feature map that measures the proportion of pixels belonging to the sclera in the left (←), right (→), upper (↑) and bottom (↓) directions with respect to the reference pixel (x, y). These maps are especially relevant to provide information about the relative localization of the iris and the sclera, as illustrated in figure 4b:

sc_←(x, y) = \mu\big( R_{sc}( (1, y-1), (x, y) ) \big)   (6)
sc_→(x, y) = \mu\big( R_{sc}( (x, y-1), (W, y) ) \big)   (7)
sc_↑(x, y) = \mu\big( R_{sc}( (x-1, 1), (x, y) ) \big)   (8)
sc_↓(x, y) = \mu\big( R_{sc}( (x-1, y), (x, H) ) \big)   (9)

where μ(.) is given by (4), R_{sc}((., .), (., .)) denotes the region of the sclera map (figure 4a) delimited respectively by its top-left and bottom-right corner coordinates, and W and H are respectively the image width and height.
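For illustration, a possible way to compute these directional proportion maps from a sclera map, again using an integral image so that each value needs only a few array references (a sketch under our own naming, not the author's code; the sclera map is assumed to be a binary array produced by the first classification stage):

```python
import numpy as np

def directional_sclera_maps(sclera_map):
    """Proportion of sclera pixels to the left/right/above/below each pixel (eqs. 6-9)."""
    m = sclera_map.astype(np.float64)            # 1.0 for sclera pixels, 0.0 otherwise
    h, w = m.shape
    ii = m.cumsum(axis=0).cumsum(axis=1)         # integral image of the sclera map

    def rect_mean(x1, y1, x2, y2):
        x1, y1 = max(x1, 0), max(y1, 0)
        s = ii[y2, x2]
        if x1 > 0:
            s -= ii[y2, x1 - 1]
        if y1 > 0:
            s -= ii[y1 - 1, x2]
        if x1 > 0 and y1 > 0:
            s += ii[y1 - 1, x1 - 1]
        return s / ((x2 - x1 + 1) * (y2 - y1 + 1))

    sc_l = np.zeros_like(m); sc_r = np.zeros_like(m)
    sc_u = np.zeros_like(m); sc_d = np.zeros_like(m)
    for y in range(h):
        for x in range(w):
            sc_l[y, x] = rect_mean(0, y - 1, x, y)        # strip to the left (eq. 6)
            sc_r[y, x] = rect_mean(x, y - 1, w - 1, y)    # strip to the right (eq. 7)
            sc_u[y, x] = rect_mean(x - 1, 0, x, y)        # strip above (eq. 8)
            sc_d[y, x] = rect_mean(x - 1, y, x, h - 1)    # strip below (eq. 9)
    return sc_l, sc_r, sc_u, sc_d
```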
As we describe in the next section, for the purpose of classification we used feed-forward neural networks, which are known to be extremely fast classification models. Thus, apart from the accuracy and robustness factors, when compared with the large majority of the iris segmentation methods described in section 2, the computational performance of the proposed method can be regarded as a significant advantage.

3.4 Supervised Machine Learning Process

In both classification stages of the proposed segmentation method we followed the neural pattern recognition paradigm. This is justified by the networks' ability to discriminate
data in highly complex hyperspaces, providing good generalization capacity, usually without requiring user-parameterized thresholds. We used multilayer perceptron feed-forward networks with one hidden layer, varying in our experiments the number of neurons of the hidden layer. We adopted the convention that the input nodes are not counted as a layer. All the networks used have as many neurons in the input layer as the dimension of the feature space and a single neuron in the output layer, due to the desired binary classification task. Our choice is justified by three fundamental learning theory issues: model capacity, computational complexity and sample complexity. First, regarding model capacity, it is known that this type of network can form arbitrarily complex decision boundaries. Also, they are accepted as high-performance classification models, whose classification speed is not affected by the size of the training data. Finally, regarding sample complexity, the use of the backpropagation learning algorithm provides good generalization capabilities using a relatively small set of images in the learning stage, as detailed in the experiments section.
4 Experiments

This section describes the experiments performed to evaluate the proposed classification method. We detail the data sets used and the process that enabled us to automatically obtain the error rates.

4.1 Data Sets

Our experiments were performed on images of the UBIRIS.v2 database [14], a multi-session iris image database which - singularly - contains data captured in the visible wavelength, at-a-distance and from on-the-move subjects, its images having been acquired under unconstrained and varying lighting conditions. The significantly higher range of distances between the subjects and the imaging framework (from four to eight meters) is one of the major distinguishing points between the UBIRIS.v2 database and others with similar purposes. Through visual inspection, fourteen different factors that deteriorate image quality were detected and classified into one of two categories, local or global, depending on whether they affect only image regions or the complete image.
Fig. 5. Examples of close-up iris images acquired at varying distances (between four and eight meters), from on-the-move subjects, under highly dynamic lighting conditions and without requiring the subjects' cooperation
The first category comprises iris obstructions, reflections and partial images, while the latter comprises poorly focused, motion-blurred, rotated, off-angle, improperly lit and out-of-iris images. A comparison between a close-up iris image with good quality (upper-left image) and images that contain at least one of the aforementioned noise factors is given in figure 5.

4.2 Ground Truth

In the evaluation of the proposed method we used the data sets delivered to the participants of the NICE.I contest [2], which are part of the complete UBIRIS.v2 iris image database. Both the learning and test data sets comprise 500 close-up iris images and 500 corresponding binary maps, made by humans, that distinguish between the noise-free regions of the iris and all the remaining types of data. Images have fixed dimensions of 400 × 300 pixels, giving a total of 60 000 000 pixels per data set. Considering the process of noise-free iris segmentation as a binary classification task, this value allowed us to obtain 95% confidence intervals for the results given in this paper of approximately ±1.29 × 10^-2 %.

4.3 Learning Algorithms

Both learning stages of our classification models were based on the backpropagation algorithm. Originally, this learning strategy updates the network weights and biases in the direction of the negative of the gradient, the direction in which the performance function E decreases most rapidly, E being a squared-error cost function given by

E = \frac{1}{2} \sum_{i=1}^{p} ||y_i - d_i||^2

where y_i and d_i are respectively the network and the desired outputs and p is the number of training patterns given to the network in the learning stage. There are many variations of the backpropagation algorithm, which fundamentally aim to increase the learning performance, making network convergence ten to one hundred times faster. Typical variants fall into two categories: the first uses heuristic techniques, such as the momentum technique or varying learning rates. The second category uses standard numerical optimization techniques, such as searches along conjugate directions (with Fletcher-Reeves [15] or Powell-Beale [16] updates) or quasi-Newton algorithms (Broyden, Fletcher, Goldfarb and Shanno [17] and one-secant [18] update rules) which, although based on the Hessian matrix to adjust weights and biases, do not require the calculation of second derivatives, approximating this matrix at each iteration of the algorithm. In the following section we give results for the error rates obtained by each of the backpropagation variants used, both in the learning and classification stages, with respect to the number of images used in the learning process and the networks' topology.

4.4 Results and Discussion

The method proposed in this paper has - at least - 3 parameters that have an impact on its final accuracy: the learning algorithm used, the networks' topology and the amount of data (number of images) used for learning.
Table 1. Comparison between the average error rates observed in the learning and classification stages, for the variants of the backpropagation algorithm used in our experiments

Learning Algorithm   | Time (Sc)    | Learning Error (Sc) | Classification Error (Sc) | Time (Ir)     | Learning Error (Ir) | Classification Error (Ir)
Fletcher-Reeves [15] | 2808 ± 7.35  | 0.027 ± 2.1E-4      | 0.029 ± 2.7E-4            | 3320 ± 8.98   | 0.020 ± 1.8E-4      | 0.021 ± 1.8E-4
Powell-Beale [16]    | 2751 ± 8.20  | 0.026 ± 2.3E-4      | 0.029 ± 2.7E-4            | 3187 ± 9.30   | 0.020 ± 2.0E-4      | 0.022 ± 2.1E-4
Broyden et al. [17]  | 4807 ± 9.14  | 0.026 ± 3.2E-4      | 0.031 ± 3.5E-4            | 5801 ± 10.52  | 0.019 ± 2.7E-4      | 0.023 ± 2.9E-4
One-secant [18]      | 2993 ± 7.13  | 0.030 ± 2.2E-4      | 0.034 ± 2.4E-4            | 3491 ± 8.61   | 0.024 ± 2.0E-4      | 0.031 ± 2.1E-4
(a) Error rates of the sclera classification stage.
(b) Error rates of the iris classification stage.
Fig. 6. Error rates obtained on the test data set, regarding the number of images used in the learning stage (depth axis) and the number of neurons in the network hidden layer (horizontal axis, expressed in proportion with the dimension of the feature space). The error rates are expressed in percentage and are averages of 20 neural networks with the respective configuration.
As an exhaustive search for the optimal configuration leads to a 3D search space and an impracticable number of possibilities to evaluate, we decided to start with the selection of the most suitable backpropagation learning algorithm for this type of problem. We built a set of neural networks with what we considered to be an a priori reasonable topology (3 layers, with the number of neurons in the input and hidden layers equal to the dimension of the feature space) and used 30 images in the construction of the learning set, from which we randomly extracted 50 000 instances, half-divided between positive (iris) and negative (non-iris) samples. Table 1 gives the obtained results. The "Learning Error" columns give the average errors obtained in the learning stages, "Time" the average elapsed time of the learning processes (in seconds), and "Classification Error" the average error obtained on the test set images. "Sc" and "Ir" denote respectively the sclera and iris classification stages. All values are given with 95% confidence intervals. These experiments led us to select the Fletcher-Reeves [15] learning method for the backpropagation algorithm and use it in the subsequent experiments, namely in the search for the optimal network topology and the minimum number of images required in the learning set. Figure 6 gives 3D graphs that contain the error rates obtained on the test set, according to the number of images used in the training set (depth axis) and the number of neurons of the networks' hidden layer (horizontal axis). The error rates are averages over 20 neural networks and are expressed as percentages.
(a) Close-up iris image 1.
(b) Close-up iris image 2.
(c) Close-up iris image 3.
(d) Segmented noise-free iris data 1.
(e) Segmented noise-free iris data 2.
(f) Segmented noise-free iris data 3.
Fig. 7. Examples of the iris segmentation results obtained by the proposed method. The upper row contains the original images and the bottom row the corresponding segmentation of the noise-free iris regions.
Not surprisingly, we observed that the networks' accuracy has a direct correspondence with the number of neurons in the hidden layer and with the number of images used in the learning process. However, we concluded that the error rates tend to stabilize when more than 40 images are used in the training set, with a number of neurons in the hidden layer of about 1.5 times the dimension of the feature space. These observations were confirmed both in the sclera and in the iris classification models. Interestingly, the lowest error rates were obtained in the iris classification stage, which we attribute to the useful information that the iris detection networks receive about the sclera localization, lessening the difficulty of their classification task. The lowest iris classification error obtained was about 1.87%, which we consider highly acceptable. This corresponds to about 2244 misclassified pixels per image, which could be reduced further if basic image processing methods were applied to the network's output. For instance, morphological operators (erosion) would eliminate small regions of black pixels separated from the main iris region, which we often observed to be the cause of small errors, as illustrated in figure 7.
5 Conclusions

Due to favorable comparisons with other biometric traits, the popularity of the iris has grown considerably over the last years, and substantial attention has been paid to it by both commercial and governmental organizations. Also, growing efforts are concentrated on finding the minimum level of image quality that enables recognition with enough confidence. In this paper we used the neural pattern recognition paradigm to propose a method that performs the detection of the eye, the iris segmentation and the discrimination of the
noise-free iris texture, analyzing images acquired in the visible wavelength under less constrained image acquisition processes. Our approach comprises two binary classification stages. At first, we used the HSV and YCbCr color spaces to provide information about the sclera localization. This information is combined with a new feature set to discriminate the noise-free iris regions of the images. We concluded that the proposed method accomplishes its major purposes and achieves very low error rates, even when starting from a relatively small set of images to learn appropriate classification models.
Acknowledgements. We acknowledge the financial support given by "FCT - Fundação para a Ciência e Tecnologia" and "FEDER" in the scope of the PTDC/EIA/69106/2006 research project "BIOREC: Non-Cooperative Biometric Recognition".
References
1. National Institute of Standards and Technology: Iris Challenge Evaluation (2006), http://iris.nist.gov/ICE/
2. Proença, H., Alexandre, L.A.: The NICE.I: Noisy Iris Challenge Evaluation, Part I. In: Proceedings of the IEEE First International Conference on Biometrics: Theory, Applications and Systems (BTAS 2007), Washington, pp. 27–29 (2007)
3. Wildes, R.P.: Iris recognition: an emerging biometric technology. Proceedings of the IEEE 85(5), 1348–1363 (1997)
4. Daugman, J.G.: High confidence visual recognition of persons by a test of statistical independence. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(11), 1148–1161 (1993)
5. Ross, A., Shah, S.: Segmenting non-ideal irises using geodesic active contours. In: Proceedings of the IEEE 2006 Biometric Symposium, U.S.A., pp. 1–6 (2006)
6. Daugman, J.G.: New methods in iris recognition. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 37(5), 1167–1175 (2007)
7. Arvacheh, E., Tizhoosh, H.: A Study on Segmentation and Normalization for Iris Recognition. MSc dissertation, University of Waterloo (2006)
8. Zuo, J., Kalka, N., Schmid, N.: A robust iris segmentation procedure for unconstrained subject presentation. In: Proceedings of the Biometric Consortium Conference, pp. 1–6 (2006)
9. Vatsa, M., Singh, R., Noore, A.: Improving iris recognition performance using segmentation, quality enhancement, match score fusion, and indexing. IEEE Transactions on Systems, Man, and Cybernetics - Part B 38(3) (2008)
10. Liu, X., Bowyer, K.W., Flynn, P.J.: Experiments with an improved iris segmentation algorithm. In: Proceedings of the Fourth IEEE Workshop on Automatic Identification Advanced Technologies, pp. 118–123 (2005)
11. Dobes, M., Martineka, J., Dobes, D.S.Z., Pospisil, J.: Human eye localization using the modified Hough transform. Optik 117, 468–473 (2006)
12. Basit, A., Javed, M.Y.: Iris localization via intensity gradient and recognition through bit planes. In: Proceedings of the International Conference on Machine Vision (ICMV 2007), pp. 23–28 (2007)
13. Viola, P., Jones, M.: Robust real-time face detection. International Journal of Computer Vision 57(2), 137–154 (2002)
14. Proença, H., Filipe, S., Santos, R., Oliveira, J., Alexandre, L.A.: Toward covert iris recognition: A database of visible wavelength images captured on-the-move and at-a-distance. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics (submitted, 2008)
15. Fletcher, R., Reeves, C.: Function minimization by conjugate gradients. Computer Journal 7, 149–154 (1964)
16. Powell, M.: Restart procedures for the conjugate gradient method. Mathematical Programming 12, 241–254 (1977)
17. Dennis, J., Schnabel, R.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs (1983)
18. Battiti, R.: First and second order methods for learning: Between steepest descent and Newton's method. Neural Computation 4(2), 141–166 (1992)
3D Textural Mapping and Soft-Computing Applied to Cork Quality Inspection Beatriz Paniagua1, Miguel A. Vega-Rodríguez1, Mike Chantler2, Juan A. Gómez-Pulido1, and Juan M. Sánchez-Pérez1 1
Dept. Technologies of Computers and Communications, University of Extremadura, Escuela Politécnica. Campus Universitario s/n, 10071, Cáceres, Spain {bpaniagua,mavega,jangomez,sanperez}@unex.es 2 School of Mathematical & Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom
[email protected]
Abstract. This paper presents a solution to an existing problem in the cork industry: cork stopper/disk classification according to quality. Cork is a natural and heterogeneous material; therefore, its automatic classification (seven quality classes exist) is very difficult. The solution proposed in this paper combines the extraction of 3D cork features and soft computing. In order to evaluate the performance of the neuro-fuzzy network designed, we compare its results with four other basic classifiers working in the same feature space. Our experiments showed that the best results for cork quality classification were obtained with the proposed system, which works with the following features: the combined depth+intensity feature, weighted depth, the second depth level feature, root mean square roughness and three other textural features (wavelets). The classification results obtained improve considerably on results reported in similar studies.
1 Introduction

Cork is the biological term for the tissue that the cork tree produces around its trunk. The most important industrial application of cork is the production of stoppers and disks for sealing champagnes, wines and liquors [1]. Because cork is a highly heterogeneous material, the classification process has traditionally been carried out by human experts. At the moment, there are several models of electronic machines on the market for the classification of cork stoppers and disks. The performance of these machines is acceptable for high-quality stoppers/disks, but for intermediate or low quality, the number of samples classified erroneously is large (around 40%). As a consequence, the stoppers/disks have to be re-evaluated by human experts later, which slows down the process and increases its cost enormously. To these antecedents we have to add the fact that Spain is the second largest producer of cork in the world [2], and that in Extremadura (a south-western region of Spain) the cork industry is one of the most important industries (10% of the world's cork). All these motivations have led us to the development of this research, whose main objective is the construction of a computer vision system for cork classification based on advanced methods of image processing and feature extraction.
There are other similar studies that try to solve the problem of cork quality classification. In our case, as we will see in the results section, our system obtains a very good classification result, with an error rate of only 4.85%. This is a large improvement over the work of Gonzalez-Adrados et al. [3] (33.33% error rate), which focused on heuristic feature extraction methods combined with discriminant analysis. Vega-Rodriguez et al. [4] studied a system that used a simplified cork quality standard (only 5 of the 7 regular classes) and worked with a joint system of reconfigurable hardware (FPGAs) and mainly thresholding techniques; although the processing times are very good, the final performance in terms of error rate (13.94%) is worse than that obtained by our system. Costa et al. [5] used a system based on CDA (Canonical Discriminant Analysis) and SDA (Stepwise Discriminant Analysis) techniques, obtaining an error rate of 14%. Chang et al. [6] designed a visual system based on morphological filtering, contour extraction and neural networks; in this case the error rate (around 6.7%) is also higher than that obtained by our proposed system, probably due to the low complexity of the extracted features. Finally, the system based on feature extraction and Bayesian classifiers (again with a simplified cork quality standard of 5 classes) developed by Radeva et al. [7] obtains a mean error rate of 6%, which our system also improves upon. The rest of the paper is organized as follows: section 2 briefly describes the data used for the development of our experiments. In section 3, we present the 3D textural features used by the classifiers. Then, section 4 presents the different aspects of the neuro-fuzzy network design. Finally, section 5 shows the final results for the proposed system, while section 6 presents the conclusions.
2 Data Used

The Cork Image DataBase (CID) used in our experiments consists of 700 images taken from 350 cork disks. There are seven different quality classes, with 50 disks in each class. The 3D acquisition system was a laser camera from the ShapeGrabber company [8][9]. The data are stored in "3pi" files, created by the ShapeGrabber software to represent scanned data in ASCII. When scanning with the ShapeGrabber system, the data are acquired one profile at a time, a profile being the data collected from the processing of one full laser line. The points of a profile are ordered in ascending order along the X axis. For each of these points, the coordinates of the point (x, y and z), the intensity value and the order of the point in the profile are given. We use this information to extract the 3D cork quality features on which this paper focuses. There are many commercial applications for visualizing these 3pi files for observation purposes (figure 1). We should highlight that the initial classification on which this study is based was made by a human expert from IPROCOR (in Spanish: "Instituto de la Madera, el Corcho y el Carbón Vegetal"; in English: "Research Institute for Wood, Cork and Vegetable Coal") [10]. We assume this classification is optimal and we want to know which classifier obtains the most similar classification results.
Fig. 1. (a) Class 0 disk (b) Class 1 disk (c) Class 2 disk (d) Class 3 disk (e) Class 4 disk (f) Class 5 disk (g) Class 6 disk
3 3D Textural Mapping

In order to be able to run our final system, we first have to extract a collection of features that can detect cork quality. To this end, we evaluated a whole set of 3D features obtained from the 3D maps acquired with the laser camera. These features were based on the special 3D characteristics of the cork surface. The features that obtained the best results for cork quality classification are described in detail in the next sections.

3.1 Combined Depth and Intensity

We decided to study a new feature as an extension of another feature studied in early stages of the research, which produced good results in a 2D approach [11]. Concretely, the extracted feature is the percentage of the cork area occupied by 3D points that are not only dark (intensity) but also deep (Z coordinate). A cork stopper with a high value of this feature is likely to have low quality, because the higher this value, the more deep defects exist within the cork.

3.2 Weighted Depth

This 3D feature considers the 3D points in the cork sample by evaluating them according to their depth and then computing a weighted percentage of the different defect pixels. It was designed because it is relevant to consider the severity of the different defects within the cork surface: the deeper the defect, the more serious it can be considered (equation 1).
WeightedDepth = \frac{(p_0 \cdot 0) + (p_1 \cdot 1) + \dots + (p_6 \cdot 6)}{\text{total number of pixels}}   (1)
Where p_i is the number of pixels at depth level i. The main difference with respect to percentages that simply relate the amount of defect pixels to the total area of the cork sample is that the different severity of these pixels is now taken into account.
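For illustration, a possible computation of this feature from a quantized depth map (a sketch under our own naming; the quantization of depths into the seven levels used in the paper is assumed to have been done beforehand):

```python
import numpy as np

def weighted_depth(depth_levels):
    """Weighted depth feature (eq. 1).

    depth_levels: 2D integer array holding the quantized depth level (0..6)
    of every pixel inside the cork sample region.
    """
    levels = np.arange(7)
    counts = np.array([(depth_levels == i).sum() for i in levels])  # p_0 .. p_6
    return float((counts * levels).sum()) / depth_levels.size
```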
3.3 Level Based

After considering the different depth levels jointly in the weighted depth feature, we thought it could be interesting to analyze each of the depth levels independently. For this, we studied the performance of the features composed of each of the different depth levels, obtaining the results shown in figure 2.
Fig. 2. Final results for the different depth levels
The best results were obtained by the feature composed of the percentage of pixels at the second textural depth level within the cork surface. Thanks to this study we discovered empirically that considering the different depth levels can help the system detect the quality of the cork samples.

3.4 Root Mean Square Roughness

Once we had information about the depth of the cork defects, a new feature was designed. Thanks to the information obtained from the 3D maps, we can now evaluate the roughness of the cork texture. This is an important descriptor of cork quality, because the roughness of the texture defines the overall look of the stopper, which is one of the cues human experts use to decide whether or not a cork stopper belongs to a certain class. The extraction of this feature had some constraints: after observing the 3D map database in detail, we realized that some samples had physical deformations. This is not strange considering that the cork samples are made of a natural material, which is affected by the environmental conditions of the place where they are stored (temperature and humidity); the automatic machines that produce the cork samples can also introduce mechanical defects. These facts were not relevant in the 2D approach, but they are now key aspects to take into account for a consistent analysis. This sample side deviation (figure 3) means that some pixels would be considered important depth defects while they are, in fact, part of the non-defective cork surface in the lower part of the slope produced by the side deviation. Problems can also arise when the lateral sides of the samples are tilted.
Fig. 3. Deviation on the disk surface
We designed a way to cope with the aforementioned problems. The problem with the tilted lateral sides can be easily solved by considering the borders of the stopper as outliers and removing them; for this purpose, an edge removal procedure was developed. The slopes produced by the deviations of the frontal side can be ignored if we fit a plane to the cork surface with linear regression and compute our roughness feature with respect to the theoretical plane that fits the cork surface. Edge Removal Procedure. Although this first step could be considered trivial, it was not: we had to design an edge removal procedure with no spatial information (figure 4).
Fig. 4. Edge removal procedure design
We have no spatial information because the 3pi files store the different profiles sequentially, and there is no straightforward way to know where a specific pixel is placed. We define the stopper radius as r (for simplicity, we consider r = 1), and we want to determine the radius s after the edge removal and also the number of points n that are removed from each profile. We can obtain this information by first considering the angle of s for each of the profiles. This angle α can be defined by the equivalent equations in (2).
\cos \alpha = \frac{k}{r}, \quad \sin \alpha = \frac{y}{r}   (2)
Where k is half the original length of the profile and y is half the length of the profile once the edges have been removed. Thus, using the Pythagorean theorem, we can obtain x, the number of points to keep on each side of a profile, and calculate the number of points to remove, n = k − x, for each profile. In this way we remove the outliers from the 3D map, solving the first deviation problem prior to feature extraction.
Linear Regressions. In statistics, a linear regression is a regression method that models the relationship between a dependent variable y and some independent variables X_i, i = 1, ..., p (equation 3):

y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p   (3)
Where β_0 is the constant term, the β_i are the respective parameters of the independent variables, and p is the number of parameters to be estimated in the linear regression. In our case we fit this same model to the 3D coordinates obtained with the laser scanner, estimating the plane z' = a·x + b·y + c (parameters a, b and c). This z' is the optimal depth value with respect to the plane, with no deformations and no deviations. By comparing this value with the real one, z, which is the value acquired with the 3D scanner, we can compute our roughness feature. The root mean square roughness feature is defined in equation 4.
RootMeanSquareRoughness = \sqrt{ \frac{1}{n} \sum_{k=0}^{n-1} (z_k - z'_k)^2 }   (4)
Where n is the number of points in the file.
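A possible implementation of the plane fit and of equation (4) is sketched below (numpy-based, under our own naming; it assumes the edge points of each profile have already been removed):

```python
import numpy as np

def rms_roughness(points):
    """Root mean square roughness of a scanned cork surface.

    points: (n, 3) array of (x, y, z) coordinates from the 3D scanner,
    with the profile edges already trimmed.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Least-squares fit of the plane z' = a*x + b*y + c (eq. 3 with two regressors).
    A = np.column_stack([x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    z_fit = A @ coeffs
    # Eq. 4: RMS of the residuals between measured and fitted depth values.
    return float(np.sqrt(np.mean((z - z_fit) ** 2)))
```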
4 Neuro-Fuzzy Classifier

Previous experiments with neural networks showed the shortcomings of classical neural network computing in this application field (cork classification). We hoped to improve on these previous results by extending the research to neuro-fuzzy computing, which seems more suitable for our problem. Neuro-fuzzy networks [12] are systems that include aspects of neural networks, in that they are able to learn and generalize, as well as aspects of fuzzy logic, since they work with logical reasoning based on inference rules. The neuro-fuzzy approach has been successfully tested in other application fields, e.g. [13]. Our base neural network architecture is a back-propagation network composed of three basic layers, which are explained in detail in the following sections.

4.1 Input Layer

The input layer is the one that receives the external inputs. In our system, in order to classify a cork disk into a specific class, we use the best features studied in our 3D textural mapping, i.e., those that obtained the best cork classification rates: combined depth+intensity, weighted depth, the second depth level feature, root mean square roughness and three other textural features (wavelets). These last three features are not described in this paper, but they obtained competitive error rates and we decided to include them to enrich the final system. Therefore, the input layer is composed of 7 input neurons.

4.2 Output Layer

The output layer gives the class to which an input pattern (cork stopper) belongs, and also the membership degree of the cork stopper for every quality class. For that
purpose we defined a fuzzy membership function, which allows fuzzy data modeling when a feature cannot be accurately assigned to one of two contiguous cork classes. In our problem, we have 7 cork quality classes in a 6-dimensional feature space. The weighted distance Z_{ik} of the i-th input pattern to each of our cork quality classes is defined as:
Z_{ik} = \sum_{j=1}^{6} \left[ \frac{F_{ij} - O_{kj}}{V_{kj}} \right], \quad \text{for } k = 1, \dots, 7   (5)
Where F_{ij} is the value of the j-th component of the input features of the i-th pattern. The 6-dimensional vectors O_{kj} and V_{kj} denote, respectively, the mean and the standard deviation of the j-th input feature over the numerical training data of the k-th class. The weighting by V_{kj} takes the variance of the classes into account, so that a feature with higher variance has less significance in characterizing a class. The distances Z_{ik} are stored in a 7-position array, and this vector defines the distance that separates each input pattern from each class. With Z_{ik} defined, we can now define the fuzzy membership function μ_k of an input feature pattern F_i to each cork class as:
\mu_k = 1 - \frac{ Z_{ik} - \min_k(Z_{ik}) }{ \max_k(Z_{ik}) - \min_k(Z_{ik}) }, \quad \text{for } k = 1, \dots, 7   (6)

Obviously, μ_k(F_i) lies in the interval [0, 1]: the larger the distance of a pattern from a class, the lower its membership value for that class. These membership values are calculated simultaneously for every class and constitute the target output for an input pattern, expressing its membership degree to each cork quality class. As we have 7 classes, the output layer therefore needs 7 neurons presenting these membership values.
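To make equations (5) and (6) concrete, a possible computation of the target membership vector for one pattern is sketched below (our own naming; the per-class means and standard deviations are assumed to have been estimated from the training data, and equation (5) is reproduced as printed):

```python
import numpy as np

def membership_targets(f, class_means, class_stds):
    """Target membership degrees of one feature pattern to the 7 cork classes.

    f: feature vector of one pattern, shape (n_features,)
    class_means, class_stds: arrays of shape (n_classes, n_features)
    holding O_kj and V_kj estimated from the training data.
    """
    # Weighted distance to every class, as in eq. 5 (in practice one may
    # prefer |.| or (.)^2 inside the sum to keep the distance non-negative).
    z = np.sum((f - class_means) / class_stds, axis=1)
    # Normalised membership in [0, 1]: the closest class gets 1 (eq. 6).
    return 1.0 - (z - z.min()) / (z.max() - z.min())
```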
4.3 Hidden Layer

The intermediate or hidden layers are those between the input and the output layer, where each neuron is fully interconnected with the neurons in the input and the output layers. The number of hidden neurons determines the learning rate of the neural network. There is no precise recipe for deciding the optimum number of hidden neurons for a specific problem; some authors suggest that the neural network should have the minimum number of hidden neurons that yields good behaviour [14][15][16], which can be determined by evaluating the performance of different architectures on the training patterns. We found that, for our input-output configuration, the minimum optimum number of hidden neurons was 15. Therefore, the final architecture of the system is 7×15×7.
5 Results

In this last study, in order to test the performance of the final system, we compare its results with those obtained by four other classification algorithms: a classical neural classifier, a K-means classifier, a K-nearest neighbours classifier and a Euclidean classifier (figure 5).
Fig. 5. Final results for the studied classifiers
After evaluating all these classifiers on our problem, we can conclude that the neuro-fuzzy classifier obtained the best results. The inclusion of fuzzy logic and the change of the neural network architecture have greatly improved on the misclassification percentages obtained by the other classification algorithms. We also present these results in a comparison table in order to show the wrongly classified samples for each classifier (table 1).
Table 1. Wrongly classified samples and final error rates for the studied classifiers

Classification Algorithm | C0 | C1 | C2 | C3 | C4 | C5 | C6 | Final Error Rate
Classical NN             | 47 | 36 |  7 |  2 |  2 |  5 |  2 | 28.85%
K-means                  |  1 |  4 | 18 | 23 |  1 |  0 |  0 | 13.42%
Euclidean                |  0 |  2 |  5 |  4 |  8 |  3 |  5 | 7.71%
K-nearest N              |  1 |  0 |  3 |  4 |  8 |  4 |  5 | 7.14%
Fuzzy NN                 |  1 |  1 |  2 |  1 |  6 |  2 |  4 | 4.85%
Remember that we have 7 quality classes and 50 cork disks per class (a total of 350 cork disks). As we can observe, the classical neural network has many problems differentiating classes 0 and 1 (the best ones). On the other hand, K-means has difficulties with the intermediate classes (2 and 3). The rest of the classifiers obtain good results, although the best is clearly the fuzzy neural network (4.85% error rate). We compared the error rate obtained by our system (using the fuzzy NN) with that obtained by the electronic classification machines that exist in the cork industry at
present (around 40%): our result improves considerably on what can currently be found in an industrial environment. Our results are also better than those obtained by other authors: [3] (33.33% error rate), [4] (13.94%), [5] (14%), [6] (6.7%) and [7] (mean error rate of 6%).
6 Conclusions

In this paper we have performed a thorough study of 3D cork feature extraction and soft computing. Concretely, we have studied several features based on the depth of the different points of the cork surface and one feature based on the roughness of the cork texture itself. For the evaluation of all these features we have proposed our own methodology based on the combined use of these features with a soft-computing-based classifier. According to the experimental results we can say that, for cork classification, the best system is the one that combines cork texture, cork defects, 3D information and a classifier that allows some overlap between cork quality classes (the neuro-fuzzy classifier). In conclusion, the experiments on real cork stoppers/disks show the high effectiveness and utility of our approach. Furthermore, the use of this system could reduce costs and time, and also many of the current conflicts in the cork industry due to the lack of a universal standard of stopper quality.
References
1. Fortes, M.A.: Cork and Corks. European Rev. 1, 189–195 (1993)
2. CorkQC: The Natural Cork Quality Council, Industry Statistics (2008), http://www.corkqc.com
3. Gonzalez-Adrados, J.R., Lopes, F., Pereira, H.: The Quality Grading of Cork Planks with Classification Models based on Defect Characterization. Holz als Roh- und Werkstoff 58(1-2), 39–45 (2000)
4. Vega-Rodriguez, M.A., Sanchez-Perez, J.M., Gomez-Pulido, J.A.: Using Computer Vision and FPGAs in the Cork Industry. In: Proceedings of the IEEE Mechatronics and Robotics 2004 (2004)
5. Costa, A., Pereira, H.: Decision Rules for Computer-Vision Quality Classification of Wine Natural Cork Stoppers. American Journal of Enology and Viticulture 57, 210–219 (2006)
6. Chang, J., Han, G., Valverde, J.M., Griswold, N.C., Duque-Carrillo, J.F., Sánchez-Sinencio, E.: Cork Quality Classification System using a Unified Image Processing and Fuzzy-Neural Network Methodology. IEEE Transactions on Neural Networks 8(4), 964–974 (1997)
7. Radeva, P., Bressan, M., Tobar, A., Vitrià, J.: Real-time Inspection of Cork Stoppers using Parametric Methods in High Dimensional Spaces. In: The IASTED Conference on Signal and Image Processing, SIP (2002)
8. ShapeGrabber Inc.: ShapeGrabber Central Manual (2005)
9. ShapeGrabber Inc.: ShapeGrabber Laser System SG Series Manual (2005)
10. ICMC: Instituto del Corcho, Madera y Carbón Vegetal, Instituto de Promoción del Corcho (ICMC-IPROCOR) (2008), http://www.iprocor.org
11. Paniagua-Paniagua, B., Vega-Rodríguez, M.A., Gómez-Pulido, J.A., Sánchez-Pérez, J.M.: Comparative Study of Thresholding Techniques to Evaluate Cork Quality. In: Visualization, Imaging, and Image Processing (VIIP 2006), vol. I, pp. 447–452 (2006)
12. Jang, J.S.R., Sun, C.T., Mizutani, E.: Neuro-Fuzzy and Soft Computing. Prentice Hall, New Jersey (1997)
13. Monzon, J.E., Pisarello, M.I.: Identificación de Latidos Cardíacos Anómalos con Redes Neuronales Difusas. Comunicaciones Científicas y Tecnológicas, Universidad Nacional del Nordeste, Chaco-Corrientes, Argentina, E-038 (2004) (in Spanish)
14. Masters, T.: Practical Neural Network Recipes in C++. Academic Press, London (1993)
15. Smith, M.: Neural Networks for Statistical Modeling. Van Nostrand Reinhold, New York (1993)
16. Rzempoluck, E.J.: Neural Network Data Analysis Using Simulnet. Springer, New York (1998)
Analysis of Breast Thermograms Based on Statistical Image Features and Hybrid Fuzzy Classification Gerald Schaefer1 , Tomoharu Nakashima2 , and Michal Zavisek3 1
School of Engineering and Applied Science, Aston University, U.K 2 Department of Computer Science and Intelligent Systems, Osaka Prefecture University, Japan 3 Faculty of Electrical Engineering and Communication, Brno University of Technology, Czech Republic
Abstract. Breast cancer is the most commonly diagnosed form of cancer in women, accounting for about 30% of all cases. Medical thermography has been shown to be well suited for the task of detecting breast cancer, in particular when the tumour is in its early stages or in dense tissue. In this paper we perform breast cancer analysis based on thermography. We employ a series of statistical features extracted from the thermograms which describe bilateral differences between the left and right breast areas. These features then form the basis of a hybrid fuzzy rule-based classification system for diagnosis. The rule base of the classifier is optimised through the application of a genetic algorithm, which ensures a small set of rules coupled with high classification performance. Experimental results on a large dataset of nearly 150 cases confirm the efficacy of our approach.
1 Introduction
Breast cancer is the most commonly diagnosed form of cancer in women, accounting for about 30% of all cases [1]. Despite earlier, less encouraging studies, which were based on low-capability and poorly calibrated equipment, medical thermography (or medical infrared imaging), which uses cameras with sensitivities in the infrared to provide a picture of the temperature distribution of the human body or parts thereof [2], has been shown to be well suited for the task of detecting breast cancer, in particular when the tumour is in its early stages or in dense tissue [3,4]. Early detection is important as it provides significantly higher chances of survival [5], and in this respect infrared imaging outperforms the standard method of mammography, which can detect tumours only once they exceed a certain size. Tumours that are still small can, on the other hand, be identified using thermography due to the high metabolic activity of cancer cells, which leads to an increase in local temperature that can be picked up in the infrared. In our earlier work [6] we extracted a series of statistical image features from the breast thermograms to describe bilateral differences between the left
and right side that hint at the presence of a tumour, and employed a fuzzy if-then rule-based system as classifier. As with every rule-based system, it suffers from the curse of dimensionality, and we therefore had to reduce the rule set so that only two attributes were present in each rule. In this paper we take a different approach to arrive at a compact and effective rule base and apply a genetic algorithm that optimises the features and parameters of the fuzzy rules. The resulting classification system is more compact and hence faster than our previous approach while maintaining the same good classification performance, as demonstrated by experimental results on a set of nearly 150 cases where we achieve a correct classification rate of about 80%, comparable to other imaging modalities such as mammography.
2 Breast Thermogram Feature Analysis
In our work we restrict our attention to frontal view images of the breasts. As has been shown earlier, an effective approach to automatically detect cancer cases is to study the symmetry between the left and right breast [7]. In the case of cancer presence, the tumour will recruit blood vessels, resulting in hot spots and a change in vascular pattern, and hence an asymmetry between the temperature distributions of the two breasts. We follow this approach and (manually) segment the areas corresponding to the left and right breast from the thermograms. Once segmented, we convert the breast regions to a polar co-ordinate representation, as this simplifies the calculation of several of the features that we employ. A series of statistical features is then calculated to provide indications of symmetry between the regions of interest (i.e. the two breasts). In the following we describe the features we employ.
2.1 Basic Statistical Features
Clearly, the simplest feature to describe a temperature distribution such as those encountered in thermograms is its statistical mean. As we are interested in symmetry features, we calculate the mean for both breasts and use the absolute value of the difference of the two. Similarly, we calculate the standard deviation of the temperature and use the absolute difference as a feature. We also employ the absolute differences of the median temperature and of the 90-percentile as further descriptors.
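A minimal sketch of these bilateral difference features (our own naming; each argument is the set of temperature values of one segmented breast region):

```python
import numpy as np

def basic_statistical_features(left, right):
    """Absolute bilateral differences of mean, standard deviation,
    median and 90-percentile temperature between the two breast regions."""
    def stats(t):
        t = np.asarray(t, dtype=np.float64)
        return np.array([t.mean(), t.std(), np.median(t), np.percentile(t, 90)])
    return np.abs(stats(left) - stats(right))
```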
2.2 Moments
Image moments are defined as

m_{pq} = \sum_{y=0}^{M-1} \sum_{x=0}^{N-1} x^p y^q g(x, y)   (1)

where x and y define the pixel location and N and M the image size. We utilise moments m_{01} and m_{10}, which essentially describe the centre of gravity of the
breast regions, as well as the distance (both in x and y direction) of the centre of gravity from the geometrical centre of the breast. For all four features we calculate the absolute differences of the values between left and right breast.

2.3 Histogram Features
Histograms record the frequencies of certain temperature ranges of the thermograms. In our work we construct normalised histograms for both regions of interest (i.e., left and right breast) and use the cross-correlation between the two histograms as a feature. From the difference histogram (i.e., the difference between the two histograms) we compute the absolute value of its maximum, the number of bins exceeding a certain threshold (empirically set to 0.01 in our experiments), the number of zero crossings, the energy, and the difference of the positive and negative parts of the histogram.

2.4 Cross Co-occurrence Matrix
Co-occurrence matrices have been widely used in texture recognition tasks [8] and can be defined as

\gamma^{(k)}_{T_i, T_j}(I) = \Pr_{p_1 \in I_{T_i},\, p_2 \in I} \left[ p_2 \in I_{T_j},\ |p_1 - p_2| = k \right]    (2)

with

|p_1 - p_2| = \max\left( |x_1 - x_2|,\ |y_1 - y_2| \right)    (3)

where T_i and T_j denote two temperature values and (x_k, y_k) denote pixel locations. In other words, given any temperature T_i in the thermogram, \gamma gives the probability that a pixel at distance k away is of temperature T_j. In order to arrive at an indication of asymmetry between the two sides we adopted this concept and derived what we call a cross co-occurrence matrix defined as

\gamma^{(k)}_{T_i, T_j}(I(1), I(2)) = \Pr_{p_1 \in I(1)_{T_i},\, p_2 \in I(2)} \left[ p_2 \in I(2)_{T_j},\ |p_1 - p_2| = k \right]    (4)
i.e., temperature values from one breast are related to temperatures of the second side. From this matrix we can extract several features [8]. The ones we are using are

Homogeneity  G = \sum_k \sum_l \frac{\gamma_{k,l}}{1 + |k - l|}    (5)

Energy  E = \sum_k \sum_l \gamma_{k,l}^2    (6)

Contrast  C = \sum_k \sum_l |k - l| \, \gamma_{k,l}    (7)

and Symmetry  S = 1 - \sum_k \sum_l |\gamma_{k,l} - \gamma_{l,k}|    (8)
We further calculate the first four moments m_1 to m_4

m_p = \sum_k \sum_l (k - l)^p \, \gamma_{k,l}    (9)
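As an illustration, the sketch below shows one way the cross co-occurrence matrix (4) and the derived descriptors (5)-(9) could be computed with NumPy. The quantisation of temperatures into integer levels, the handling of the offset k and all function names are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def cross_cooccurrence(left, right, levels=32, k=1):
    """Cross co-occurrence matrix between two quantised temperature maps.

    left, right: 2-D integer arrays of equal shape holding quantised
    temperature indices in [0, levels).  Pairs (left[y, x], right[y+dy, x+dx])
    with max(|dy|, |dx|) == k are accumulated and normalised to probabilities.
    """
    gamma = np.zeros((levels, levels), dtype=np.float64)
    h, w = left.shape
    offsets = [(dy, dx) for dy in range(-k, k + 1) for dx in range(-k, k + 1)
               if max(abs(dy), abs(dx)) == k]
    for dy, dx in offsets:
        a = left[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)].ravel()
        b = right[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)].ravel()
        np.add.at(gamma, (a, b), 1.0)          # count co-occurring temperature pairs
    return gamma / gamma.sum()

def cooccurrence_features(gamma):
    k = np.arange(gamma.shape[0])[:, None]
    l = np.arange(gamma.shape[1])[None, :]
    feats = {
        "homogeneity": np.sum(gamma / (1.0 + np.abs(k - l))),   # Eq. (5)
        "energy":      np.sum(gamma ** 2),                      # Eq. (6)
        "contrast":    np.sum(np.abs(k - l) * gamma),           # Eq. (7)
        "symmetry":    1.0 - np.sum(np.abs(gamma - gamma.T)),   # Eq. (8)
    }
    for p in range(1, 5):                                       # Eq. (9)
        feats["moment_m%d" % p] = np.sum(((k - l) ** p) * gamma)
    return feats
```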
2.5 Mutual Information
The mutual information MI between two distributions can be calculated from the joint entropy H of the distributions and is defined as

MI = H_L + H_R + H    (10)

with

H_L = -\sum_k p_L(k) \log_2 p_L(k), \quad H_R = -\sum_l p_R(l) \log_2 p_R(l), \quad H = \sum_k \sum_l p_{LR}(k, l) \log_2 p_{LR}(k, l)    (11)

and

p_{LR}(k, l) = x_{k,l} / \sum_{k,l} x(k, l), \quad p_L(k) = \sum_l p_{LR}(k, l), \quad p_R(l) = \sum_k p_{LR}(k, l)    (12)

and is employed as a further descriptor.
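A corresponding sketch for the mutual information descriptor of Eqs. (10)-(12) is given below, again only as an assumed illustration; here the joint distribution is taken to be a normalised cross co-occurrence matrix of the two breast regions.

```python
import numpy as np

def mutual_information(p_lr, eps=1e-12):
    """Mutual information of a joint distribution p_LR(k, l) (Eqs. 10-12)."""
    p_lr = p_lr / p_lr.sum()
    p_l = p_lr.sum(axis=1)                        # p_L(k)
    p_r = p_lr.sum(axis=0)                        # p_R(l)
    h_l = -np.sum(p_l * np.log2(p_l + eps))       # Eq. (11)
    h_r = -np.sum(p_r * np.log2(p_r + eps))
    h   =  np.sum(p_lr * np.log2(p_lr + eps))
    return h_l + h_r + h                          # Eq. (10)
```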
2.6 Fourier Analysis
As the last feature descriptors we calculate the Fourier spectrum and use the difference of the absolute values of the ROI spectra. The features we adopt are the maximum of this difference and the distance of this maximum from the centre.

2.7 Summary
To summarise, we characterise each breast thermogram using the following set of features: 4 basic statistical features, 4 moment features, 8 histogram features, 8 cross co-occurrence features, mutual information and 2 Fourier descriptors. We further apply a Laplacian filter to enhance the contrast and calculate another subset of features (the 8 cross co-occurrence features together with mutual information and the 2 Fourier descriptors) from the resulting images. In total we hence end up with 38 descriptors per breast thermogram which describe the asymmetry between the two sides. We normalise each feature to the interval [0, 1] to arrive at comparable units between descriptors.
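The normalisation step could look like the following minimal sketch; the dataset-wide min-max scaling is an assumption of the example, since the paper does not specify how the mapping to [0, 1] is carried out.

```python
import numpy as np

def normalise_features(feature_matrix):
    """Min-max normalise each of the 38 descriptors to [0, 1] over the dataset.

    feature_matrix: array of shape (n_thermograms, 38).  Constant columns are
    mapped to 0 to avoid division by zero.
    """
    fmin = feature_matrix.min(axis=0)
    frange = feature_matrix.max(axis=0) - fmin
    frange[frange == 0] = 1.0
    return (feature_matrix - fmin) / frange
```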
3 Fuzzy Rule-Based Classification
Pattern classification typically is a supervised process where, based on a set of training samples that have been manually classified by experts, a classifier is derived that automatically assigns unseen data samples to pre-defined classes. Let us assume that our pattern classification problem is an n-dimensional problem with C classes (in clinical diagnosis such as the detection of breast cancer, C is typically 2) and m given training patterns x_p = (x_{p1}, x_{p2}, ..., x_{pn}), p = 1, 2, ..., m. Without loss of generality, we assume each attribute of the given training patterns to be normalised into the unit interval [0, 1]; that is, the pattern space is an n-dimensional unit hypercube [0, 1]^n. In this study we use fuzzy if-then rules of the following type as the basis of our fuzzy rule-based classification system:

Rule R_j: If x_1 is A_{j1} and ... and x_n is A_{jn} then Class C_j with CF_j,  j = 1, 2, ..., N,    (13)
where R_j is the label of the j-th fuzzy if-then rule, A_{j1}, ..., A_{jn} are antecedent fuzzy sets on the unit interval [0, 1], C_j is the consequent class (i.e. one of the C given classes), and CF_j is the grade of certainty of the fuzzy if-then rule R_j. As antecedent fuzzy sets we use triangular fuzzy sets. Our underlying fuzzy rule-based classification system consists of N fuzzy if-then rules each of which has a form as in Equation (13). There are two steps in the generation of fuzzy if-then rules: specification of the antecedent part and determination of the consequent class C_j and the grade of certainty CF_j. The antecedent part of fuzzy if-then rules is specified manually. Then the consequent part (i.e. consequent class and the grade of certainty) is determined from the given training patterns [9]. In [10] it is shown that the use of the grade of certainty in fuzzy if-then rules allows us to generate comprehensible fuzzy rule-based classification systems with high classification performance.

3.1 Fuzzy Rule Generation
Let us assume that m training patterns x_p = (x_{p1}, ..., x_{pn}), p = 1, ..., m, are given for an n-dimensional C-class pattern classification problem. The consequent class C_j and the grade of certainty CF_j of the if-then rule are determined in the following two steps:

1. Calculate \beta_{Class\,h}(j) for Class h as

\beta_{Class\,h}(j) = \sum_{x_p \in Class\,h} \mu_j(x_p),    (14)

where

\mu_j(x_p) = \mu_{j1}(x_{p1}) \cdot \ldots \cdot \mu_{jn}(x_{pn}),    (15)

and \mu_{jn}(\cdot) is the membership function of the fuzzy set A_{jn}.
2. Find the class \hat{h} that has the maximum value of \beta_{Class\,h}(j):

\beta_{Class\,\hat{h}}(j) = \max_{1 \le k \le C} \{ \beta_{Class\,k}(j) \}.    (16)
If two or more classes take the maximum value, the consequent class C_j of the rule R_j can not be determined uniquely. In this case, specify C_j as C_j = \phi. If a single class \hat{h} takes the maximum value, let C_j be Class \hat{h}. The grade of certainty CF_j is determined as

CF_j = \frac{\beta_{Class\,\hat{h}}(j) - \bar{\beta}}{\sum_h \beta_{Class\,h}(j)}    (17)

with

\bar{\beta} = \frac{\sum_{h \ne \hat{h}} \beta_{Class\,h}(j)}{C - 1}.    (18)
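To make the two steps concrete, the sketch below determines the consequent class and certainty grade of a single rule from precomputed rule memberships \mu_j(x_p). The function name and the handling of the degenerate case are illustrative assumptions, not the authors' code.

```python
import numpy as np

def rule_consequent(memberships, labels, n_classes):
    """Consequent class and certainty grade of one fuzzy rule (Eqs. 14-18).

    memberships: mu_j(x_p) for every training pattern, shape (n_patterns,),
                 i.e. the product of per-attribute memberships from Eq. (15).
    labels:      class index of every training pattern (0 .. n_classes-1).
    Returns (consequent_class, CF); consequent_class is None when the
    maximum is not unique (C_j = phi).
    """
    beta = np.array([memberships[labels == h].sum()
                     for h in range(n_classes)])               # Eq. (14)
    winners = np.flatnonzero(beta == beta.max())
    if len(winners) != 1 or beta.sum() == 0:
        return None, 0.0                                       # C_j = phi
    h_hat = int(winners[0])
    beta_bar = (beta.sum() - beta[h_hat]) / (n_classes - 1)    # Eq. (18)
    cf = (beta[h_hat] - beta_bar) / beta.sum()                 # Eq. (17)
    return h_hat, cf
```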
3.2 Fuzzy Reasoning
Using the rule generation procedure outlined above we can generate N fuzzy if-then rules as in Equation (13). After both the consequent class C_j and the grade of certainty CF_j are determined for all N rules, a new pattern x = (x_1, ..., x_n) can be classified by the following procedure:

1. Calculate \alpha_{Class\,h}(x) for Class h, h = 1, ..., C, as

\alpha_{Class\,h}(x) = \max \{ \mu_j(x) \cdot CF_j \mid C_j = h \},    (19)

2. Find the class h^* that has the maximum value of \alpha_{Class\,h}(x):

\alpha_{Class\,h^*}(x) = \max_{1 \le k \le C} \{ \alpha_{Class\,k}(x) \}.    (20)
If two or more classes take the maximum value, then the classification of x is rejected (i.e. x is left as an unclassifiable pattern); otherwise we assign x to Class h^*.
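A matching sketch of this single-winner reasoning scheme follows; `rules` and `membership` are hypothetical structures assumed for the illustration.

```python
def classify(x, rules, membership):
    """Single-winner fuzzy reasoning (Eqs. 19-20).

    rules: iterable of (antecedent, consequent_class, cf) tuples; rules whose
    consequent is None (C_j = phi) are skipped.  membership(antecedent, x)
    returns mu_j(x).  Returns the winning class, or None if unclassifiable.
    """
    best = {}
    for antecedent, cls, cf in rules:
        if cls is None:
            continue
        score = membership(antecedent, x) * cf          # mu_j(x) * CF_j
        best[cls] = max(best.get(cls, 0.0), score)      # alpha_Class h(x), Eq. (19)
    if not best:
        return None
    alphas = sorted(best.values(), reverse=True)
    if len(alphas) > 1 and alphas[0] == alphas[1]:
        return None                                     # tie: pattern rejected, Eq. (20)
    return max(best, key=best.get)
```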
4 Hybrid Fuzzy Classification
While the basic fuzzy rule-based system introduced above provides a reliable and accurate classifier, it suffers, as do many other approaches, from the curse of dimensionality. In the case of our fuzzy classifier, the number of generated rules increases exponentially with the number of attributes involved and with the number of partitions used for each attribute. As preliminary experiments [11] have shown that fuzzy classifiers with less than 10 divisions per attribute were not sufficient to handle the data at hand, the resulting rule base is very resource demanding, both in terms of computational complexity and in required memory. We are therefore interested in arriving at a more compact classifier that affords the same classification performance while not suffering from these problems.
In our previous work [6] we applied a rule splitting step to deal with the problem of dimensionality. By limiting the number of attributes in each rule to 2, a much smaller rule base was developed. In this paper we take a different approach to arrive at an even more compact rule base by developing a hybrid fuzzy classification system through the application of a genetic algorithm (GA). The fuzzy if-then rules used do not change and are still of the same form as the one given in Equation (13), i.e., they contain a number of fuzzy attributes and a consequent class together with a grade of certainty. Our approach of using GAs to generate a fuzzy rule-based classification system is a Michigan style algorithm [12] which represents each rule by a string and handles it as an individual in the population of the GA. A population consists of a pre-specified number of rules. Because the consequent class and the rule weight of each rule can be easily specified from the given training patterns as shown in Section 3, they are not used in the coding of each fuzzy rule (i.e., they are not included in a string). Each rule is represented by a string using its antecedent fuzzy sets.

4.1 Genetic Operations
First the algorithm randomly generates a pre-specified number N_rule of rules as an initial population (in our experiments we set N_rule = 100). Next the fitness value of each fuzzy rule in the current population is evaluated. Let S be the set of rules in the current population. The evaluation of each rule is performed by classifying all the given training patterns by the rule set S as described in Section 3. The winning rule receives a unit reward when it correctly classifies a training pattern. After all the given training patterns are classified by the rule set S, the fitness value fitness(R_q) of each rule R_q in S is calculated as

fitness(R_q) = NCP(R_q),    (21)

where NCP(R_q) is the number of training patterns correctly classified by R_q. It should be noted that the following relation holds between the classification performance NCP(R_q) of each rule R_q and the classification performance NCP(S) of the rule set S used in the fitness function:

NCP(S) = \sum_{R_q \in S} NCP(R_q).    (22)
The algorithm is implemented so that only a single copy is selected as a winner rule when multiple copies of the same rule are included in the rule set S. In GA optimisation problems, multiple copies of the same string usually have the same fitness value. This often leads to undesired early convergence of the current population to a single solution. In our algorithm, only a single copy can have a positive fitness value and the other copies have zero fitness which prevents the current population from being dominated by many copies of a single or few rules. Then, new rules are generated from the rules in the current population using genetic operations. As parent strings, two fuzzy if-then rules are selected from the current population and binary tournament selection with replacement is
applied. That is, two rules are randomly selected from the current population and the better rule with the higher fitness value is chosen as a parent string. A pair of parent strings is chosen by iterating this procedure twice. From the selected pair of parent strings, two new strings are generated by a crossover operation. We use a uniform crossover operator where the crossover points are randomly chosen for each pair of parent strings. The crossover operator is applied to each pair of parent strings with a pre-specified crossover probability p_c. After new strings are generated, each symbol of the generated strings is randomly replaced with a different symbol by a mutation operator with a pre-specified mutation probability p_m. Usually the same mutation probability is assigned to every position of each string. Selection, crossover, and mutation are iterated until a pre-specified number N_replace of new strings are generated. Finally, the N_replace strings with the smallest fitness values in the current population are removed, and the newly generated N_replace strings are added to form a new population. Because the number of removed strings is the same as the number of added strings, every population consists of the same number of strings. That is, every rule set has the same number of rules. This generation update can be viewed as an elitist strategy where the number of elite strings is (N_rule - N_replace). The above procedures are applied to the new population again. The generation update is iterated until a pre-specified stopping condition is satisfied. In our experiments we use the total number of iterations (i.e., the total number of generation updates) as the stopping condition.

4.2 Algorithm Summary
To summarise, our hybrid fuzzy rule-based classifier works as follows:

Step 1: Parameter Specification. Specify the number of rules N_rule, the number of replaced rules N_replace, the crossover probability p_c, the mutation probability p_m, and the stopping condition.
Step 2: Initialization. Randomly generate N_rule rules (i.e., N_rule strings of length n) as an initial population.
Step 3: Genetic Operations. Calculate the fitness value of each rule in the current population. Generate N_replace rules using selection, crossover, and mutation from existing rules in the current population.
Step 4: Generation Update (Elitist Strategy). Remove the worst N_replace rules from the current population and add the newly generated N_replace rules to the current population.
Step 5: Termination Test. If the stopping condition is not satisfied, return to Step 3. Otherwise terminate the execution of the algorithm.

During the execution of the algorithm, we monitor the classification rate of the current population on the given training patterns. The rule set with the highest classification rate is chosen as the final solution.
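The step-wise procedure could be sketched as follows. The helpers `encode_random_rule` and `evaluate_rule_set`, the antecedent alphabet size and the fixed number of generations are assumptions of this illustration, not the authors' implementation; `evaluate_rule_set` is expected to classify every training pattern with the whole rule set (determining consequents and certainty grades as in Section 3) and return, per rule, the number of patterns it wins and classifies correctly, counting duplicate rules only once.

```python
import random

def evolve_rule_base(train_x, train_y, encode_random_rule, evaluate_rule_set,
                     n_rules=100, n_replace=20, p_cross=0.9, p_mut=0.1,
                     n_symbols=14, n_generations=1000):
    """Michigan-style GA over antecedent strings (a sketch, see lead-in)."""
    population = [encode_random_rule() for _ in range(n_rules)]
    best_rules, best_rate = list(population), -1.0

    def tournament(fitness):
        a, b = random.randrange(n_rules), random.randrange(n_rules)
        return population[a] if fitness[a] >= fitness[b] else population[b]

    for _ in range(n_generations):
        fitness = evaluate_rule_set(population, train_x, train_y)   # NCP(R_q), Eq. (21)
        rate = sum(fitness) / len(train_y)                          # NCP(S) / m, via Eq. (22)
        if rate > best_rate:
            best_rules, best_rate = list(population), rate

        children = []
        while len(children) < n_replace:
            p1, p2 = tournament(fitness), tournament(fitness)       # binary tournament
            c1, c2 = list(p1), list(p2)
            if random.random() < p_cross:                           # uniform crossover
                for i in range(len(c1)):
                    if random.random() < 0.5:
                        c1[i], c2[i] = c2[i], c1[i]
            for c in (c1, c2):                                      # mutation
                for i in range(len(c)):
                    if random.random() < p_mut:
                        c[i] = random.randrange(n_symbols)
            children.extend([c1, c2])

        # elitist generation update: replace the n_replace worst rules
        order = sorted(range(n_rules), key=lambda q: fitness[q], reverse=True)
        population = [population[q] for q in order[:n_rules - n_replace]]
        population += children[:n_replace]

    return best_rules, best_rate
```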
5 Experimental Results
For our experiment we utilised a dataset of 146 thermograms (29 malignant and 117 benign cases). It should be noted that this dataset is significantly larger than those used in previous studies (e.g. [7]). For all thermograms we calculate a feature vector of length 38 as outlined in Section 2. We perform standard 10-fold cross-validation on the dataset where the patterns are split into 10 disjoint sets and the classification performance of one such set, based on training the classifier with the remaining 90% of samples, is evaluated in turn for all 10 combinations. The hybrid fuzzy classifier that we use employs 100 rules which are generated through application of the GA with a crossover probability of 0.9 and a mutation probability of 0.1. The results are listed in Table 1 which gives the classification rate (CR) of the classifiers with 10 to 15 partitions (due to space limitations we restrict our attention only to those which provided the best classification performance). For comparison we also provide the results from our earlier work in [6]. It can be seen that a correct classification rate of about 80% is achieved, which is comparable to that achieved by other techniques for breast cancer diagnosis, with mammography typically providing about 80%, ultrasonography about 70%, MRI systems about 75% and DOBI (optical systems) reaching about 80% diagnostic accuracy [13]. Also, our new approach gives even slightly better results compared to our earlier work, despite the much smaller rule base of only 100 rules. We can therefore conclude that our presented approach is indeed useful as an aid for diagnosis of breast cancer and should prove even more powerful when coupled with another modality such as mammography.

Table 1. Results of breast cancer thermogram classification on test data based on 10-fold cross validation

# fuzzy partitions   CR from [6] [%]   CR hybrid fuzzy approach [%]
10                   78.05             80.27
11                   76.57             79.18
12                   77.33             80.89
13                   78.05             77.74
14                   79.53             79.25
15                   77.43             78.90

6 Conclusions
In this paper we presented a computational approach to the diagnosis of breast cancer based on medical infrared imaging. Asymmetry analysis of breast thermograms is performed using a variety of statistical features. These features are then fed into a hybrid fuzzy if-then rule based classification system which outputs a diagnostic prediction of the investigated patient. Experimental results on a large dataset of thermograms confirm the efficacy of the approach providing a classification accuracy of about 80% which is comparable to the performance achieved by other techniques including mammography.
Acknowledgements

The first author would like to acknowledge the Daiwa Foundation and the Sasakawa Foundation for supporting this work.
References

1. American Cancer Society: (Cancer facts and figures), http://www.cancer.org/docroot/STT/stt 0.asp
2. Jones, B.: A reappraisal of infrared thermal image analysis for medicine. IEEE Trans. Medical Imaging 17(6), 1019–1027 (1998)
3. Anbar, N., Milescu, L., Naumov, A., Brown, C., Button, T., Carly, C., AlDulaimi, K.: Detection of cancerous breasts by dynamic area telethermometry. IEEE Engineering in Medicine and Biology Magazine 20(5), 80–91 (2001)
4. Head, J., Wang, F., Lipari, C., Elliott, R.: The important role of infrared imaging in breast cancer. IEEE Engineering in Medicine and Biology Magazine 19, 52–57 (2000)
5. Gautherie, M.: Thermobiological assessment of benign and maligant breast diseases. Am. J. Obstet. Gynecol. 147(8), 861–869 (1983)
6. Schaefer, G., Nakashima, T., Zavisek, M., Yokota, Y., Drastich, A., Ishibuchi, H.: Thermography breast cancer diagnosis by fuzzy classification of statistical image features. In: IEEE Int. Conference on Fuzzy Systems, pp. 1096–1100 (2007)
7. Qi, H., Snyder, W.E., Head, J.F., Elliott, R.L.: Detecting breast cancer from infrared images by asymmetry analysis. In: 22nd IEEE Int. Conference on Engineering in Medicine and Biology (2000)
8. Haralick, R.: Statistical and structural approaches to texture. Proceedings of the IEEE 67(5), 786–804 (1979)
9. Ishibuchi, H., Nozaki, K., Tanaka, H.: Distributed representation of fuzzy rules and its application to pattern classification. Fuzzy Sets and Systems 52(1), 21–32 (1992)
10. Ishibuchi, H., Nakashima, T.: Effect of rule weights in fuzzy rule-based classification systems. IEEE Trans. Fuzzy Systems 9(4), 506–515 (2001)
11. Schaefer, G., Nakashima, T., Zavisek, M., Drastich, A., Yokota, Y.: Cost-sensitive classification of breast cancer thermograms. Thermology International 16(3) (2006)
12. Ishibuchi, H., Nakashima, T.: Improving the performance of fuzzy classifier systems for pattern classification problems with continuous attributes. IEEE Trans. on Industrial Electronics 46(6), 1057–1068 (1999)
13. Zavisek, M., Drastich, A.: Thermogram classification in breast cancer detection. In: 3rd European Medical and Biological Engineering Conference, pp. 1727–1983 (2005)
Efficient Facial Feature Detection Using Entropy and SVM

Qiong Wang, Chunxia Zhao, and Jingyu Yang

School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China
[email protected]
Abstract. In this paper, an efficient algorithm for facial feature detection is presented. Complex regions in a face image, such as the eyes, exhibit unpredictable local intensity and hence high entropy. We use this characteristic to obtain eye candidates, and then these candidates are sent to an SVM classifier to obtain the real eyes. According to the geometric relationship of the human face, the mouth search region is specified by the coordinates of the left eye and the right eye. Precise mouth detection is then performed. Experimental results demonstrate the effectiveness of the proposed method.
1 Introduction

Facial feature detection (eyes, nose, mouth, etc.) plays an important role in many facial image interpretation tasks such as face recognition, face verification, face tracking, facial expression recognition and 3D face modeling. Generally, there are two types of information available for facial feature detection [1]: (1) local texture around a given feature, for example, the pixel values in a small region around an eye, and (2) the geometric configuration of a given set of facial features, e.g., both eyes, nose and mouth. Many different methods for modeling these types of information have been proposed. In Ref. [1] a method for facial feature detection was proposed which utilizes the Viola and Jones face detection method [2], combined with the statistical shape models of Dryden and Mardia [3]. In Ref. [4] an efficient method was proposed for eye detection that used iris geometries to determine the region candidates which possibly contain the eye, and then symmetry for selecting the pair of eyes. In Ref. [5], Gabor features are used to extract eyes. The EOF (entropy of likelihood) feature points are found to do feature selection and correspondence for face images in Ref. [6]. In this paper, we propose an efficient approach combining image entropy and an SVM classifier to precisely extract the eyes and to locate the mouth with the coordinate information of the eyes. We address the problem of facial feature detection, so our research work is based on face detection. The rest of this paper is organized as follows. In Section 2, the eye candidate extraction method will be introduced. Eye verification will be presented in Section 3. In Section 4, the mouth detection algorithm will be introduced. Some experimental results will be demonstrated in Section 5 to corroborate the proposed approach. Section 6 concludes the paper.
2 Eye Candidates Extraction

Complex regions in a face image, such as the eyes, exhibit unpredictable local intensity and hence have higher entropy than skin regions, as illustrated in Fig. 1. This fact leads us to use entropy as a measure for uncertainty and unpredictability. We use this characteristic to obtain eye candidates.
Fig. 1. Entropy comparison between eye region and skin region. Entropy of (a) and (c) is higher than (b) and (d).
2.1 Image Entropy

The basic concept of entropy in information theory has to do with how much randomness there is in a signal or random event. An alternative way to look at this is to talk about how much information is carried by the signal. Claude E. Shannon [7] defines entropy in terms of a discrete random event x, with possible states (or outcomes) 1, ..., n, as:

H(x) = \sum_{i=1}^{n} p(i) \log_2 \left( \frac{1}{p(i)} \right) = -\sum_{i=1}^{n} p(i) \log_2 p(i)    (1)
Conversion from probability p(i) to entropy h(i) is illustrated in Fig. 2, which shows that probabilities close to zero or one produce low entropy and intermediate values produce entropies near 0.5.
Fig. 2. Conversion from probability pi to entropy hi
Shannon shows that any definition of entropy satisfying his assumptions will be of the form:

-K \sum_{i=1}^{n} p(i) \log p(i)    (2)
where K is a constant (and is really just a choice of measurement units). The texture of the input image can be characterized by using the entropy, which is a statistical measure of randomness. For an image x, quantised to M levels, the entropy H_x is defined as:

H_x = \sum_{i=0}^{M-1} p_i \log_2 \left( \frac{1}{p_i} \right) = -\sum_{i=0}^{M-1} p_i \log_2 p_i    (3)

where p_i (i = 0, ..., M-1) is the probability of the i-th quantiser level being used (often obtained from a histogram of the pixel intensities). For a grey-level image, the value of M is 256.

2.2 Eye Candidates Extraction

Our work focuses on facial feature location, so the face area is detected by using an upright frontal face detector [2]. Then eye candidates are extracted from the detected face area. A square window moves over the upper part of the detected face to extract eye candidates by calculating the entropy value in each window. The size of the moving window is calculated according to the face size; Eq. (4) gives the relationship.
win_eye = win_face / 4.6    (4)
where win_eye is the side length of the moving square window and win_face is the side length of the detected face region. The areas whose local entropy is above average are considered as eye candidates and sent to the eye verifier. Fig. 3 shows some results of eye candidate extraction.
Fig. 3. Examples of eye pair candidates extraction
In order to detect faces in different scales, the facial image is repeatedly scaled by a factor of 1.2. In each scale, all eye pair candidates are extracted and verified by the eyes verifier which will be described in the next section. Consequently, all the faces in one image can be detected.
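A minimal sketch of this candidate-extraction stage is given below; the stride, the restriction to the upper half of the face and the 8-bit grey-level input are assumptions of the example rather than details taken from the paper.

```python
import numpy as np

def patch_entropy(patch, levels=256):
    """Shannon entropy (Eq. 3) of an 8-bit grey-level patch."""
    hist = np.bincount(patch.ravel(), minlength=levels).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def eye_candidates(face, step=4):
    """Slide a square window over the upper part of a detected face and keep
    windows whose local entropy is above the average of all windows."""
    win = int(round(face.shape[1] / 4.6))            # window size, Eq. (4)
    upper = face[: face.shape[0] // 2]               # assumed upper-half search area
    boxes, entropies = [], []
    for y in range(0, upper.shape[0] - win + 1, step):
        for x in range(0, upper.shape[1] - win + 1, step):
            boxes.append((x, y, win, win))
            entropies.append(patch_entropy(upper[y:y + win, x:x + win]))
    entropies = np.asarray(entropies)
    keep = entropies > entropies.mean()
    return [b for b, k in zip(boxes, keep) if k]
```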
3 Eye Verifier

After eye candidates are extracted, an eye verifier is applied to obtain the real eyes. We train an SVM classifier to do eye verification.

3.1 Support Vector Machine

In this paper, we choose the SVM as the classifying function. One distinctive advantage this type of classifier has over traditional neural networks is that SVMs achieve better generalization performance. The SVM is a pattern classification algorithm developed by V. Vapnik and his team [8]. It is a binary classification method that finds the optimal linear decision surface based on the concept of structural risk minimization. Given a set of N examples:
(x_1, y_1), ..., (x_i, y_i), ..., (x_N, y_N),  x_i \in R^N, y_i \in \{-1, 1\}

In the case of linearly separable data, maximum margin classification aims to separate the two classes with a hyperplane that maximizes the distance to the support vectors. This Optimal Separating Hyperplane can be expressed as the following formula:

f(x) = \sum_{i=1}^{N} \alpha_i y_i (x_i^T x) + b    (5)

This solution is defined in terms of a subset of the training samples (the support vectors) whose \alpha_i is non-zero.
In the case of linearly non-separable patterns, the SVM performs a non-linear mapping of the input vector into a high-dimensional dot product space F. In general, however, the dimension of the feature space is very large, so we have the technical problem of computing in high-dimensional spaces. The kernel method gives the solution to this problem. In Eq. (5), substituting \varphi^T(x_i)\varphi(x) for x_i^T x leads to the following formula:

f(x) = \mathrm{sgn}\left[ \sum_{i=1}^{N} y_i \alpha_i \varphi^T(x_i) \varphi(x) + b \right]    (6)

This kernel method is backed up by Mercer's theorem. Thus the formula for the non-linear SVM with a kernel is

f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b    (7)

The requirement on the kernel k(x_i, x) is to satisfy Mercer's theorem. Within this requirement there are some possible inner product kernels. There are Gaussian Radial Basis Functions (RBFs), polynomial functions, and sigmoid polynomials whose decision surfaces are known to have good approximation properties. In this paper, we choose the Gaussian radial basis function as the kernel function.

3.2 Eye Verification

We apply the SVM classifier to verify the eye candidates. The training data used for generating the eye verification SVM consists of 600 images of each class (eye and non-eye). Selection of proper non-eye images is very important for training the SVM because the performance of the SVM is influenced by what kind of non-eye images is used. In the initial stage of training the SVM, we use non-eye images similar to eyes such as eyebrows, nostrils and other eye-like patches, and we generate further non-eye images using the bootstrapping method [9]. Fig. 4 shows some training examples.
Fig. 4. Training samples. (a) positive samples. (b) negative samples.
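A possible realisation of such a verifier with scikit-learn is sketched below; the fixed patch size, the raw-pixel features and the RBF parameters are illustrative assumptions, not values from the paper. It assumes all training patches and candidate windows have been resized to the same dimensions.

```python
import numpy as np
from sklearn.svm import SVC

def train_eye_verifier(eye_patches, non_eye_patches):
    """Train an RBF-kernel SVM on grey-level patches (eye vs. non-eye)."""
    X = np.vstack([p.ravel() for p in eye_patches + non_eye_patches]) / 255.0
    y = np.hstack([np.ones(len(eye_patches)), -np.ones(len(non_eye_patches))])
    clf = SVC(kernel="rbf", gamma="scale", C=1.0)   # assumed hyper-parameters
    return clf.fit(X, y)

def verify_candidates(clf, face, boxes):
    """Keep only the candidate windows the SVM labels as eyes."""
    feats = [face[y:y + h, x:x + w].ravel() / 255.0 for (x, y, w, h) in boxes]
    labels = clf.predict(np.vstack(feats))
    return [b for b, lab in zip(boxes, labels) if lab > 0]
```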
4 Mouth Detection

Mouth location is also an important part of facial expression recognition and face state recognition. After the real eyes are obtained, the mouth is located. Firstly, the mouth region is calculated from the coordinates of the left eye and the right eye so that the search region for mouth detection is effectively reduced. On this basis, precise mouth detection is performed.
4.1 Mouth Search Region Extraction

A mouth search region is specified by the positions of the detected eyes using the geometric information of a face. That is, the eyes and mouth are located statistically as shown in Fig. 5 [10].
Fig. 5. The geometry relationship between eyes and mouth
The mouth search region is represented with two coordinates (M_left, M_top) and (M_right, M_bottom) by equation (8).
M_left   = 0.965 x_left + 0.035 x_right
M_right  = 0.035 x_left + 0.965 x_right
M_top    = y_eye + 0.64 (x_right - x_left)
M_bottom = y_eye + 1.44 (x_right - x_left)    (8)

where y_eye = (y_left + y_right) / 2.
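Eq. (8) translates directly into code; the sketch below assumes the eye centres are given as (x, y) pixel coordinates.

```python
def mouth_search_region(left_eye, right_eye):
    """Mouth search region from the two detected eye centres (Eq. 8).

    left_eye, right_eye: (x, y) coordinates of the eye centres.
    Returns (M_left, M_top, M_right, M_bottom) in image coordinates.
    """
    x_left, y_left = left_eye
    x_right, y_right = right_eye
    y_eye = (y_left + y_right) / 2.0
    m_left = 0.965 * x_left + 0.035 * x_right
    m_right = 0.035 * x_left + 0.965 * x_right
    m_top = y_eye + 0.64 * (x_right - x_left)
    m_bottom = y_eye + 1.44 * (x_right - x_left)
    return m_left, m_top, m_right, m_bottom
```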
In Fig. 6, each white rectangular window on the mouth is the mouth region calculated from the two eye coordinates.
Fig. 6. Mouth region extraction
4.2 Precise Mouth Detection

Once the mouth search region is extracted, precise mouth location can be done by further image processing. Because the mouth has lower pixel values within the mouth search region, using a binary image to segment the mouth is feasible. A maximum filter and a minimum filter are applied to the mouth region image as in Eq. (9).
f' = MinFilter(MaxFilter(f)) - f    (9)
where f is the original image and f' is the differential image. Then thresholding and a closing operation are applied to the differential image and the mouth can be segmented, as shown in Fig. 7(a). The mouth center is located by calculating the gravity center of the connected component. Fig. 7(b) shows some mouth detection results.
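A sketch of this processing chain with SciPy is given below; the filter size, the threshold value and the largest-component heuristic are assumptions of the example, not parameters from the paper.

```python
import numpy as np
from scipy import ndimage

def locate_mouth(mouth_region, threshold=40):
    """Precise mouth location inside the search region: Eq. (9), thresholding,
    morphological closing and centre of gravity of the connected component.

    mouth_region: 2-D grey-level array.  Returns the (row, col) mouth centre,
    or None if no component is found.
    """
    f = mouth_region.astype(np.float64)
    diff = ndimage.minimum_filter(ndimage.maximum_filter(f, size=5), size=5) - f  # Eq. (9)
    mask = diff > threshold
    mask = ndimage.binary_closing(mask, structure=np.ones((3, 3)))
    labels, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    largest = int(np.argmax(sizes)) + 1
    return ndimage.center_of_mass(mask, labels, largest)
```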
Fig. 7. Mouth detection. (a) Mouth segmentation. (b) Precise mouth detection.
5 Experimental Results

The proposed approach was tested on the JAFFE face dataset and the ORL face dataset. The JAFFE database consists of 213 frontal face images. The ORL dataset consists of 400 frontal face images from 40 individuals. The face is first detected, and then the eyes and mouth are detected. To evaluate the precision of eye localization, a scale-independent localization criterion [11] is used. This relative error measure compares the automatic location result with the manually marked locations of each eye. Let C_l and C_r be the manually extracted left and right eye positions, C_l' and C_r' be the detected positions, d_l be the Euclidean distance between C_l' and C_l, d_r be the Euclidean distance between C_r' and C_r, and d_lr be the Euclidean distance between the ground truth eye centers. Then the relative error of this detection is defined as follows:

err = max(d_l, d_r) / d_lr    (10)
Since JAFFE contains only female faces and there is no mustache occlusion, the mouth detection rate is high. When err < 0.1, the eye detection rate is 99.13% and the mouth
detection rate is 99.32% based on eye detection. Our algorithm outperforms those of Ref. [12] and Ref. [13]. Some detection results are shown in Fig. 8. However, some faces in the ORL dataset contain glasses and mustaches. When the glare on the glasses is too strong the eye detection fails, and when the occlusion of the mouth is heavy the mouth detection fails. When err < 0.1, the eye detection rate is 90.67% and the mouth detection rate is 97.76% based on eye detection. Fig. 9 shows some detection results.
Fig. 8. Detection results on JAFFE dataset
Fig. 9. Detection results on ORL dataset
6 Conclusions and Future Research

In this paper we present an efficient facial feature detection method based on entropy and SVM. Experimental results show that the entropy measure can extract eye candidates effectively. Based on the precise eye detection, the mouth search region can be calculated from the coordinates of the two eyes. This makes mouth detection much easier. The experimental results demonstrate its efficiency. Future work will focus on resolving occlusions on faces to improve the algorithm performance.
Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 60503026, 60632050), and the Project of Science and Technology Plan of Jiangsu Province (Grant No. BG2005008).
References

1. Cristinacce, D., Cootes, T.: Facial Feature Detection Using AdaBoost with Shape Constraints. In: Proceedings of British Machine Vision Conference, pp. 231–240 (2003)
2. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: Proceedings of Computer Vision and Pattern Recognition Conference, vol. 1, pp. 511–518 (2001)
3. Dryden, I., Mardia, K.V.: The Statistical Analysis of Shape. Wiley, London (1998)
4. D'Orazio, T., Leo, M., Cicirelli, G., Distante, A.: An Algorithm for Real Time Eye Detection in Face Images. In: Proceedings of 17th International Conference on Pattern Recognition, vol. 3, pp. 278–281 (2004)
5. Du, S., Ward, R.: A Robust Approach for Eye Localization Under Variable Illuminations. In: Proceedings of International Conference on Image Processing, vol. 1, pp. 377–380 (2007)
6. Toews, M., Arbel, T.: Entropy-of-likelihood Feature Selection for Image Correspondence. In: Proceedings of 9th International Conference on Computer Vision, vol. 2, pp. 1041–1047 (2003)
7. Shannon, C.E., Waver, W.: A Mathematical Theory of Communication. Bell System Technical Journal 27, 379–423 (1948)
8. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
9. Sung, K.K., Poggio, T.: Example-based Learning for View-based Human Face Detection. IEEE Trans. Pattern Anal. Mach. Intell. 20(1), 39–51 (1998)
10. Oh, J.S., Kim, D.W., et al.: Facial Component Detection for Efficient Facial Characteristic Point Extraction. In: Kamel, M.S., Campilho, A.C. (eds.) ICIAR 2005. LNCS, vol. 3656, pp. 1125–1132. Springer, Heidelberg (2005)
11. Jesorsky, O., Kirchberg, K.J., Frischholz, R.W.: Robust Face Detection Using the Hausdorff Distance. In: Bigun, J., Smeraldi, F. (eds.) AVBPA 2001. LNCS, vol. 2091, pp. 90–95. Springer, Heidelberg (2001)
12. Zhou, Z.H., Geng, X.: Projection Functions for Eye Detection. Pattern Recognition 37, 1049–1056 (2004)
13. Ma, Y., Ding, X.Q., et al.: Robust Precise Eye Location under Probabilistic Framework. In: Proceedings of FGR, pp. 339–344 (2004)
Type-2 Fuzzy Mixture of Gaussians Model: Application to Background Modeling

Fida El Baf, Thierry Bouwmans, and Bertrand Vachon

Department of Mathematics, Images and Applications, University of La Rochelle, France
[email protected]
Abstract. Background modeling is a key step of background subtraction methods used in the context of a static camera. The goal is to obtain a clean background and then detect moving objects by comparing it with the current frame. The Mixture of Gaussians Model [1] is the most popular technique but presents some limitations when dynamic changes occur in the scene like camera jitter, illumination changes and movement in the background. Furthermore, the MGM is initialized using a training sequence which may be noisy and/or insufficient to model the background correctly. All these critical situations generate false classifications in the foreground detection mask due to the related uncertainty. To take into account this uncertainty, we propose to use a Type-2 Fuzzy Mixture of Gaussians Model. Results show the relevance of the proposed approach in the presence of camera jitter, waving trees and water rippling.
1 Introduction

The common approach for discriminating moving objects from the background is background subtraction, which is used in the fields of video surveillance [2], optical motion capture [3,4,5] and multimedia applications [6]. In this context, background modeling is the first key step to obtain a clean background. The simplest way to model the background is to acquire a background image which doesn't include any moving object. In some environments, the background isn't available and can change at any time under critical situations like camera jitter, illumination changes, and objects being introduced or removed from the scene. To take into account these problems of robustness and adaptation, many background modeling methods have been developed and the most recent surveys can be found in [2,7,8]. These background modeling methods can be classified in the following categories: Basic Background Modeling [9,10,11], Statistical Background Modeling [12,1,13], Fuzzy Background Modeling [14,15] and Background Estimation [16,17,18]. The models most used are the statistical ones: the first way to represent the background statistically is to assume that the history over time of the intensity values of a pixel can be modeled by a single Gaussian [9]. However, a unimodal model cannot handle dynamic backgrounds when there are waving trees, water rippling or moving algae. To solve this problem, the Mixture of Gaussians Model (MGM) has been used to model dynamic backgrounds [1]. This model has some disadvantages. A background having fast variations cannot be accurately modeled with just a few Gaussians (usually 3 to 5), causing problems for sensitive detection. So,
a non-parametric technique was developed for estimating background probabilities at each pixel from many recent samples over time using kernel density estimation [13], but it is time consuming. Finally, due to a good compromise between robustness and time/memory requirements, the MGM is the most used. In the MGM initialization, an expectation-maximization (EM) algorithm is used to estimate the MGM parameters from a training sequence according to the maximum-likelihood (ML) criterion. The MGM is completely certain once its parameters are specified. However, because of insufficient or noisy data in the training sequence, the MGM may not accurately reflect the underlying distribution of the observations according to the ML estimation. It may seem problematical to use likelihoods that are themselves precise real numbers to evaluate an MGM with uncertain parameters. To solve this problem, we propose to model the background by using a Type-2 Fuzzy Mixture of Gaussians Model (T2 FMGM). The T2 FMGM was recently developed by Zeng et al. [19] to introduce descriptions of uncertain parameters in the MGM and has proved its superiority in pattern classification [19]. The rest of this paper is organized as follows: In Section 2, we present the basis of the T2 FMGM. Then, the T2 FMGM is used for background modeling in Section 3. In Section 4, experiments on indoor and outdoor scenes show that the T2 FMGM outperforms the crisp MGM when dynamic changes occur.
2 Type-2 Fuzzy Mixture of Gaussians Model

This section reviews the principle of the T2 FMGM developed in [19]. Mixture of Gaussians Models (MGMs) have been widely used in density modelling and clustering. They have universal approximation ability because they can model any density function closely provided that they contain enough mixture components. The MGM is composed of K mixture components of multivariate Gaussians as follows:

P(o) = \sum_{i=1}^{K} \omega_i \, \eta(o, \mu_i, \Sigma_i)    (1)

where \sum_{i=1}^{K} \omega_i = 1 and \omega_i > 0. The multivariate Gaussian distribution is:

\eta(o, \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (o - \mu)^T \Sigma^{-1} (o - \mu) \right)    (2)

where \mu is the mean vector, \Sigma is the covariance matrix and d is the dimensionality of o. For the sake of simplicity, \Sigma is considered as a diagonal covariance matrix. So, the MGM (Equation 1) is expressed as a linear combination of multivariate Gaussian distributions. To take into account the uncertainty, Zeng et al. [19] proposed recently T2 membership functions to represent a multivariate Gaussian with uncertain mean vector or covariance matrix, and replace the corresponding parts in Equation (1) to produce the T2 FMGM with uncertain mean vector (T2 FMGM-UM) or uncertain variance (T2 FMGM-UV). Given a d-dimensional observation vector o, the mean vector \mu, and the
diagonal covariance matrix \Sigma = diag(\sigma_1^2, ..., \sigma_d^2), the multivariate Gaussian with uncertain mean vector is:

\eta(o, \tilde{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} \left( \frac{o_1 - \mu_1}{\sigma_1} \right)^2 \right) \cdots \exp\left( -\frac{1}{2} \left( \frac{o_d - \mu_d}{\sigma_d} \right)^2 \right),    (3)

\mu_1 \in [\underline{\mu}_1, \overline{\mu}_1], ..., \mu_d \in [\underline{\mu}_d, \overline{\mu}_d]
In the case of an uncertain covariance matrix, it is defined as:

\eta(o, \mu, \tilde{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} \left( \frac{o_1 - \mu_1}{\sigma_1} \right)^2 \right) \cdots \exp\left( -\frac{1}{2} \left( \frac{o_d - \mu_d}{\sigma_d} \right)^2 \right),    (4)

\sigma_1 \in [\underline{\sigma}_1, \overline{\sigma}_1], ..., \sigma_d \in [\underline{\sigma}_d, \overline{\sigma}_d]

where \tilde{\mu} and \tilde{\sigma} denote the uncertain mean vector and covariance matrix respectively. Because there is no prior knowledge about the parameter uncertainty, Zeng et al. [19] assume in practice that the mean and standard deviation vary within intervals with uniform possibilities, i.e., \mu \in [\underline{\mu}, \overline{\mu}] or \sigma \in [\underline{\sigma}, \overline{\sigma}]. Each exponential component in Equation 3 and Equation 4 is the Gaussian primary membership function (MF) with uncertain mean or standard deviation as shown in Fig. 1. The shaded region is the footprint of uncertainty (FOU). The thick solid and dashed lines denote the lower and upper MFs. In the Gaussian primary MF with uncertain mean, the upper MF is:

\overline{h}(o) = \begin{cases} f(o; \underline{\mu}, \sigma), & \text{if } o < \underline{\mu} \\ 1, & \text{if } \underline{\mu} \le o < \overline{\mu} \\ f(o; \overline{\mu}, \sigma), & \text{if } o > \overline{\mu} \end{cases}    (5)

where f(o; \underline{\mu}, \sigma) = \exp\left( -\frac{1}{2} \left( \frac{o - \underline{\mu}}{\sigma} \right)^2 \right) and f(o; \overline{\mu}, \sigma) = \exp\left( -\frac{1}{2} \left( \frac{o - \overline{\mu}}{\sigma} \right)^2 \right). The lower MF is:

\underline{h}(o) = \begin{cases} f(o; \overline{\mu}, \sigma), & \text{if } o \le \frac{\underline{\mu} + \overline{\mu}}{2} \\ f(o; \underline{\mu}, \sigma), & \text{if } o > \frac{\underline{\mu} + \overline{\mu}}{2} \end{cases}    (6)

In the Gaussian primary MF with uncertain standard deviation, the upper MF is \overline{h}(o) = f(o; \mu, \overline{\sigma}) and the lower MF is \underline{h}(o) = f(o; \mu, \underline{\sigma}). The factors k_m and k_\nu control the intervals in which the parameters vary as follows:

\underline{\mu} = \mu - k_m \sigma, \quad \overline{\mu} = \mu + k_m \sigma, \quad k_m \in [0, 3],    (7)

\underline{\sigma} = k_\nu \sigma, \quad \overline{\sigma} = \frac{1}{k_\nu} \sigma, \quad k_\nu \in [0.3, 1].    (8)
Because a one-dimensional Gaussian has 99.7% of its probability mass in the range [\mu - 3\sigma, \mu + 3\sigma], Zeng et al. [19] constrain k_m \in [0, 3] and k_\nu \in [0.3, 1]. These factors also control the area of the FOU: the bigger k_m or the smaller k_\nu, the larger the FOU, which implies the greater uncertainty.
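As an illustration of the bounds used later for the match test, the sketch below evaluates the upper and lower MFs of Eqs. (5)-(7) for a scalar observation; it is an assumed example, not the authors' code. The length of the log-likelihood interval used in Section 3.2 is then ln(upper) - ln(lower).

```python
import numpy as np

def t2_bounds_uncertain_mean(o, mu, sigma, km=2.0):
    """Upper and lower membership values of the Gaussian primary MF with an
    uncertain mean (Eqs. 5-7); km is the uncertainty factor in [0, 3]."""
    mu_lo, mu_hi = mu - km * sigma, mu + km * sigma          # Eq. (7)
    f = lambda x, m: np.exp(-0.5 * ((x - m) / sigma) ** 2)
    if o < mu_lo:                                            # upper MF, Eq. (5)
        upper = f(o, mu_lo)
    elif o < mu_hi:
        upper = 1.0
    else:
        upper = f(o, mu_hi)
    mid = 0.5 * (mu_lo + mu_hi)                              # lower MF, Eq. (6)
    lower = f(o, mu_hi) if o <= mid else f(o, mu_lo)
    return upper, lower
```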
Type-2 Fuzzy Mixture of Gaussians Model: Application to Background Modeling
a)
775
b)
Fig. 1. a): At the left, the Gaussian primary MF with uncertain mean. b): At the right, the Gaussian primary MF with uncertain std having uniform possibilities. The shaded region is the FOU. The thick solid and dashed lines denote the lower and upper MFs. Picture from [19].
3 Application to Background Modeling Each pixel is characterized by its intensity in the RGB color space. So, the observation o is a vector Xt in the RGB space and d = 3. Then, the MGM is composed of K mixture components of multivariate Gaussian as follows: P (Xt ) =
k
ωi,t η Xt , μi,t , i,t
(9)
i=1
where the parameters are K is the number of distributions, ωi,t is a weight associated to the ith Gaussian at time t with mean μi,t and standard deviation i,t . η is a Gaussian probability density function: 1 1 T −1 η (Xt , μ, ) = − (Xt − μ) (Xt − μ) (10) 3 1 exp 2 (2π) 2 | | 2 For the T2 FMGM-UM, the multivariate Gaussian with uncertain mean vector is: 2 1 1 Xt,c − μc η (Xt , μ ˜, ) = exp − (11) 3 1 2 σc (2π) 2 | | 2
with μc ∈ μc , μc and c ∈ {R, G, B}. For the T2 FMGM-UV, the multivariate Gaussian with uncertain variance vector is: 2
1 1 Xt,c − μc ˜ η Xt , μ, = exp − (12) 3 1 2 σc (2π) 2 | | 2 where σc ∈ [σ c , σ c ] and c ∈ {R, G, B}. Both the T2 FMGM-UM and T2 FMGM-UV can be used to model the background and we can expect that the T2 FMGM-UM will be more robust than the T2 FMGMUV. Indeed, only the means are estimated and tracked correctly over time in the MGM maintenance. The variance and the weights are unstable and unreliable as explained by Greiffenhagen et al. [20].
3.1 Training

Training a T2 FMGM consists of estimating the parameters \mu, \Sigma and the factor k_m or k_\nu. Zeng et al. [19] set the factors k_m and k_\nu as constants according to prior knowledge. In our work, they are fixed depending on the video (see Section 4). Thus, parameter estimation of the T2 FMGM includes three steps:

– Step 1: Choose K between 3 and 5.
– Step 2: Estimate the MGM parameters by an EM algorithm.
– Step 3: Add the factor k_m or k_\nu to the MGM to produce the T2 FMGM-UM or T2 FMGM-UV.

Once the training is made, a first foreground detection can be processed.

3.2 Foreground Detection

Foreground detection consists in classifying the current pixel as background or foreground. By using the ratio r_j = \omega_j / \sigma_j, we first order the K Gaussians as in [1]. This ordering supposes that a background pixel corresponds to a high weight with a weak variance, due to the fact that the background is more present than moving objects and that its value is practically constant. The first B Gaussian distributions which exceed a certain threshold T are retained as the background distribution:

B = \arg\min_b \left( \sum_{i=1}^{b} \omega_{i,t} > T \right)    (13)
(15)
μ and σ are the mean and the std of the original certain T1 MF without uncertainty. Both (14) and (15) are increasing functions in terms of the deviation |Xt − μ|. For example, given a fixed km , the farther the Xt deviates from μ, the larger H (Xt ) is in (12), which reflects a higher extent of the likelihood uncertainty. This relation-ship accords with the outlier analysis. If the outlier Xt deviates farther from the center of the class-conditional distribution, it has a larger H (Xt ) showing its greater uncertainty to the class model. So, a pixel is ascribed to a Gaussian if: H (Xt ) < kσ
(16)
Type-2 Fuzzy Mixture of Gaussians Model: Application to Background Modeling
777
where k is a constant threshold determined experimentally and equal to 2.5. Then, two cases can occurs: (1) A match is found with one of the K Gaussians. In this case, if the Gaussian distribution is identified as a background one, the pixel is classified as background else the pixel is classified as foreground. (2) No match is found with any of the K Gaussians. In this case, the pixel is classified as foreground. At this step, a binary mask is obtained. Then, to make the next foreground detection, the parameters must be updated. 3.3 Maintenance The T2 FMGM Maintenance is made as in the original MGM [1] as follows: – Case 1: A match is found with one of the K Gaussians. For the matched component, the update is done as follows: ωi,t+1 = (1 − α) ωi,t + α
(17)
where α is a constant learning rate. μi,t+1 = (1 − ρ) μi,t + ρXt+1 2 σi,t+1
= (1 −
2 ρ) σi,t
(18) T
+ ρ (Xt+1 − μi,t+1 ) (Xt+1 − μi,t+1 )
(19)
where ρ = αη (Xt+1 , μi , i ). For the unmatched components, μ and σ are unchanged, only the weight is replaced by ωj,t+1 = (1 − α) ωj,t . – Case 2: No match is found with any of the K Gaussians. In this case, the least probable distribution k is replaced with a new one with parameters: ωk,t+1 = Low Prior Weight μk,t+1 = Xt+1
(20) (21)
2 σk,t+1 = Large Initial Variance
(22)
Once a background maintenance is made, another foreground detection can be processed and so on.
Fig. 2. Background subtraction with illumination changes (PETS 2006 dataset). From left to right respectively, the original image (Frame 165), the segmented image obtained by the MGM, the result obtained using the T2 FMGM-UM and the T2 FMGM-UV.
4 Experimental Results

We have applied the T2 FMGM-UM and T2 FMGM-UV algorithms to indoor and outdoor videos where different critical situations occur, like camera jitter, movement in
Fig. 3. The first row shows the original images, the second row the ground truth images, the third row the results obtained using the MGM, the fourth row the results obtained using the T2 FMGM-UM and the fifth row the results obtained using the T2 FMGM-UV
Fig. 4. Overall performance
Fig. 5. Background subtraction with dynamic background. The first row shows the original frames for the Campus and Water Surface sequences. The second row presents the segmented images obtained by the MGM. The third and fourth rows illustrate the results obtained using the T2 FMGM-UM and the T2 FMGM-UV respectively.

Table 1. Performance analysis

Method        MGM [1]                  T2 FMGM-UM               T2 FMGM-UV
Error Type    False neg.  False pos.   False neg.  False pos.   False neg.  False pos.
Frame 271     0           2093         0           203          0           3069
Frame 373     1120        4124         1414        153          957         1081
Frame 410     4818        2782         6043        252          2217        1119
Frame 465     2050        1589         2520        46           1069        1158
Total Error         18576                    10631                    10670
4.1 Indoor Scene Videos

The PETS 2006 dataset [21] provides several videos presenting indoor sequences in a video surveillance context. In these video sequences, there are illumination changes and shadows. Fig. 2 presents the results obtained by the MGM [1], the T2 FMGM-UM and the T2 FMGM-UV. It can be seen that the results obtained using the T2 FMGM-UM and the T2 FMGM-UV are better than those using the crisp MGM. The silhouettes are well detected with the T2 FMGM-UM. The T2 FMGM-UV is more sensitive because the variance is more unstable over time.

4.2 Outdoor Scene Videos

We have chosen three videos presenting different dynamic backgrounds: camera jitter, waving trees and water rippling. The first outdoor sequence that was tested involved a camera mounted on a tall tripod and comes from [22]. The wind caused the tripod to sway back and forth, causing nominal motion in the scene. In Fig. 3, the first row shows different current images. The second and third rows show respectively the ground truth and the results obtained by the MGM proposed in [1]. It is evident that the motion causes substantial degradation in performance. The fourth and fifth rows show respectively the results obtained by the T2 FMGM-UM and the T2 FMGM-UV. As for the indoor scene, the T2 FMGM-UM and T2 FMGM-UV give better results than the crisp MGM. Numerical evaluation has been done in terms of false positives and false negatives. In Table 1 we can see that the T2 FMGM-UM and T2 FMGM-UV give less total error than the MGM. Furthermore, as shown in Fig. 4, the T2 FMGM-UM generates the fewest false positives, which is important in the context of target detection. We have also tested our method on the sequences Campus and Water Surface which come from [23]. Fig. 5 shows the robustness of the T2 FMGM-UM against waving trees and water rippling.
5 Conclusion

The Type-2 Fuzzy Mixture of Gaussians Model (T2 FMGM) is an elegant technique to model the background and makes it possible to handle critical situations like camera jitter, illumination changes, movement in the background and shadows. The T2 FMGM-UM is more robust than the T2 FMGM-UV due to a better estimation of the mean than of the variance. One future direction of this work is an adaptive version of the proposed method which determines the optimal number of Gaussians dynamically.
References

1. Stauffer, C.: Adaptive background mixture models for real-time tracking. Computer Vision and Pattern Recognition 1, 246–252 (1999)
2. Cheung, S., Kamath, C.: Robust background subtraction with foreground validation for urban traffic video. In: EURASIP 2005 (2005)
3. Carranza, J., Theobalt, C., Magnor, M., Seidel, H.: Free-viewpoint video of human actors. ACM Transactions on Graphics 22, 569–577 (2003)
4. Mikic, I., Trivedi, M., Hunter, E., Cosman, P.: Human body model acquisition and tracking using voxel data. International Journal of Computer Vision 1, 199–223 (2003)
5. Horprasert, T., Haritaoglu, I., Wren, C., Harwood, D., Davis, L., Pentland, A.: Real-time 3d motion capture. In: Workshop on Perceptual User Interfaces (PUI 1998), vol. 1, pp. 87–90 (1998)
6. El Baf, F., Bouwmans, T., Vachon, B.: Comparison of background subtraction methods for a multimedia learning space. In: SIGMAP 2007 (2007)
7. Piccardi, M.: Background subtraction techniques: a review. In: SMC 2004, vol. 1, pp. 3199–3104 (2004)
8. Elhabian, S., El-Sayed, K., Ahmed, S.: Moving object detection in spatial domain using background removal techniques. Recent Patents on Computer Science 1, 32–54 (2008)
9. Lee, B., Hedley, M.: Background estimation for video surveillance. In: IVCNZ 2002, vol. 1, pp. 315–320 (2002)
10. McFarlane, N., Schofield, C.: Segmentation and tracking of piglets in images. In: BMVA 1995, vol. 1, pp. 187–193 (1995)
11. Zheng, J., Wang, Y., Nihan, N., Hallenbeck, E.: Extracting roadway background image: A mode based approach. Journal of Transportation Research Report 1, 82–88 (2006)
12. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: Real-time tracking of the human body. In: PAMI 1997, vol. 19, pp. 780–785 (1997)
13. Elgammal, A., Harwood, D., Davis, L.: Non-parametric model for background subtraction. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, pp. 751–767. Springer, Heidelberg (2000)
14. Sigari, M., Mozayani, N., Pourreza, H.: Fuzzy running average and fuzzy background subtraction: Concepts and application. International Journal of Computer Science and Network Security 8, 138–143 (2008)
15. El Baf, F., Bouwmans, T., Vachon, B.: Fuzzy integral for moving object detection. In: IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2008 (2008)
16. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and practice of background maintenance. In: ICCV 1999 (1999)
17. Messelodi, S., Modena, C., Segata, N., Zanin, M.: A kalman filter based background updating algorithm robust to sharp illumination changes. In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 163–170. Springer, Heidelberg (2005)
18. Chang, R., Ghandi, T., Trivedi, M.: Vision modules for a multi sensory bridge monitoring approach. In: ITSC 2004, vol. 1, pp. 971–976 (2004)
19. Zeng, J., Xie, L., Liu, Z.: Type-2 fuzzy gaussian mixture models. Pattern Recognition (June 13, 2008)
20. Greiffenhagen, M., Ramesh, V., Niemann, H.: The systematic design and analysis cycle of a vision system: A case study in video surveillance. In: IEEE CVPR 2001, vol. 2, p. 704 (2001)
21. http://www.cvg.rdg.ac.uk/PETS2006/data.html
22. http://www.cs.ucf.edu/yaser/backgroundsub.htm
23. http://perception.i2r.astar.edu.sg/bk model/bk index.html
Unsupervised Clustering Algorithm for Video Shots Using Spectral Division

Lin Zhong, Chao Li, Huan Li, and Zhang Xiong

School of Computer Science and Technology, Beihang University, No.37 Xue Yuan Road, Haidian District, Beijing, P. R. China
{zhonglin,lihuan}@cse.buaa.edu.cn, {licc,Xiongz}@buaa.edu.cn
Abstract. A new unsupervised clustering algorithm, the Spectral-division Unsupervised Shot-clustering Algorithm (SUSC), is proposed in this paper. Key-frames are picked out to represent the shots, and color features of the key-frames are extracted to describe video shots. A Spherical Gaussian Model (SGM) is constructed for every shot category to form an effective description of it. Then a Spectral Division (SD) method is employed to divide a category into two categories, and the method is iteratively used for further divisions. After each iterative shot-division, the Bayesian Information Criterion (BIC) is utilized to automatically judge whether to stop further division. During this process, one category may be divided by mistake. In order to correct these mistakes, similar categories are merged by calculating the similarities of every two categories. This approach is applied to three kinds of sports videos, and the experimental results show that the proposed approach is reliable and effective.
1 Introduction

With the development of multi-media and Internet technologies, the amount of video data multiplies exponentially. Therefore, how to analyze and organize video data effectively has become an important research topic. Content-based video retrieval and browsing systems usually include shot division, key-frame extraction, scene analysis and event detection. Shot boundary detection is used to parse video streams into individual shots, and corresponding key-frames are extracted to stand for shots. Shot clustering groups the similar key-frames for further semantic abstraction. It sits between the low-level visual features and high-level semantic concepts, and its performance has significant influence on the results of the following steps. Supervised clustering and unsupervised clustering are the two major clustering methods. The former is more accurate, but training sets are needed to train the classifiers, while the latter, which has the ability to adapt, does not need training sets. However, unsupervised clustering algorithms usually suffer from some problems: the optimal number of clusters is hard to estimate, and results are sensitive to the initialization of cluster centers. Three major methods, developed to find the optimal number of clusters without prior information, are as follows:

1. Enumeration based mechanism [1]. The method searches for the optimal number heuristically by using an appropriate information criterion.
2. Merging based mechanism [2]. The method chooses a number much larger than the optimal number to execute the clustering, then merges similar clusters by calculating the least information entropy lost. 3. K-means based iterative mechanism [3, 4]. The method repeats bi-divisions using K-means until the divisions stop, and finally obtains the optimal number of divisions. The first method is simple and reliable, but its complexity is very high, especially when no prior information is provided. Merging methods have to calculate the information entropy lost at each step, so the convergence rate is slow and the complexity is high. K-means based methods, e.g. X-means, have a strong convergence rate and low complexity. However, they only consider the relationships between different clusters, ignoring the relationships inside each cluster, so it is possible that a cluster is divided by mistake. The Spectral-division Unsupervised Shot-clustering Algorithm is proposed to solve the problems mentioned above. Its central computational module is based on Spectral-division iterations. It divides a category into two categories and is iteratively applied to further divide the results of previous divisions. At the end of each iterative shot division, the Bayesian Information Criterion (BIC) is used to automatically judge whether to stop further division. K, the number of clusters, converges to the optimal number at an exponential rate. The algorithm measures both the relationships between clusters and those inside clusters, and thus avoids most of the wrong divisions which may happen in X-means. Besides, to correct such mistakes, similar categories are merged by calculating the similarities of all possible category pairs. The experiments on three kinds of videos show a better estimation of K and better shot clustering results. The paper is organized as follows. Section 2 presents and extends spectral theory. Section 3 introduces the extraction of low-level features and the quantization of the features for spectral division. The clustering algorithm is developed and analyzed in Section 4. Experiments and comparisons are given in Section 5, and the conclusions are in Section 6.
2 Spectral Theory and Spectral-Division In this section, we first briefly introduce spectral theory and its application to image segmentation. Then, we discuss the way to construct a graph for shot division according to spectral theory, and the steps of spectral division, which is the central computational module of the spectral clustering algorithm, are presented in detail. 2.1 Spectral Theory Spectral theory is a well-known algebraic method, which can convert segmentation problems into graph problems, and it has wide application in division and clustering [5, 6]. Wu and Leahy [7] applied the theory to image segmentation and obtained rewarding results. They constructed a topological model from the pixels in the image, as shown in Figure 1. Shi and Malik [8] proposed the Normalized Cut, which avoids an unnatural bias towards partitioning out small sets of points. Instead of directly calculating the value of
the total weight of edges connecting the two partitions, they computed the cut cost as a fraction of the total edge connections to all the nodes in the graph. Besides, they gave a solution to obtain the best partition, which is an NP-complete problem. The edge weights signify the similarities of the data points and the partition is based on the eigenvectors of the affinity matrix.
Fig. 1. Topological Model. The model is built to segment a picture, and all the pixels in the image are represented as points in a graph, such as i and j. The relationships between pixels are defined as the distances between the corresponding points, and Wij, the weight of the edge eij between i and j, represents the relationship between pixels i and j in the image.
2.2 Spectral-Division Algorithm Spectral-division in this paper applies spectral theory to video shot clustering by constructing graphs from shot sets. The points in the graph stand for the shots, and the weights of the edges between points represent the similarities between the corresponding two shots, as shown in Figure 2. Then, spectral-division employs the solution that uses the eigenvectors of the affinity matrix to partition an image, in order to divide the shots. Form the weighted graph G = (V, E) from the shot set S, where Vi stands for Si and eij signifies the similarity between shot i and shot j. The result of spectral division {A, B} minimizes Ncut [8]:
Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}   (1)

cut(A, B) = \sum_{i \in A, j \in B} e_{ij}, \qquad assoc(A) = \sum_{i \in A, j \in V} e_{ij}   (2)

Fig. 2. Graph model for shot division
The Spectral-division algorithm used to divide a category is summarized as follows:

Step 1. Form the N × N affinity matrix E from the edge weights e_{ij}.

Step 2. Construct the degree matrix D(E) as a diagonal matrix with

d_{ii} = \sum_j e_{ij}   (3)

Step 3. Define the normalized affinity matrix L(E) as

L(E) = (D(E))^{-1/2} E (D(E))^{-1/2}   (4)

Step 4. Divide the graph G according to the sign bit of the eigenvector of L(E) corresponding to the Fiedler value [9] (the second smallest eigenvalue).

The Spectral-division algorithm integrates the distances between and inside clusters, which makes the divisions more accurate. Besides, it converts the minimization problem into finding the eigenvector corresponding to the Fiedler value. What is more, an approximate computation method can be used when calculating this eigenvector, which further decreases the complexity.
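As an illustration of Steps 1–4, a minimal sketch of one spectral bi-division is given below (Python/NumPy); the function name and the returned boolean masks are our own conventions, and the affinity matrix E is assumed to have been built from the shot similarities defined in Section 3.

```python
import numpy as np

def spectral_bidivision(E):
    """Split one category into two using the sign of the eigenvector
    corresponding to the second smallest eigenvalue.

    E : (N, N) symmetric affinity matrix with entries e_ij.
    Returns two boolean masks selecting the shots of the two sub-categories.
    """
    d = E.sum(axis=1)                          # degrees d_ii = sum_j e_ij  (eq. 3)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ E @ D_inv_sqrt            # normalized affinity matrix (eq. 4)

    # Eigen-decomposition of the symmetric matrix L; eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]                    # eigenvector of the second smallest eigenvalue

    part_a = fiedler >= 0                      # split by the sign bit (Step 4)
    part_b = ~part_a
    return part_a, part_b
```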
3 Feature Extraction and Quantization 3.1 Feature Extraction Feature extraction is one of the most fundamental and important issues in video shot categorization, and a large variety of visual features have been exploited to describe video content. Color is one of the low-level physical features that best reflects the visual appearance of images, and it is closely correlated with the objects and scenes in images. Compared with other visual features, it has been extensively studied because of its invariance with respect to image scaling, orientation and perspective [10], and it therefore has higher robustness. The color histogram, which describes the probability distribution of the different colors, is widely used in the image retrieval area. Color is the most apparent feature in sports video and it can describe the information necessary to represent shots. Besides, the performance of color as the major feature for sports video was shown to be encouraging in [11]. Thus, in this paper, we use color features to describe the shots in the experiments. Compared with other color models, the HSV (Hue, Saturation and Value) color space is closer to human perception [12]. Besides, it is nearly perceptually uniform, and the similarity of two HSV colors is determined by their distance, which can be measured by the Euclidean distance between the color vectors in the HSV color space. 3.2 Quantization of Features for Spectral Division In HSV space, we quantize the color space into 36 non-uniform colors [13]. Usually, when we judge the similarity between two images by eye, we first judge them at a coarse scale, so the 36-color non-uniform scheme is close to the vision model of our eyes. In this scheme, hue is divided into seven non-uniform colors: red, orange, yellow, green, cyan, blue and purple. Besides, to better represent the key information in frames, we also divide the S and V panels non-uniformly, as shown in Figure 3.
Fig. 3. (a) The hue dimension is quantized into seven non-uniform colors according to the vision model of human eyes. (b)(c) Regions Ⅰ and Ⅱ can be perceived as a black and a gray area respectively, regardless of H. Only region Ⅲ can be perceived as a color area, which represents more information about the image.
The quantization scheme can be summarized as follows, where h, s, v are the values of a pixel in the Hue, Saturation and Value channels, h', s', v' are the values after quantization, and l is the final value of a pixel in a frame.

For v \in [0, 0.2), it is a black area:

l = 0   (5)

For s \in [0, 0.2) and v \in [0.2, 0.8), it is a gray or even white area:

l = \lfloor (v - 0.2) \times 10 \rfloor + 1   (6)

For s \in (0.2, 1.0] and v \in (0.2, 1.0], it is a color (most useful) area:

h' = \begin{cases} 0 & \text{if } h \in (330, 22] \\ 1 & \text{if } h \in (22, 45] \\ 2 & \text{if } h \in (45, 70] \\ 3 & \text{if } h \in (70, 155] \\ 4 & \text{if } h \in (155, 186] \\ 5 & \text{if } h \in (186, 278] \\ 6 & \text{if } h \in (278, 330] \end{cases}   (7)

s' = \begin{cases} 0 & \text{if } s \in (0.2, 0.65] \\ 1 & \text{if } s \in (0.65, 1] \end{cases}, \qquad v' = \begin{cases} 0 & \text{if } v \in (0.2, 0.7] \\ 1 & \text{if } v \in (0.7, 1] \end{cases}

The final value of l is

l = 4h' + 2s' + v' + 8   (8)
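A minimal sketch of this 36-bin quantization for a single HSV pixel (h in degrees, s and v in [0, 1]) is given below; the function name is ours, and pixels with low saturation but high value, which the scheme does not cover explicitly, are treated here as color pixels.

```python
def quantize_hsv(h, s, v):
    """Map one HSV pixel (h in degrees, s and v in [0,1]) to its bin l in 0..35."""
    if v < 0.2:                                   # black area, eq. (5)
        return 0
    if s < 0.2 and v < 0.8:                       # gray/white area, eq. (6)
        return int((v - 0.2) * 10) + 1
    # color area, eqs. (7)-(8)
    hue_edges = [22, 45, 70, 155, 186, 278, 330]  # upper bounds for h' = 0..6
    h_prime = 0                                   # default covers the wrap-around bin (330, 22]
    for i in range(len(hue_edges) - 1):
        if hue_edges[i] < h <= hue_edges[i + 1]:
            h_prime = i + 1
    s_prime = 0 if s <= 0.65 else 1
    v_prime = 0 if v <= 0.7 else 1
    return 4 * h_prime + 2 * s_prime + v_prime + 8
```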
Therefore, in this quantization scheme, a color histogram of 36 bins is produced to describe a frame. The non-uniform character is closer to the human vision model. What is more, its results vary little when the illumination, perspective or shadow varies. In our experiments, the 36-bin HSV histogram is employed and, for each shot, we use the average HSV histogram of all its frames to represent the shot. H_i and H_j are the HSV histograms of S_i and S_j respectively, and the edge e_{ij} in graph G is defined as follows:

e_{ij} = \exp\left( \frac{-\| H_i - H_j \|^2}{2\sigma^2} \right)   (9)
In the standard spectral clustering algorithm, σ is the time parameter between shots S_i and S_j, and it controls how rapidly the affinity measure e_{ij} between S_i and S_j decreases as the Euclidean distance between them increases. However, the time between two shots has little influence on their similarity, so in the experiments we set σ to the constant value 0.15 [14].
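For illustration, the affinity matrix of equation (9) can be computed from the per-shot average histograms as in the short sketch below (Python/NumPy); the variable names are ours and sigma defaults to the constant value 0.15 used in the paper.

```python
import numpy as np

def affinity_matrix(histograms, sigma=0.15):
    """histograms: (N, 36) array, one average HSV histogram per shot.

    Returns the (N, N) affinity matrix with e_ij = exp(-||H_i - H_j||^2 / (2 sigma^2)).
    """
    diff = histograms[:, None, :] - histograms[None, :, :]   # pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)                     # squared Euclidean distances
    return np.exp(-sq_dist / (2.0 * sigma ** 2))
```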
4 Unsupervised Clustering Algorithm Based on Spectral Division In the algorithm we use the BIC criterion [3], based on posterior probability, to judge the necessity of a bi-division and the connectivity of two divisions. In probability theory and statistics, BIC is often used to score the ability of a model to describe a data set. BIC has some advantages over common information criteria: 1. BIC uses Spherical Gaussians, which can better describe the actual distribution of the data sets. 2. BIC is based on the posterior probabilities of the models rather than on the distance between two distributions, as common information criteria are, so it can be a more reliable criterion to score the description abilities of the models.
Thus, BIC is a good choice to judge when to stop further division in the algorithm. We assume S is the shot set to be clustered, Q is a temporary queue used to store shot clusters, and RQ is used to record the final results of the clustering. Q and RQ are initialized to be empty.

Step 1. Employ Spectral-Division to divide S into two sub-sets S_1, S_2 (k is assumed by default to be greater than 2). Insert S_1, S_2 at the end of Q.

Step 2. Assume S_i is the head of Q and that its distribution can be described by a Spherical Gaussian, so the distribution of the samples X = \{x_i : i = 1, ..., R\} in shot set S_i is formulated as follows:

f(\mu_i, V_i; X) = (2\pi)^{-p/2} |V_i|^{-1/2} \exp\left[ -\frac{1}{2} (x - \mu_i)' V_i^{-1} (x - \mu_i) \right]   (10)

The value of BIC(k = 1) is

BIC(k = 1) = 2 \log L(\hat{\mu}_i, \hat{V}_i; x_i \in S_i) - M \log R   (11)

\hat{\mu}_i and \hat{V}_i are the maximum likelihood estimates of the mean and variance of the M-dimensional normal distribution respectively, and M is the total number of parameters. L is the likelihood function corresponding to f, and R is the number of samples in S_i.

Step 3. Employ Spectral-Division to divide S_i into two sub-sets, S_i^{(1)}, S_i^{(2)}. So the Gaussian distributions of X (x_i^{(1)}, x_i^{(2)}) are:

x_i^{(1)} \sim (\mu_i^{(1)}, V_i^{(1)}), \quad x_i^{(2)} \sim (\mu_i^{(2)}, V_i^{(2)})   (12)

The value of BIC(k = 2) is

BIC(k = 2) = 2 \log[ L(\hat{\mu}_i^{(1)}, \hat{V}_i^{(1)}) \cdot L(\hat{\mu}_i^{(2)}, \hat{V}_i^{(2)}) ] - 2M \log R   (13)

The total number of parameters now becomes 2M.
Step 4. If BIC(k = 2) > BIC(k = 1), the set S_i is more likely to be two sets and the division is valid; in this case, insert S_i^{(1)}, S_i^{(2)} into Q. Otherwise, the division is invalid and S_i is inserted into RQ. If Q is not empty, go to Step 2.

Step 5. Assume the sets in RQ are S' = (S'_1, S'_2, ..., S'_k), and calculate the values of BIC(k = 2) and BIC(k = 1) of S'_i \cup S'_j for each pair i, j (i, j = 1, ..., k, i \neq j). If BIC(k = 2) > BIC(k = 1), S'_i and S'_j are connected. Merge all the sets which are connected and output the result of the clustering and the number of clusters.
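The split test of Steps 2–4 can be sketched as follows (Python/NumPy); the spherical-Gaussian log-likelihood and the BIC penalty follow equations (10)–(13), while the helper names and the way the two halves are passed in are our own illustrative choices.

```python
import numpy as np

def spherical_gaussian_loglik(X):
    """Log-likelihood of samples X (R, p) under a single spherical Gaussian
    with maximum-likelihood mean and (scalar) variance."""
    R, p = X.shape
    mu = X.mean(axis=0)
    var = np.sum((X - mu) ** 2) / (R * p) + 1e-12     # spherical variance estimate
    return -0.5 * R * p * (np.log(2 * np.pi * var) + 1.0)

def keep_division(X1, X2, M):
    """Return True if splitting the parent set into X1 and X2 is accepted,
    i.e. BIC(k=2) > BIC(k=1) as in Step 4. M is the number of model parameters."""
    X = np.vstack([X1, X2])
    R = X.shape[0]
    bic1 = 2 * spherical_gaussian_loglik(X) - M * np.log(R)              # eq. (11)
    bic2 = 2 * (spherical_gaussian_loglik(X1) +
                spherical_gaussian_loglik(X2)) - 2 * M * np.log(R)       # eq. (13)
    return bic2 > bic1
```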
5 Implementation and Evaluation To test the performance of the proposed algorithm, we perform experiments on three kinds of videos: basketball, football and badminton, which come from the NBA, the EPL and the Uber Cup respectively. To ensure the accuracy of the test data, we first label the shots by hand and assume these labels are the ground truth. We categorize the shots into four types: global, close-up, zoom-in and auditorium. The results of the 1105 hand-labeled shots are shown in Table 1. Representative images of the five categories in the football video are shown in Figure 4, where Close-up-1 denotes close-up shots with a lawn background and Close-up-2 denotes close-up shots with an auditorium background. In Figure 5, the categorized shots in basketball and badminton are shown, and they are both labeled with four categories. Table 1. Experimental video data and categories
Type of video | Number of shots | Categories
Basketball    | 323 | c1: global (80); c2: close-up (125); c3: zoom-in (73); c4: auditorium (45)
Football      | 523 | c1: global (161); c2: zoom-in (151); c3: close-up with lawn background (172); c4: close-up with auditorium background (29); c5: auditorium (10)
Badminton     | 259 | c1: global (98); c2: close-up (112); c3: zoom-in (25); c4: auditorium (24)
Fig. 4. Representative images of categories in Football
Fig. 5. Representative images of categories in basketball and Badminton
The experiments evaluate the algorithm in two aspects: 1. the optimal number of clusters; 2. the validity of the final clusters. The results of the algorithm are compared with the results of X-means. In the experiments, recall and precision are used to evaluate the performance of the algorithm. As in most information retrieval scenarios, recall is defined as the number of shots correctly categorized into a cluster divided by the total number of shots which should be categorized into that cluster, and precision is defined as the number of shots correctly categorized into a cluster divided by the total number of shots in that cluster. They reflect the validity and the accuracy of the algorithm respectively. Table 2. The result of SUSC with football video
Results | C1  | C2  | C3  | C4 | C5 | Precision/% | Recall/%
R1      | 156 | 4   | 1   | 0  | 0  | 96.89 | 94.55
R2      | 6   | 140 | 5   | 0  | 0  | 92.72 | 93.96
R3      | 3   | 5   | 164 | 0  | 0  | 95.35 | 96.47
R4      | 0   | 0   | 0   | 27 | 2  | 93.10 | 96.43
R5      | 0   | 0   | 0   | 1  | 9  | 90.00 | 81.82
Average |     |     |     |    |    | 93.61 | 92.65
The clustering results of the 523 football shots of Table 1 are shown in detail in Tables 2 and 3, in which Ci represents the shots hand-labeled with cluster i, Ri represents the shots categorized into cluster i by the algorithm, and the numbers in boldface are the numbers of shots categorized into the correct clusters. From Table 2 we can see that the shots are categorized into 5 clusters by SUSC (8 clusters before merging), while 8 clusters are produced by X-means, which is far more than the actual number. The average precision of SUSC is 93.61% and the average recall is 92.65%, which are higher than the 90.32% and 88.65% obtained with X-means. Therefore, SUSC performs better than X-means in estimating the number of clusters and obtains more valid and accurate clustering results.
Table 3. The result of X-means with football video

Results | C1  | C2  | C3  | C4 | C5 | Precision/% | Recall/%
R1      | 102 | 7   | 5   | 0  | 0  | 91.93 | 89.70
R2      | 25  | 0   | 0   | 0  | 0  |       |
R3      | 21  | 1   | 0   | 0  | 0  |       |
R4      | 8   | 112 | 7   | 0  | 0  | 88.74 | 89.33
R5      | 2   | 22  | 0   | 0  | 0  |       |
R6      | 7   | 8   | 157 | 0  | 0  | 91.28 | 92.90
R7      | 0   | 0   | 0   | 26 | 3  | 89.66 | 96.30
R8      | 0   | 0   | 0   | 1  | 9  | 90.00 | 75.00
Average |     |     |     |    |    | 90.32 | 88.65
Table 4. The results after merging

Type of video | Results | Precision/% SUSC | Precision/% X-means | Recall/% SUSC | Recall/% X-means
Basketball | R1      | 95.00 | 91.25 | 96.20 | 85.88
Basketball | R2      | 97.60 | 92.80 | 98.39 | 94.31
Basketball | R3      | 94.52 | 87.67 | 95.83 | 90.14
Basketball | R4      | 95.56 | 82.22 | 89.58 | 84.09
Basketball | Average | 95.67 | 88.49 | 95.00 | 88.61
Football   | R1      | 96.89 | 91.93 | 94.55 | 89.70
Football   | R2      | 92.72 | 88.74 | 93.96 | 89.33
Football   | R3      | 95.35 | 91.28 | 96.47 | 92.90
Football   | R4      | 93.10 | 89.66 | 96.43 | 96.30
Football   | R5      | 90.00 | 90.00 | 81.82 | 75.00
Football   | Average | 93.61 | 90.32 | 92.65 | 88.65
Badminton  | R1      | 94.56 | 90.2  | 94.45 | 86.89
Badminton  | R2      | 94.49 | 90.41 | 94.58 | 89.85
Badminton  | R3      | 93.34 | 86.34 | 94.46 | 91.43
Badminton  | R4      | 94.15 | 86.54 | 92.75 | 90.29
Badminton  | Average | 94.14 | 88.37 | 94.06 | 89.62
In Table 3, we can see that most of the mistakes happen between two pairs of shot categories: global and zoom-in, and close-up and auditorium. Looking into the wrongly categorized shots between global and zoom-in, we find that the players take little space in some zoom-in frames, and these shots are easily categorized as global shots because their color histograms are similar to those of global shots. As for close-up and
auditorium, we see that the auditorium takes much of the space in some close-up and zoom-in shots. Since SUSC employs the HSV color histogram to represent the low-level features of shots, it is hard to categorize these particular shots correctly. However, such shots are very rare and have little influence on the final results, therefore the color feature is effective in the experiments. In the experiments on basketball and badminton, SUSC correctly obtains the optimal number of clusters while X-means obtains approximately twice as many. After merging the clusters produced by X-means by hand, we compare the recall and precision of the two methods. The results are shown in Table 4, and we can see that SUSC performs better in both recall and precision on the three kinds of video.
6 Conclusions By exploiting spectral theory, we propose Spectral-Division to divide a category. The new unsupervised shot-clustering algorithm based on Spectral Division proposed in this paper achieves excellent performance in video shot clustering. Furthermore, BIC is employed to control the process automatically, and we can obtain the optimal number of clusters by merging similar clusters, which is a difficult problem in unsupervised clustering algorithms. We present a comparison of SUSC and X-means for the classification of sports videos, and the results of our experiments on 1105 shots show that SUSC performs better than X-means in both accuracy and validity.
References 1. Wang, P., Liu, Z.Q., Yang, S.Q.: Investigation on unsupervised clustering algorithms for video shot categorization. Soft Computing, 355–360 (2007) 2. Krishnapuram, R., Freg, C.P.: Fitting an unknown number of lines and planes to image data through compatible cluster merging. Pattern Recognition, 385–400 (1992) 3. Pelleg, D., Moore, A.: X-means: Extending k-means with efficient estimation of the number of clusters. In: 17th International Conf. on Machine Learning, pp. 727–734 (2000) 4. Ishioka, T.: An expansion of x-means for automatically determining the optimal number of clusters - Progressive iterations of k-means and merging of the clusters. In: IASTED International Conference on Computational Intelligence, pp. 91–96 (2005) 5. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in NIPS 14, pp. 849–856. MIT Press, Cambridge (2001) 6. Shawe-Taylor, J., Cristianini, N., Kandola, J.: On the concentration of spectral properties. In: Advances in NIPS 14, pp. 511–517. MIT Press, Cambridge (2001) 7. Wu, Z., Leahy, R.: An Optimal Graph Theoretic Approach to Data Clustering: Theory and Its Application to Image Segmentation. IEEE PAMI, 1101–1113 (1993) 8. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000) 9. Fiedler, M.: A Property of Eigenvectors of Nonnegative Symmetric Matrices and Its Application to Graph Theory. Czech. Math. J. 25, 619–637 (1975) 10. Duan, L.Y., Xu, M., et al.: A mid-level representation framework for semantic sports video analysis. In: Proc. of the 11th ACM International Conference on Multimedia, Berkeley, CA, USA, November 2-8, pp. 33–44 (2003)
11. Zhong, D., Chang, S.-F.: Structure analysis of sports video using domain models. In: Proceedings of the IEEE International Conference on Multimedia & Expo., Tokyo, Japan, pp. 713–716 (2001) 12. Smith, J.R.: Integrated spatial and feature image system: Retrieval, analysis and compression. PhD dissertation, Columbia University, New York (1997) 13. Lei, Z., Fuzong, L., Bo, Z.: A CBIR Method Based on Color-Spatial Feature. In: Proceedings of the IEEE Region 10 Conference, pp. 166–169 (1999) 14. Odobez, J.-M., Gatica-Perez, D., Guillemot, M.: On spectral methods and the structuring of home videos. IDIAP Technical Report, IDIAP-RR-55 (November 2002)
Noise Analysis of a SFS Algorithm Formulated under Various Imaging Conditions Amal A. Farag, Shireen Y. Elhabian, Abdelrehim H. Ahmed, and Aly A. Farag University of Louisville
Abstract. Many different shape from shading (SFS) algorithms have emerged during the last three decades. Recently, we proposed [1] a unified framework that is capable of solving the SFS problem under various settings of imaging conditions, representing the image irradiance equation of each setting as an explicit Partial Differential Equation (PDE). However, the result of any SFS algorithm is mainly affected by errors in the given image brightness, due either to image noise or to modeling errors. In this paper, we are concerned with quantitatively assessing the degree of robustness of our unified approach with respect to these errors. Experimental results have revealed promising performance on noisy images, but the approach failed to reconstruct the correct shape in the presence of errors in the modeling process. This result emphasizes the need for robust algorithms for surface reflectance estimation to aid SFS algorithms in producing more realistic shapes.
1 Introduction The shape from shading (SFS) problem is a classical computer vision problem which deals with recovering the 3-D shape of an object through the analysis of the brightness variation in a single image of that object. SFS was formally introduced by Horn over 30 years ago [2, 3]. Since then many different approaches have emerged [4, 5, 6, 7] (for a survey see [8, 9]). The SFS problem can be solved by first formulating an imaging model that describes the relation between the surface shape and the image brightness, considering the three components of the problem, which are the camera, the light source and the surface reflectance. After establishing the imaging model, a numerical algorithm has to be developed to reconstruct the shape from the given image. The image brightness is determined by the amount of energy an imaging system (i.e. the camera) receives from the visible area of the surface patch as seen by the camera. This visible area is commonly referred to as the foreshortened area; typically it is the surface area multiplied by the cosine of the angle between the surface normal and the light direction. Hence, there is a relationship between the radiance at a surface point and the irradiance at the corresponding point in the image. In general, the image irradiance equation can be written as follows:

E(x) = R(\hat{n}(x))   (1)

where E(x) is the image irradiance at the surface point x and R(\cdot) is the surface radiance with unit normal vector \hat{n}(x). Generally, the radiance function R depends on the illumination direction, the viewer direction and the surface reflectance, all of which are assumed to be known [2].
Recently, we proposed [1] a unified framework for the shape from shading problem which handles various settings of imaging conditions representing the image irradiance equation of each setting as an explicit Partial Differential Equation (PDE). In this paper, we are focusing on testing the sensitivity of our earlier work against image noise and modeling errors. This paper is organized as follows: section 2 gives a brief discussion of different surface reflectance models, section 3 outlines the unified framework for SFS algorithms as proposed in [1], section 4 illustrates our experimental results and section 5 concludes our work.
2 Surface Reflectance Models Image brightness is the result of the interaction between the surface reflectance properties (also called the surface material) and the light source(s). To fully understand how an image is formed, various models that describe the reflection properties of different surfaces are needed. Surfaces can be classified according to their reflection properties as specular, diffuse, or a combination of the two (hybrid surfaces). Specular reflection, also called interface reflection, is caused by a mirror-like reflection at the surface-air interface. On the other hand, diffuse reflection, also called body reflection, is caused by the penetration of light into the surface internal medium, followed by a sequence of multiple physical processes which include multiple light-material interactions due to microscopic inhomogeneities in the surface medium; the light is then emitted back out of the surface. A perfectly diffuse (i.e. Lambertian) surface appears equally bright from all directions, which is the result of the reflection of all incident light. Accordingly, the scene radiance is the same for all directions regardless of the viewer direction. Despite the simplicity of Lambert's model [28], it has been proven to be an inaccurate approximation to the diffuse component of the surface reflectance. Through a set of experiments carried out on real samples such as plastic, clay and cloth, Oren and Nayar [31] showed that all of these surfaces demonstrate significant deviation from Lambertian behavior, which motivated them to develop a comprehensive reflectance model for rough diffuse surfaces. They assumed that the surface is composed of a collection of long symmetric V-cavities, each having two opposing facets, where the roughness of the surface is specified using a probability density function for the orientation of the facets (see [31] for details). The Oren-Nayar model is based on the surface geometry, whereas the Wolff model [32] is based on the physics of optics. The Lambertian model, and in turn Oren-Nayar, assume that light does not penetrate the surface, whereas Wolff developed a reflection model for smooth surfaces where an inhomogeneous dielectric material is modeled as a collection of scatterers contained in a uniform medium with an index of refraction different from that of air. To obtain a general diffuse reflectance model, Wolff et al. [35] proposed to incorporate the reflectance model of Wolff into the Oren-Nayar model (referred to as the Oren-Nayar-Wolff model [1]), which works for both smooth and rough surfaces. To account for both the diffuse and the specular components of the surface reflectance, Ward [33] proposed a simple formula which is constrained to obey fundamental physical laws such as conservation of energy and reciprocity; Ward experimentally validated his model on real samples collected with a simple reflectometry device.
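For illustration, the widely used simplified (qualitative) form of the Oren-Nayar model can be evaluated as in the sketch below; note that the paper relies on the full model of [31] (and on the Wolff and Ward models), so this function, its name and its parameterization are only an indicative approximation.

```python
import math

def oren_nayar_radiance(theta_i, theta_r, phi_i, phi_r, rho, sigma, E0=1.0):
    """Simplified (qualitative) Oren-Nayar radiance.

    theta_i, theta_r : polar angles of the light and viewer directions (radians)
    phi_i, phi_r     : corresponding azimuth angles (radians)
    rho              : diffuse albedo, sigma : surface roughness (std. of facet slopes)
    With sigma = 0 the expression reduces to the Lambertian term rho/pi * E0 * cos(theta_i).
    """
    s2 = sigma ** 2
    A = 1.0 - 0.5 * s2 / (s2 + 0.33)
    B = 0.45 * s2 / (s2 + 0.09)
    alpha = max(theta_i, theta_r)
    beta = min(theta_i, theta_r)
    return (rho / math.pi) * E0 * math.cos(theta_i) * (
        A + B * max(0.0, math.cos(phi_i - phi_r)) * math.sin(alpha) * math.tan(beta)
    )
```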
3 Unified Framework for SFS Algorithms Depending on the modeling of the surface reflectance, the light source and the camera sensor, many different combinations can be configured, such as a Lambertian surface, a distant point light source and orthographic camera projection, or a non-Lambertian surface, a nearby point light source and perspective projection, etc. The modeling of each combination leads to a different form of the irradiance equation, making it difficult to formulate one generalized imaging model. Several different imaging models are listed in Table 1. In our earlier work [1], the image irradiance equation for each model was derived and formulated as a Hamilton-Jacobi partial differential equation (PDE) with Dirichlet boundary conditions. The Lax-Friedrichs Sweeping (LFS) method [37], based on the Lax-Friedrichs Hamiltonian, was used to approximate the viscosity solutions of static Hamilton-Jacobi equations. The ability to deal with both concave and convex Hamiltonians of any degree of complexity is the main advantage of LFS, and it has been used in [30, 1, 38] to solve the SFS problem for a class of non-Lambertian diffuse surfaces. Due to space limitations, we refer the reader to [1] for more details.

Table 1. The imaging models studied in [1]

Model | Camera projection | Light source | Surface reflectance | Literature
A | orthographic | at infinity | Lambertian | [4, 5, 7, 39]
B | orthographic | at infinity | Oren-Nayar | [39, 23]
C | orthographic | at infinity | Oren-Nayar-Wolff | [23]
D | orthographic | at infinity | Ward | [1, 38]
E | perspective | at optical center of the camera | Lambertian | [40, 41]
F | perspective | at optical center of the camera | Oren-Nayar | [30]
G | perspective | at optical center of the camera | Oren-Nayar-Wolff | [1]
4 Experimental Results We follow the same evaluation methodology described in [3] to quantitatively assess the effect of noise and modeling errors on the result of the SFS algorithm proposed in [3]. 4.1 Effect of Image Noise Model 'F' (perspective camera with the light source located at the optical center and the Oren-Nayar reflectance model) has been selected as a case study to assess the robustness of the proposed approach against noise. Given a synthetic image rendered from Model 'F', noisy images are constructed by adding noise that follows uniform and normal distributions. In order to measure the noise level, the signal to noise ratio (SNR) is used, which is defined as the ratio of the standard deviation of the noise-free image to the standard deviation of the noise.
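With the SNR defined as the ratio of the standard deviation of the noise-free image to that of the noise, constructing a noisy image at a prescribed SNR can be sketched as follows (Python/NumPy); the function name and the uniform-noise parameterization are our own illustrative choices.

```python
import numpy as np

def add_noise(image, snr, kind="gaussian", rng=np.random.default_rng()):
    """Return a copy of `image` with additive noise such that
    std(image) / std(noise) equals the requested SNR."""
    noise_std = image.std() / snr
    if kind == "gaussian":
        noise = rng.normal(0.0, noise_std, size=image.shape)
    else:  # zero-mean uniform noise on [-a, a] has standard deviation a / sqrt(3)
        a = noise_std * np.sqrt(3.0)
        noise = rng.uniform(-a, a, size=image.shape)
    return image + noise
```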
Fig. 1. The effect of adding Gaussian noise on the reconstruction accuracy. The input images are shown in the first column and their recovered shapes are shown in the second column. The last two columns display profile plots of the recovered shapes (in solid blue) imposed on the ground truth (in dotted red). The first row is for the results of the noise-free image while the second and the third rows are for the results of noisy images with SNR=4 and SNR=1.5 respectively.
Fig. 2. The effect of adding uniformly distributed noise on the reconstruction accuracy. The input images are shown in the first column and their recovered shapes are shown in the second column. The last two columns display profile plots of the recovered shapes (in solid blue) imposed on the ground truth (in dotted red). The first row is for the results of the noise-free image while the second and the third rows are for the results of noisy images with SNR=4 and SNR=1.5 respectively.
Experiment #1. In this experiment, the depth map of Venus head (of maximum depth 85mm) is used to generate a synthetic image under the assumptions of Model ’F’. Five more images are constructed from this image by adding Gaussian noise with zero mean
and different variances such that the values of the SNR for these images are 10.0, 6.0, 4.0, 2.0 and 1.5. The algorithm proposed in [1] is applied to these images to recover the shape of the Venus head. The process of generating the noisy images and recovering the corresponding shapes is repeated several times to obtain average results. Figure 1 shows three images with their recovered shapes compared to the ground truth shape. The first row of the figure shows the results for the noise-free image, while the last two rows show the results for images with SNR=4 and SNR=1.5 respectively. For the sake of illustration, the profiles (vertical and horizontal cross-sections) of the recovered and the true shapes are plotted in the last two columns of the same figure. As shown in the figure, the quality of the reconstruction is degraded as the noise level increases. On average, the mean absolute error of the depth increases from 1.69%, for the noise-free image, to 2.76% for the images with SNR = 1.5.

Experiment #2. The previous experiment is repeated using uniformly distributed noise instead of Gaussian noise. The input images and the reconstruction results are shown in Fig. 2. For this case, the mean absolute error of the depth increases from 1.69% for the noise-free image to 2.98% for the images with SNR = 1.5. The results of Exp. 1 and Exp. 2, as illustrated in Fig. 1, Fig. 2 and Table 2, show that even for images with a high level of noise the algorithm produces relatively satisfying results, which indicates the robustness of the proposed approach.

Table 2. The mean of the absolute error at different values of SNR

SNR  | Gaussian noise (%) | Uniformly distributed noise (%)
10.0 | 1.50 | 1.49
6.0  | 1.54 | 1.55
4.0  | 1.60 | 1.67
2.0  | 2.03 | 2.15
1.5  | 2.76 | 2.98
4.2 Effect of Modeling Errors Here a simple experimental setup is used to estimate the reflection characteristics of three samples of paper: (1) a paper with a diffuse surface, which can be accurately described using the Lambertian reflection model with one parameter, the surface albedo; (2) a sand paper, whose reflection can be approximated using the Oren-Nayar model with two parameters, the albedo and the surface roughness; and (3) a paper with a glossy surface, which can be approximated using the Ward model with three parameters, the diffuse albedo, the specular albedo and the surface roughness. In order to estimate these parameters, a flat sheet of the sample is mounted vertically in front of a camera with a flash. The camera is positioned such that its optical axis is perpendicular to the paper sheet, as illustrated in the first column of Fig. 3. The distance between the camera and the sample is about two and a half meters, which makes the assumption of orthographic projection an acceptable approximation. In addition, using the flash attached to the camera as a light source makes the light direction equal to the viewer direction. As stated before, the reflectance map depends on the light
Fig. 3. An experimental setup to estimate reflection parameters for three samples of paper. The first row shows three different orientations for the sample plane. The rows from the second to the fourth show the corresponding images of a glossy paper, a sand paper and a diffuse paper.
Fig. 4. Experimental results for real images of three cylinders constructed from a diffuse paper, a sand paper and glossy paper: (a) the input images, (b,c) the reconstructed shapes under two different orientations and (d) profile plots of the recovered shapes (in solid blue) imposed on the ground truth (in dotted red)
direction s, the viewer direction v, the surface normal vector n and the reflection parameters. In this experimental setup the only unknowns are the reflection parameters of each sample. For the diffuse paper sample, only one image is sufficient to determine the only unknown in the Lambertian model: the surface albedo. For the glossy paper sample, three images are acquired with the camera under three different orientations of the sample sheet, as shown in Fig. 3. Similarly, two images are acquired for the sand paper sample. Using these images, the reflection parameters are estimated for the three samples: the diffuse paper: ρ = 0.8; the sand paper: ρ = 0.7 and σ = 0.4; and the glossy paper: ρd = 0.83, ρs = 0.09 and σ = 0.05. After estimating the reflection parameters, three cylinders with radius = 40 mm were formed from the three paper sheets, then their images were acquired in the same way as
Table 3. The error measures for the real samples

Error Measures                    | Diffuse Paper | Sand Paper | Glossy Paper
Mean Absolute Error               | 2.78          | 3.04       | 1.43
Absolute Error Standard Deviation | 2.50          | 1.7        | 1.01
Mean Gradient Error               | 0.061         | 0.052      | 0.040
Fig. 5. The effect of using incorrect reflectance model: (a) an image of a cylinder constructed from a sand paper (first row) and a glossy paper (second row), (b) the reconstructed shapes using the Lambertian assumption, (c) profile plots of the recovered shape (in solid blue) imposed on the ground truth (in dotted red) and (d) a synthetic image rendered from the recovered shape using Lambertian reflectance model and the same lighting
in the first column of Fig. 3. Given these images and the estimated reflection parameters, in addition to the light and viewer directions, the proposed SFS was used to recover the shapes shown in Fig. 4. As shown in the figure, the recovered shape is close to the true one for all three cases. The error measures are reported in Table 3. These small error values illustrate the accuracy of the reconstruction in terms of depth and surface gradients. The effect of using erroneous values for the reflection parameters is then studied. Figure 5 (first row) shows the recovered shape of the sand paper sample when the Lambertian reflection model is used instead of the Oren-Nayar model. Similarly, the Ward reflection model is replaced by the Lambertian model to obtain the results that are shown in Fig. 5 (second row) for the glossy paper sample. As indicated by the results, in both cases the SFS algorithm fails to infer the correct shape due to the error in the modeling process. It is worth noting that the algorithm tries to recover the shape that will look similar to the input image if rendered using the Lambertian model, as illustrated in Fig. 5(d).
5 Conclusion In our earlier work, we formulated SFS algorithms using different imaging models based on more realistic models for the camera, the light source and the surface reflectance. Since the results of any SFS algorithm may be mainly affected by errors in the image brightness due to noise and modeling errors, in this paper we evaluated our SFS framework against these error sources. To assess the effect of image noise, we
generated synthetic images under the assumption of a perspective camera with the light source located at the optical center and the Oren-Nayar reflectance model; noisy images were then constructed by adding zero-mean Gaussian and uniform noise. We used the signal to noise ratio to measure the noise level. The SFS algorithm was applied to these images to recover the corresponding shapes, which were compared to the ground truth shape (from which we generated the images). The experimental results illustrate the promising performance of our earlier proposed formulation against noise. On the other hand, in order to measure the effect of modeling errors, a simple experimental setup was exploited to estimate the reflection parameters of three samples of paper: a paper with a diffuse surface which can be described as a Lambertian surface, a sand paper which can be modeled by the Oren-Nayar model, and a glossy paper which can be approximated by the Ward model. Three cylinders were then formed from these papers and their images were acquired. When using the SFS algorithm to reconstruct the shapes from these images under the assumption of known surface reflectance parameters, the recovered shape was very close to the true one for all three cases. However, using erroneous values for the reflection parameters, the SFS algorithm failed to infer the correct shape due to the error in the modeling process. This result emphasizes the need for robust algorithms for surface reflectance estimation to aid SFS algorithms in producing more realistic shapes.
References 1. Ahmed, A.H., Farag, A.A.: Shape from shading under various imaging conditions. In: International Conference on Computer Vision and Pattern Recognition CVPR 2007, Minnesota, USA, pp. 1–8 (2007) 2. Horn, B.K.P.: Shape from Shading: A Method for Obtaining the Shape of a Smooth Opaque Object from One View. PhD thesis, Massachusetts Inst. of Technology, Cambridge, Massachusetts (1970) 3. Horn, B.K.P.: Obtaining shape from shading information. In: Winston, P.H. (ed.) The Psychology of Computer Vision, ch. 4, pp. 115–155. McGraw-Hill, New York (1975) 4. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE Trans. on Pattern Analysis and Machine Intelligence 10(4), 439–451 (1988) 5. Pentland, A.P.: Local shading analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 6(2), 170–187 (1984) 6. Prados, E., Faugeras, O.: Shape from shading: a well-posed problem? In: International Conference on Computer Vision and Pattern Recognition CVPR 2005, San Diego, CA, USA (June 2005) 7. Rouy, E., Tourin, A.: A viscosity solutions approach to shape from shading. SIAM J. Numerical Analysis 29(3), 867–884 (1992) 8. Zhang, R., Tsai, P.-S., Cryer, J.-E., Shah, M.: Shape from shading: A survey. IEEE Trans. on Pattern Analysis and Machine Intelligence 21(8), 690–706 (1999) 9. Durou, J.-D., Falcone, M., Sagona, M.: A survey of numerical methods for shape from shading. 2004-2-r, Institut de Recherche en Informatique de Toulouse, IRIT (2004) 10. Ikeuchi, K., Horn, B.K.P.: Numerical shape from shading and occluding boundaries. Artificial Intelligence 17(1-3), 141–184 (1981) 11. Brooks, M.J., Horn, B.K.P.: Shape and source from shading. In: Proceedings of the International Joint conference on Artificial Intelligence, pp. 932–936 (1985) 12. Horn, B.K.P.: Height and gradient from shading. Int. J. Comput. Vision 5(1), 37–75 (1990) 13. Szeliski, R.: Fast shape from shading. CVGIP: Image Underst. 53(2), 129–153 (1991)
14. Crandall, M.G., Lions, P.-L.: Viscosity solutions of hamilton-jacobi equations. Trans. American Mathematical Society 277, 1–43 (1983) 15. Oliensis, J.: Shape from shading as a partially well-constrained problem. CVGIP: Image Underst. 54(2), 163–183 (1991) 16. Dupuis, P., Oliensis, J.: Direct method for reconstructing shape from shading. In: International Conference on Computer Vision and Pattern Recognition CVPR 1992, UrbanaChampaign Illinois, pp. 453–458 (1992) 17. Oliensis, J., Dupuis, P.: A global algorithm for shape from shading. In: International Conference on Computer Vision ICCV 1993, San Diego, CA, USA, May 1993, pp. 692–701 (1993) 18. Kimmel, R., Bruckstein, A.M.: Tracking level sets by level sets: a method for solving the shape from shading problem. Computer Vision and Image Understanding 62(2), 47–58 (1995) 19. Sethian, J.: A fast marching level set method for monotonically advancing fronts. Proc. Nat. Acad. Sci. 93, 1591–1595 (1996) 20. Kimmel, R., Sethian, J.A.: Optimal algorithm for shape from shading and path planning. Journal of Mathematical Imaging and Vision 14(3), 237–244 (2001) 21. Prados, E., Faugeras, O.: Perspective shape from shading” and viscosity solutions. In: International Conference on Computer Vision ICCV 2003, Nice, France (2003) 22. Prados, E., Faugeras, O., Camilli, F.: Shape from shading: a well-posed problem? RR 5297, INRIA Research (August 2004) 23. Ragheb, H., Hancock, E.: Surface normals and height from non-lambertian image data. In: Second International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT 2004), Greece (2004) 24. Pentland, A.P.: Linear shape from shading. International Journal of Computer Vision 4(2), 153–162 (1990) 25. Tsai, P.S., Shah, M.: Shape from shading using linear approximation. Image and Vision Computing J. 12(8), 487–498 (1994) 26. Lee, C.-H., Rosenfeld, A.: Improved methods of estimating shape from shading using the light source coordinate system. Artificial Intelligence 26(2), 125–143 (1985) 27. Horn, B.K.P., Brooks, M.J.: Shape from Shading. MIT Press, Cambridge (1989) 28. Lambert, J.H.: Lambert’s photometrie(photometria sive de mensura de gratibus luminis, colorum et umbrae). Wilhelm Engelmann, Leipzig, Germany (1892) 29. Mu Lee, K., Jay Kuo, C.-C.: Shape from shading with a generalized reflectance map model. Comput. Vis. Image Underst. 67(2), 143–160 (1997) 30. Ahmed, A.H., Farag, A.A.: A new formulation for shape from shading for non-lambertian surfaces. In: International Conference on Computer Vision and Pattern Recognition CVPR 2006, NY, USA, pp. 1817–1824 (2006) 31. Oren, M., Nayar, S.: Generalization of the lambertian model and implications for machine vision. International Journal of Computer Vision 14(3), 227–251 (1995) 32. Wolff, L.B.: Diffuse reflection. In: International Conference on Computer Vision and Pattern Recognition CVPR 1992, Urbana-Champaign Illinois, pp. 472–478 (1992) 33. Ward, G.J.: Measuring and modeling anisotropic reflection. In: Proceedings of the 19th annual conference on Computer graphics and interactive techniques,SIGGRAPH 1992, pp. 265–272 (1992) 34. Born, M., Wolf, E.: Principle of Optics. Pergamon, New York (1970) 35. Wolff, L.B., Nayar, S.K., Oren, M.: Improved diffuse reflection models for computer vision. Int. J. Comput. Vision 30(1), 55–71 (1998) 36. Phong, B.T.: Illumination for computer generated pictures. Commun. ACM 18(6), 311–317 (1975)
37. Kao, C.Y., Osher, S., Qian, J.: Lax-friedrichs sweeping scheme for static hamilton-jacobi equations. Journal of Computational Physics 196(1), 367–391 (2004) 38. Ahmed, A.H., Farag, A.A.: Shape from shading for hybrid surfaces. In: IEEE International Conference on Image Processing ICIP 2007, Texas, USA (2007) 39. Samaras, D., Metaxas, D.: Incorporating illumination constraints in deformable models for shape from shading and light direction estimation. IEEE Trans. Pattern Anal. Mach. Intell. 25(2), 247–264 (2003) 40. Okatani, T., Deguchi, K.: Shape reconstruction from an endoscope image by shape from shading techniques for a point light source at the projection center. Computer Vision and Image Understanding 66(2), 119–131 (1997) 41. Prados, E., Faugeras, O.: Unifying approaches and removing unrealistic assumptions in shape from shading: Mathematics can help. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 141–154. Springer, Heidelberg (2004)
Shape from Texture Via Fourier Analysis Fabio Galasso and Joan Lasenby University of Cambridge, Cambridge, UK
Abstract. Many models and algorithms have been proposed since the shape from texture problem was tackled by the pioneering work of Gibson in 1950. In the present work, a general assumption of stochastic homogeneity is chosen so as to include a wide range of natural textures. Under this assumption, the Fourier transform of the image and a minimal set of Gabor filters are used to efficiently estimate all the main local spatial frequencies of the texture, i.e. so as to compute distortion measures. Then a known method which uses singular value decomposition to process the frequencies under orthographic projection is considered. The method is extended to general perspective cases and used to reconstruct the 3D shape of real pictures and video sequences. The robustness of the algorithm is proven on general shapes, and results are compared with the literature when possible.
1 Introduction
Shape from texture dates back to nearly 60 years ago, when Gibson introduced the visual effects of texture cues in the 3D perception of the human visual system ([1]). Since then many authors have contributed to the study of the shape from texture problem with different texture and shape assumptions (e.g. [2], [3], [4], [5], [6], [7], [8], [9], [10]). Our texture model is stochastic, as in [2], [5], [4], [6] and [10], so as to allow a wider variety of textures than a deterministic model. This implies the use of local spectral measurements achieved with the Fourier transform and with Gabor filters. Our initial assumption is homogeneity. It is frequently used (e.g. [2], [3], [10]) as it is more applicable to real textures than other more restrictive assumptions, like texels (e.g. [9]), seldom found in nature, or isotropy (e.g. [11]), rarely the case. Homogeneity can be seen as periodicity for deterministic textures, and is formalized as stationarity under translation for stochastic textures ([5]). Under this condition we assume that all texture variations are produced only by projective geometry. Shape from texture is generally about measuring the texture distortion in an image and then reconstructing the surface 3D coordinates in the scene ([3], [2], [5], [10]). The present work shows how the method of distortion estimation based on local spatial frequency (LSF) introduced by Galasso and Lasenby and presented in [12] can easily be integrated with the reconstruction method based on singular value decomposition (SVD) of Dugué and Elghadi introduced in [6]. The former is the first to use the multi-scale nature of texture, whereas most of the related work uses only two preferred directions in the spectral domain
(e.g. [4],[13]). The latter is a robust method for automatic 3D reconstruction from LSFs which only assumes that the image contains a frontal point. While the method in [6] only considers orthographic projections, here we also extend the method to deal with perspective projections. Moreover we compare some of the achieved results with those in [12], and we illustrate how the method proposed here is more robust, can effectively use redundancy to improve estimates, and is applicable to general non-developable surfaces. Finally application of the method is shown on a real video sequence, evidencing the sensitivity of the new algorithm to small changes in the shape of a texture object, which could open the way to using shape from texture as a non-invasive shape variation analysis. The article is structured as follows: section 2 explains how the texture is analyzed to produce distortion information; section 3 presents the projective geometry; section 4 shows how we can recover the 3D coordinates from the measured texture distortion; finally, section 5 presents results on real pictures.
2 Texture Description
Here we describe how to compute the instantaneous frequencies for the different LSFs in the texture, with the use of the Fourier transform and minimal sets of 2D Gabor functions. LSFs provide the distortion measures, from which the 3D coordinates of the texture surface are then reconstructed.

2.1 Estimating the Instantaneous Frequencies
We can analyze an image I(x) using a pass-band filter h(x, u), a function of a point x = (x, y) and of a central frequency u = (u, v), which is convolved with the image to provide the local spectrum. As in [14] and [12] we choose 2D Gabor functions:

h(x, u) = g(x) e^{2\pi j x \cdot u} \quad \text{where} \quad g(x) = \frac{1}{2\pi\gamma^2} e^{-\frac{1}{2\gamma^2} x \cdot x} .   (1)
These functions satisfy exactly the lower bound for the measure of a frequency u and its position x, given by the uncertainty principle ([15]). Furthermore the functions are geometrically similar and can be easily arranged to form banks of filters. Our goal is to estimate the instantaneous frequencies u of the image points. [14], [12] and references therein show that this can be done by considering a Gabor function h(x, u), and its two first order derivatives, h_x(x, u) and h_y(x, u):

|\tilde{u}(x)| = \frac{|h_x(x, u) * I(x)|}{2\pi |h(x, u) * I(x)|}, \qquad |\tilde{v}(x)| = \frac{|h_y(x, u) * I(x)|}{2\pi |h(x, u) * I(x)|} .   (2)

The estimate, \tilde{u} = (\tilde{u}, \tilde{v}), can be assumed to be correct if the measured frequency is in the pass-band of the filter. The method implies therefore that we set the
Fig. 1. Brodatz texture D6 (left) and its spectrum amplitude (right)
Fig. 2. (a) Brodatz texture D6 rendered on a plane; (b) its spectrum amplitude; (c) real LSFs (yellow areas) represented on the spectrum; (d) chosen Gabor functions (yellow circles) superimposed on the spectrum
central frequencies u and the spatial constants γ of the Gabor functions, i.e. their centers and width, to relevant values. Unlike Super and Bovik ([14]), who sample the whole 2D frequency plane, we choose to use the method of Galasso and Lasenby ([12]), i.e. we define small sets of Gabor functions from the information provided by the Fourier transform of the image.

2.2 Setting the Gabor Filter Parameters
We refer to [12] for a detailed explanation of how the amplitude of the Fourier transform represents the 2D frequencies of an image. Here we consider the Brodatz texture D6 and its spectrum amplitude in figure 1. From the spectrum we deduce that the texture is mainly characterized by four LSFs, marked with small yellow circles (only half of the spectrum need to be considered due to symmetry). By slanting and rotating the texture (fig. 2(a)), the four dots of figure 1 become corresponding areas in the spectrum (fig. 2(b)). Figure 2(c) represents the instantaneous frequencies of each of the image pixels for the four LSFs (yellow areas) on the spectrum (as the image is synthesized, the ground truth, i.e. the values of each LSF at each pixel, is known and marked on the spectrum). It is clear from this that non-zero values represent corresponding frequencies of the images, and that contiguous areas correspond to the same LSF. In our algorithm each area is used to set a distinct group of Gabor functions (yellow circles in fig. 2(d)), i.e. to set distinct sets of central frequencies u (centers of the circles) and spatial constants γ (their radii). A significant overlapping of the filters increases the robustness of the estimation. Furthermore the number of filters (and consequently convolutions) varies
with the complexity of the image, but fewer are necessary than if sampling the whole spectrum (as in [14]), resulting in a significant reduction in computational expense. Finally the method exploits the multi-scale nature of the texture, because all different-scale frequencies are estimated and used in the shape reconstruction.
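As an illustration of Section 2, the per-pixel frequency estimate of equation (2) for one Gabor channel can be sketched as follows (Python with NumPy/SciPy); the kernel size, the numerical derivatives and the function names are our own choices, assuming the centers u and widths γ have already been selected from the Fourier spectrum as described above.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernels(u, v, gamma, size=31):
    """Complex 2D Gabor h(x,u) of eq. (1) and its spatial derivatives h_x, h_y."""
    r = (size - 1) // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / (2 * gamma**2)) / (2 * np.pi * gamma**2)
    h = g * np.exp(2j * np.pi * (u * x + v * y))
    hx = np.gradient(h, axis=1)   # derivative along x (columns)
    hy = np.gradient(h, axis=0)   # derivative along y (rows)
    return h, hx, hy

def instantaneous_frequency(image, u, v, gamma):
    """Per-pixel |u~|, |v~| from eq. (2) for one Gabor channel centered at (u, v)."""
    h, hx, hy = gabor_kernels(u, v, gamma)
    resp = fftconvolve(image, h, mode="same")
    eps = 1e-9
    u_est = np.abs(fftconvolve(image, hx, mode="same")) / (2 * np.pi * np.abs(resp) + eps)
    v_est = np.abs(fftconvolve(image, hy, mode="same")) / (2 * np.pi * np.abs(resp) + eps)
    return u_est, v_est
```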
3 Projection of Texture
Here we describe the viewing geometry and a projection model, to provide a relationship between the surface texture and the image plane frequencies as a function of the shape. Figure 3 illustrates the viewing geometry of [14]. xw = (xw , yw , zw ) is the world coordinate system and xi = (xi , yi ) is the image plane coordinate system, placed at zw = f < 0, with |f | being the focal length. The orientation of the surface is described using the slant-tilt system: the slant σ is the angle between the surface normal and the optical axis; the tilt τ is the angle between the xi axis and the projection on the image plane of the surface normal. The surface is described by the coordinate system xs = (xs , ys , zs ): the xs -axis is aligned with the perceived tilt direction, the zs -axis is aligned with the surface normal. The above coordinate systems can be easily extended to general surfaces: assuming that the surface is smooth and that at any point it can be locally approximated with the corresponding tangent plane, the equations from [14] then apply to the tangent plane at the considered point. Given the homogeneity assumption in section 1, surface texture frequencies us corresponding to the same LSF must be the same all over the surface texture. However the xs coordinate system changes with the tilt-direction, as the xs axis is aligned with it at each point. For aligning the xs coordinate system we counter-rotate it by the angle −τ on the tangent plane, so as to have xs oriented along the same direction in a virtual unfolded surface texture.
Fig. 3. Viewing geometry and projection model of [14]
This system of coordinates was first introduced by Dugué and Elghadi [6], who illustrate the relationship between x_i and x_s, and between the frequencies on the image and surface texture, u_i and u_s, in the orthographic case. Here we generalize the orthographic relationships to the general perspective case. The transformation matrix between the aligned x_s and the x_w coordinate system is

x_w = \begin{pmatrix} \cos\sigma\cos\tau & -\sin\tau & \sin\sigma\cos\tau \\ \cos\sigma\sin\tau & \cos\tau & \sin\sigma\sin\tau \\ -\sin\sigma & 0 & \cos\sigma \end{pmatrix} \begin{pmatrix} \cos\tau & \sin\tau & 0 \\ -\sin\tau & \cos\tau & 0 \\ 0 & 0 & 1 \end{pmatrix} x_s + \begin{pmatrix} 0 \\ 0 \\ z_0 \end{pmatrix} ,   (3)

which we combine with the perspective projection equation between x_w and x_i,

x_i = \frac{f}{z_w} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} x_w ,   (4)

to obtain the relation between the image plane x_i and the surface texture x_s coordinates (note that, from here on, we will take x_s to be (x_s, y_s), since on the surface texture z_s = 0):

x_i = \frac{f}{z_w} R(\tau) P(\sigma) R(-\tau) x_s , \quad \text{where} \quad z_w = z_0 - \sin\sigma (x_s \cos\tau + y_s \sin\tau) ,   (5)

R(\tau) = \begin{pmatrix} \cos\tau & -\sin\tau \\ \sin\tau & \cos\tau \end{pmatrix} \quad \text{and} \quad P(\sigma) = \begin{pmatrix} \cos\sigma & 0 \\ 0 & 1 \end{pmatrix} .   (6)
R(\tau) and P(\sigma) are respectively responsible for rotating the system of coordinates by \tau and for foreshortening the distances parallel to the tilt direction by a coefficient \cos\sigma; z_0 is the z_w coordinate of the image point being considered. As in [14], we assume that the image intensity I(x_i) is proportional to the surface reflectance I_s(x_s(x_i)), and that shading is negligible because it is assumed to vary slowly compared to I_s. Also, as in [14], we compute the relationship between the frequencies on the image plane u_i = (u_i, v_i) and on the surface texture u_s = (u_s, v_s) by using the transpose of the Jacobian of the vector function x_i(x_s) in equation 5:

u_s = J^t(x_i, x_s) u_i   (7)

with

J^t(x_i, x_s) = \frac{\sin\sigma}{z_w} \begin{pmatrix} x_i\cos\tau & y_i\cos\tau \\ x_i\sin\tau & y_i\sin\tau \end{pmatrix} + \frac{f}{z_w} R(\tau) P(\sigma) R(-\tau) ,   (8)
where z_w is the corresponding coordinate of the surface point which projects to the image point x_i. In the orthographic case, as shown by [6], the Jacobian reduces to

J^t(x_i, x_s) = R(\tau) P(\sigma) R(-\tau) .   (9)

The homogeneity assumption can therefore be written as:

u_{s1} = u_{s2}   (10)

J^t(x_{i1}, x_{s1})\, u_{i1} = J^t(x_{i2}, x_{s2})\, u_{i2} ,   (11)
where us1 and us2 , the surface texture frequencies at the points xs1 and xs2 , are the back-projections of ui1 and ui2 , the image frequencies belonging to the same LSF measured at two distinct image plane points xi1 and xi2 .
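To make the back-projection of equations (7)-(8) concrete, the following NumPy sketch evaluates the transposed Jacobian and maps an image-plane frequency to the surface texture. The function names and calling conventions are ours, not part of the original method; they only illustrate the relation above.

```python
import numpy as np

def R(tau):
    # 2x2 rotation by the tilt angle tau, as in equation (6)
    return np.array([[np.cos(tau), -np.sin(tau)],
                     [np.sin(tau),  np.cos(tau)]])

def P(sigma):
    # foreshortening along the tilt direction by cos(sigma), equation (6)
    return np.array([[np.cos(sigma), 0.0],
                     [0.0,           1.0]])

def jacobian_t(xi, sigma, tau, zw, f):
    """Transposed Jacobian of equation (8) at image point xi = (x_i, y_i)."""
    x_i, y_i = xi
    A = np.array([[x_i * np.cos(tau), y_i * np.cos(tau)],
                  [x_i * np.sin(tau), y_i * np.sin(tau)]])
    return (np.sin(sigma) / zw) * A + (f / zw) * R(tau) @ P(sigma) @ R(-tau)

def backproject_frequency(ui, xi, sigma, tau, zw, f):
    """Equation (7): u_s = J^t(x_i, x_s) u_i."""
    return jacobian_t(xi, sigma, tau, zw, f) @ np.asarray(ui)
```

Under the homogeneity assumption, back-projecting the frequencies of the same LSF measured at two image points should return the same surface frequency, which is exactly constraint (10)-(11).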
4 Computing Surface Orientation
We explain here how our algorithm processes the LSFs computed in section 2 to produce the shape of the surface texture. Let us consider equation 7 for two different LSFs u_i = (u_i, v_i) and u_i' = (u_i', v_i') measured at the same image point x_i:

u_s = J^t(x_i, x_s)\, u_i   and   u_s' = J^t(x_i, x_s)\, u_i' .   (12)

The above equations can be combined to write

U_s = J^t(x_i, x_s)\, U_i ,   where   U_s = \begin{bmatrix} u_s & u_s' \\ v_s & v_s' \end{bmatrix}   and   U_i = \begin{bmatrix} u_i & u_i' \\ v_i & v_i' \end{bmatrix} .   (13)
Here U_s contains the surface texture frequencies relative to the two LSFs, the same all over the surface according to the homogeneity assumption, while U_i contains the image frequencies for the two LSFs, which are functions of the particular point. In the orthographic case we substitute for J^t(x_i, x_s) from 9 in the above equation:

U_i U_s^{-1} = R(\tau) P^{-1}(\sigma) R(-\tau) = \begin{bmatrix} \cos\tau & -\sin\tau \\ \sin\tau & \cos\tau \end{bmatrix} \begin{bmatrix} \frac{1}{\cos\sigma} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\tau & \sin\tau \\ -\sin\tau & \cos\tau \end{bmatrix} .   (14)

The above system of four equations in the two unknowns (σ, τ) holds for each pixel of the image. In the particular case of a frontal point (σ = 0), i.e. when the orientation of the tangent plane is parallel to the image plane, we have

U_s = U_i|_{\sigma=0} .   (15)
In fact U_s will be computed from the U_i estimated at a frontal point, or from equation 14 at a point of known orientation (see below for discussion on the frontal point). Since R is orthonormal and P^{-1} is diagonal, equation 14 can be compared with the SVD of the left hand side of the equation, according to which we write

U_i U_s^{-1} = V D V^{-1} = [v_M\; v_m] \begin{bmatrix} \lambda_M & 0 \\ 0 & \lambda_m \end{bmatrix} [v_M\; v_m]^{-1} ,   (16)

with V the orthogonal matrix of the eigenvectors (v_M, v_m) of U_i U_s^{-1}, and D the diagonal matrix of the corresponding eigenvalues (|λ_M| > |λ_m|). Recalling equation 14 we have

\sigma = \arccos\left(\frac{1}{\lambda_M}\right)   (17)

\tau = \angle(v_M) .   (18)

In other words σ is computed from the bigger eigenvalue while τ is oriented in the direction of the corresponding eigenvector. This is also approximately valid when the surface is not completely unfoldable, in which case the term D_{22} of the diagonal matrix is not exactly 1, as explained in [6]. In the perspective case we substitute for J^t(x_i, x_s) from equation 8 in equation 13:

(I - Y(\sigma, \tau, z_w))^{-1}\, U_i U_s^{-1} = \frac{f}{z_w} \begin{bmatrix} \cos\tau & -\sin\tau \\ \sin\tau & \cos\tau \end{bmatrix} \begin{bmatrix} \frac{1}{\cos\sigma} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\tau & \sin\tau \\ -\sin\tau & \cos\tau \end{bmatrix}   (19)

where

Y(\sigma, \tau, z_w) = \frac{\sin\sigma}{\sin\sigma\,(x_i\cos\tau + y_i\sin\tau) + f\cos\sigma} \begin{bmatrix} x_i\cos\tau & y_i\cos\tau \\ x_i\sin\tau & y_i\sin\tau \end{bmatrix} .   (20)

Again the matrix U_s is computed from the U_i frequencies of a frontal point:

U_s = \frac{z_0}{f} U_i|_{\sigma=0} ,   (21)
where f is the focal length and z_0 is the z_w coordinate of the frontal point. As in the orthographic case, assuming we know the value of the left hand side, we can decompose using the SVD:

(I - Y(\sigma, \tau, z_w))^{-1}\, U_i U_s^{-1} = V D V^{-1} = [v_M\; v_m]\, \frac{f}{z_w} \begin{bmatrix} \lambda_M & 0 \\ 0 & \lambda_m \end{bmatrix} [v_M\; v_m]^{-1} ,   (22)

from which we compute (σ, τ) using equations 17 and 18. However, since Y actually depends on (σ, τ, z_w), we propose an iterative solution:
– the slant σ and tilt τ are estimated orthographically;
– the computed σ's and τ's are integrated to reconstruct the 3D shape, i.e. z_w is estimated with the method of Frankot and Chellappa [16];
– the matrix Y(σ, τ, z_w) and the left hand side of equation 19 are computed;
– the σ's and τ's are updated from the SVD, and the procedure iterates until convergence.
Convergence of the algorithm has been verified empirically; furthermore, we observe that the matrix Y is usually small with respect to the identity matrix I. In fact, a few iterations are usually enough to reach convergence, and this is achieved even when initializing the algorithm with a wrong (τ, σ, z_w) estimate. The procedure needs a frontal point or a point with known orientation. As in [13], we can either manually input the orientation of a point (hence U_s is computed from equations 14 or 19), or assume that the image displays a frontal point (in which case U_s is computed using equations 15 or 21). The presence and position of a frontal point can be detected using the method of [4], which integrates easily with our algorithm. As shown in [12], we only consider two LSFs to write down the product of the canonical moments of [4] as

\sqrt{m_M m_m} = |u_1 v_2 - u_2 v_1| .   (23)

It can be shown that the above is the inverse of the area gradient (see [3]), computed using the two frequencies. Its minimum gives the frontal point.
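A minimal sketch of the per-pixel orientation estimate in the orthographic case (equations (14), (16)-(18)) is given below. It assumes U_i and U_s are the 2x2 frequency matrices of equation (13); the perspective refinement of equations (19)-(22) would iterate around this step after integrating the surface with [16]. The function name is hypothetical.

```python
import numpy as np

def orientation_from_lsfs(Ui, Us):
    """Estimate (sigma, tau) at one pixel from the eigen-decomposition of
    Ui Us^{-1}; Us comes from a frontal point (equation (15))."""
    M = Ui @ np.linalg.inv(Us)
    lam, V = np.linalg.eig(M)            # may be complex under noise
    k = np.argmax(np.abs(lam))           # larger-magnitude eigenvalue lambda_M
    lam_M = np.real(lam[k])
    v_M = np.real(V[:, k])
    sigma = np.arccos(np.clip(1.0 / lam_M, -1.0, 1.0))   # equation (17)
    tau = np.arctan2(v_M[1], v_M[0])                      # equation (18)
    return sigma, tau
```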
With the method described, we get a reconstructed shape for each pair of LSFs. By using equation 7 and the determined (σ, τ, z_w), we can back-project the frequencies u_i to the surface texture. The homogeneity assumption states that the frequencies computed in this way should be the same and that their variance should therefore be zero. Hence we choose the shape associated with the lowest variance, assuming that lower values, closer to the ideal zero value, correspond to better estimates. Using two LSFs does not reduce the applicability of our method, because most real textures have at least two LSFs. On the other hand, the over-specification of the systems of equations and the mathematical structure imposed by the SVD provide robustness to noise, as discussed in section 5. Finally, the algorithm lends itself to parallel implementation, because filters and LSFs can be processed independently and implemented separately.
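The selection rule just described can be sketched as follows, reusing the (hypothetical) backproject_frequency helper from the earlier sketch: each candidate reconstruction back-projects its frequencies at every pixel, and the reconstruction whose back-projected frequencies vary least is kept.

```python
import numpy as np

def reconstruction_score(freqs, pixels, sigmas, taus, zws, f):
    """Total variance of the back-projected surface frequencies for one
    candidate reconstruction; smaller is better under homogeneity."""
    back = np.array([backproject_frequency(ui, xi, s, t, zw, f)
                     for ui, xi, s, t, zw in zip(freqs, pixels, sigmas, taus, zws)])
    return float(np.sum(np.var(back, axis=0)))

# best = min(candidates, key=lambda c: reconstruction_score(*c))
```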
5 Results
We demonstrate our method on two sets of real images. The first set (fig. 4(a),(d),(g)) is taken from [12], with which we compare our results. Figures 4(b),(e),(h) show the orthographic reconstructions, while figures 4(c),(f),(i) show the perspective reconstructions. Table 1 shows the average errors for the orientation parameters σ and τ (denoted |σ_e| and |τ_e|), alongside the corresponding errors from [12]. Compared with the most recent literature (e.g. [10]), our results are accurate, and the pictures in figure 4 show robustness against changes in illumination, self shadowing (e.g. 4(a)) and imperfections (e.g. 4(g)). The slightly higher error in some cases is repaid by increased robustness and by the applicability of the new algorithm to non-developable shapes, as discussed later in this section.
The second set of images is extracted from a video sequence of a textured cloth laid on a slowly deflating balloon. The video was shot with a Pulnix TM-6EX, at a resolution of 400x300 with 256 levels of gray. Every frame was processed and the 3D shape reconstructed for each of them (the top row in fig. 5 shows a sequence of four frames [1,100,200,300]; the mid and bottom rows show the orthographic and perspective reconstructions respectively). Reconstructions are valid despite the image quality being non-optimal. Furthermore, the proposed distortion measure and reconstruction algorithm were able to detect small changes in the shape from frame to frame.

Table 1. Estimation errors for σ's and τ's for the orthographic and perspective cases (degrees), compared with results from [12]

              Average error - Orthographic          Average error - Perspective
              New Method       Method in [12]       New Method       Method in [12]
Image         |σ_e|   |τ_e|    |σ_e|   |τ_e|        |σ_e|   |τ_e|    |σ_e|   |τ_e|
sponge        1.11    4.26     1.55    4.30         1.07    4.48     1.21    4.27
trousers      2.19    5.14     1.48    1.92         2.14    5.50     1.86    1.87
rubber rug    3.28    5.23     4.28    0.44         2.92    5.45     4.39    0.62
Fig. 4. Real images of textures and reconstruction: (a) Sponge; (b) Sponge - orthographic reconstruction; (c) Sponge - perspective reconstruction; (d) Trousers; (e) Trousers - orthographic reconstruction; (f) Trousers - perspective reconstruction; (g) Rubber rug; (h) Rubber rug - orthographic reconstruction; (i) Rubber rug - perspective reconstruction
Figure 6(a) shows the perspective reconstruction of the first frame of the real video sequence in figure 5, achieved using the algorithm from [12]. It is clear from the reconstruction that at some pixels the orientation parameters were misestimated. In fact, the algorithm from [12] estimates two sets of orientation parameters separately from the two LSFs considered at each pixel, and then chooses the set which minimizes the errors of the equations for both LSFs. However, it does not always choose the better estimate. In fact, when a tilt direction is orthogonal to the spectral direction of an LSF, the foreshortening effect is null and the shape can only be recovered by considering the distortion due to perspective convergence to vanishing points (e.g. compare figure 6(b), an LSF with spectral direction orthogonal to the tilt direction of the cosine shape underneath, and figure 6(c), where a second LSF is superimposed with a spectral direction parallel to the tilt direction, which makes it possible to recognize the shape of the cosine).
Fig. 5. Frames extracted from the real video sequence of a textured cloth laid on a deflating balloon. Top row: frames 1,100,200,300; Mid row: respective orthographic reconstructions; Bottom row: corresponding perspective reconstructions.
In addition, we note that laying a textured cloth on a sphere-like, i.e. non-developable, shape involves some natural stretching, meaning that the homogeneity assumption is only approximately true. This implies noise in the distortion estimation, which adds to the noise always present in real pictures. Figure 6(d) shows the spectral directions of the two main LSFs for the considered frame superimposed on the image. These noise sources imply that, for example, some pixels having a tilt along the spectral direction of frequency 2 are estimated using frequency 1, and are therefore misestimated. On the other hand, the algorithm proposed here considers the two LSFs together. If their spectral directions do not coincide, our method can deal with all tilt directions. Moreover, thanks to an effective redundancy and to the use of eigenvalues and eigenvectors, the algorithm proposed here is definitely more robust: it can cope with more noise and it can reconstruct general shapes, including non-developable ones. When used to process the whole real video sequence, the algorithm of [12] failed on some frames. The main reason for this is that it estimates each pixel of the image from a previously estimated close pixel. In this way a single error is propagated and can compromise the whole reconstruction. On the other hand, our new algorithm estimates every pixel from the frontal point, therefore single errors are isolated and naturally ignored in the final integration of the σ's and τ's. As stated in section 1, the homogeneity assumption requires some sort of periodicity/stationarity: the algorithm works well in cases where the periodicity of the texture is more than one order of magnitude bigger than the main frequency associated with the shape. As in [12], the well known tilt ambiguity is solved by assuming convexity. However, using a technique such as an EM algorithm would make our algorithm applicable to more complex shapes.
Fig. 6. (a) Perspective reconstruction of frame 1 from the video sequence in fig.5, achieved using the algorithm of [12]; (b) cosine textured with a LSF orthogonal to the tilt direction; (c) cosine textured with a LSF orthogonal to the tilt direction (same as (b)) and a second LSF directed parallel to the tilt direction; (d) frame 1 from the video sequence in fig.5 with the spectral directions of its two main LSFs superimposed in the center
Finding a robust solution to shape from texture will open the way to possible applications such as the rendering and the retexturing of clothing. The sensitivity of our algorithm to small changes, as in the video sequence result, also makes it possible to use laid-on textures as a non-invasive way of studying shape changes of objects.
6 Conclusions
A method to recover the 3D shape of surfaces from the spectral variation of visual textures has been presented. In particular, the distortion measure has been characterized by local spatial frequencies and estimated with the Fourier transform of the image and with a minimal set of Gabor filters. Then the 3D coordinates of the shape have been reconstructed with a known method which uses singular value decomposition to process the frequencies under orthographic projection. This method has here been extended to the general perspective case and used to reconstruct the 3D shape of real pictures and video sequences. The results are accurate and comparable with the most recent literature. Robustness against shading, variations in illumination, and occlusions has been demonstrated. Furthermore, the proposed algorithm can effectively exploit redundancy and is applicable to general non-developable shapes. Finally, its sensitivity to small shape changes has been illustrated, and its use to non-invasively study textured-object modifications has been proposed for further study.
Acknowledgements Fabio Galasso is grateful to St. John’s college and Giovanni Aldobrandini for the ‘A Luisa Aldobrandini’ Scholarship, supporting his graduate studies.
References
1. Gibson, J.J.: The Perception of the Visual World. Houghton Mifflin, Boston (1950)
2. Kanatani, K., Chou, T.C.: Shape from texture: general principle. Artificial Intelligence 38, 1–48 (1989)
3. Gårding, J.: Shape from texture for smooth curved surfaces in perspective projection. JMIV 2, 327–350 (1992)
4. Super, B.J., Bovik, A.C.: Shape from texture using local spectral moments. IEEE Trans. Patt. Anal. Mach. Intell. 17, 333–343 (1995)
5. Malik, J., Rosenholtz, R.: Computing local surface orientation and shape from texture for curved surfaces. IJCV 23, 149–168 (1997)
6. Dugué, A.G., Elghadi, M.: Shape from texture by local frequencies estimation. In: SCIA, Kangerlussuaq, Greenland, pp. 533–544 (1999)
7. Ribeiro, E., Hancock, E.R.: Estimating the perspective pose of texture planes using spectral analysis on the unit sphere. Pattern Recognition 35, 2141–2163 (2002)
8. Clerc, M., Mallat, S.: The texture gradient equation for recovering shape from texture. IEEE Trans. Patt. Anal. Mach. Intell. 24, 536–549 (2002)
9. Loh, A.M., Hartley, R.: Shape from non-homogeneous, non-stationary, anisotropic, perspective texture. In: BMVC, pp. 69–78 (2005)
10. Massot, C., Hérault, J.: Model of frequency analysis in the visual cortex and the shape from texture problem. IJCV 76, 165–182 (2007)
11. Witkin, A.P.: Recovering surface shape and orientation from texture. Artificial Intelligence 17, 17–45 (1981)
12. Galasso, F., Lasenby, J.: Shape from texture of developable surfaces via fourier analysis. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Paragios, N., Tanveer, S.-M., Ju, T., Liu, Z., Coquillart, S., Cruz-Neira, C., Müller, T., Malzbender, T. (eds.) ISVC 2007, Part I. LNCS, vol. 4841, pp. 702–713. Springer, Heidelberg (2007)
13. Loh, A.M., Kovesi, P.: Shape from texture without estimating transformations. Technical report, UWA-CSSE-05-001 (2005)
14. Super, B.J., Bovik, A.C.: Planar surface orientation from texture spatial frequencies. Pattern Recognition 28, 729–743 (1995)
15. Gabor, D.: Theory of communications. J. Inst. Elec. Engrs. 93, 429–457 (1946)
16. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE Trans. Patt. Anal. Mach. Intell. 10, 439–451 (1988)
Full Camera Calibration from a Single View of Planar Scene
Yisong Chen1, Horace Ip2, Zhangjin Huang1, and Guoping Wang1
1 Key Laboratory of Machine Perception (Ministry of Education), Peking University
2 Department of Computer Science, City University of Hong Kong
Abstract. We present a novel algorithm that applies conics to realize reliable camera calibration. In particular, we show that a single view of two coplanar circles is sufficiently powerful to give a fully automatic calibration framework that estimates both intrinsic and extrinsic parameters. This method stems from the previous work of conic based calibration and calibration-free scene analysis. It eliminates many a priori constraints such as known principal point, restrictive calibration patterns, or multiple views. Calibration is achieved statistically through identifying multiple orthogonal directions and optimizing a probability function by maximum likelihood estimate. Orthogonal vanishing points, which build the basic geometric primitives used in calibration, are identified based on the fact that they represent conjugate directions with respect to an arbitrary circle under perspective transformation. Experimental results from synthetic and real scenes demonstrate the effectiveness, accuracy, and practicality of the approach.
1 Introduction

As an essential step for extracting metric 3D information from 2D images, camera calibration remains an active research topic in most computer vision applications [9]. Much work has been devoted to camera calibration. These methods can be classified into two categories: (1D, 2D or 3D) calibration-pattern based algorithms, and multiple-view based self-calibration approaches [15,19]. Conics and quadrics are widely accepted as fundamental patterns in computer vision due to their elegant properties such as simple and compact algebraic expression, invariance under projective transformation, and robustness to image noise. Conics have long been employed to help perform camera calibration and pose estimation [6]. The strategy of using spheres as a calibration pattern has also drawn more and more attention in recent years [1]. Vanishing points and vanishing lines also play important roles in much calibration and scene analysis work [12,14]. Under the assumption of zero skew and unit aspect ratio, all intrinsic parameters can be solved from the vanishing points of three mutually orthogonal directions in a single image [3]. Multiple patterns or views can be employed to perform calibration in cases where not all three vanishing points are available from a single view. Although recent research has come up with fruitful achievements, most work suffers from the problems of multiple views, restricted patterns or incompleteness of solutions [5,13,18]. Two major obstacles are the mandatory requirements of multiple views and
non-planar scene structures. In this paper, we make an attempt to calibrate the camera from a single image of a planar scene. In our approach, coplanar circles are adopted as basic calibration patterns and vanishing points function in a brand new way. Conic based planar rectification [11] and conic based pose estimation [4] are the two previous approaches most related to our work. The work of [4] estimates the focal length and the camera pose from two coplanar circles in the image. However, this method implicitly assumes that the principal point is known beforehand and cannot treat a non-unit aspect ratio. The work of [11] makes accurate Euclidean measures from coplanar circles in a calibration-free manner. Nevertheless, the analysis is limited to the target plane and cannot be extended to applications in need of camera parameters. In our work, we propose a full calibration scheme which statistically estimates the focal length, the principal point, the aspect ratio as well as the extrinsic parameters. In particular, we show that the circle is a powerful conic in that a single view of two coplanar circles is capable of providing adequate information to do metric calibration. A coarse pipeline of our algorithm is as follows: First, the calibration-free planar rectification reported in [11] is performed and extended to recover the vanishing line, the centers of the circles and many orthogonal vanishing point pairs. Second, under different guesses of the principal point, the distribution of the focal length is computed from all orthogonal vanishing point pairs. Third, based on the previously computed focal length distribution, a statistical optimization routine is designed to estimate the focal length, the principal point and the aspect ratio simultaneously. Fourth, conic based pose estimation [4,6] is employed to compute the extrinsic parameters. Finally, the calibration result is validated by comparison with the ground truth for synthetic scenes or by augmented reality tests for real scenes. The major advantage of our work lies in that the algorithm uses only a single image of a simple planar scene to achieve a full solution of camera calibration. Therefore, the scene requirement is low in comparison with previous methods. This approach is very practical and works well for many scenes which previous methods fail to treat. The rest of the paper is organized as follows. Section 2 briefly reviews and extends the previous work of coplanar circles based scene analysis. Section 3 elaborates a feasible scheme which statistically estimates the focal length, the principal point and the aspect ratio simultaneously. Some discussions are also provided in this section. Section 4 presents the experimental results on both synthetic and real scenes. Finally, concluding remarks are given in Section 5.
2 Preliminaries

We will present a calibration algorithm step by step under the practical assumption of zero skew. Our algorithm solves the camera projection matrix P = K[R|t], where K is the zero-skew calibration matrix containing 4 intrinsic parameters as defined in equation (1) and the metric matrix [R|t] fully encodes the 6 extrinsic parameters.

K = \begin{pmatrix} \alpha f & 0 & u \\ 0 & f & v \\ 0 & 0 & 1 \end{pmatrix}   (1)
We first briefly introduce some related work. Throughout the discussion we adopt the homogeneous representation which is standard in algebraic projective geometry [17]. In [11] it is suggested that under perspective transformation the images of the two circular points on a plane, I = (1, i, 0)^T and J = (1, -i, 0)^T, can be computed by solving the intersection of the images of two coplanar circles, which have the following forms under homogeneous representation:

a_1 x^2 + b_1 xy + c_1 y^2 + d_1 xw + e_1 yw + f_1 w^2 = 0
a_2 x^2 + b_2 xy + c_2 y^2 + d_2 xw + e_2 yw + f_2 w^2 = 0   (2)

By solving equation (2) the images of the two circular points can be computed as

I' = (x_0, y_0, 1) ,   J' = (\bar{x}_0, \bar{y}_0, 1) ,   (3)

where (x_0, y_0) and (\bar{x}_0, \bar{y}_0) are the complex-conjugate roots of equation (2) corresponding to the circular points. Afterwards the vanishing line is computed as the cross product of the two circular point images:

l_\infty = I' \times J' = (y_0 - \bar{y}_0,\; \bar{x}_0 - x_0,\; x_0\bar{y}_0 - \bar{x}_0 y_0) .   (4)
Notice that the vanishing line is a real line although the entries of I' and J' are always complex. In [4] an algorithm is presented to estimate the focal length and the camera pose from two coplanar circles. Unfortunately, the principal point has to be known a priori and the aspect ratio is fixed to be 1.0 for this algorithm to take effect. The attempt to use this constraint alone to estimate the principal point and the focal length at the same time results in large errors, especially in the presence of a non-unit aspect ratio. It is also mentioned in [10] that small changes in the estimated principal point may severely degrade the quality of reconstruction. Therefore, an alternative algorithm is desired to give a reliable estimate of the principal point as well as the aspect ratio. By integrating and extending the ideas of the above approaches, we propose a calibration scheme which simultaneously estimates the focal length, the principal point, and the aspect ratio. The algorithm is outlined in Section 3.
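As an illustration of equations (2)-(4), the sketch below intersects two imaged circles, given as symmetric 3x3 conic matrices, keeps the complex-conjugate pair of intersections as the imaged circular points, and returns the real vanishing line. It is our own SymPy/NumPy reading of the text; the function name and tolerances are assumptions, not the authors' code.

```python
import numpy as np
import sympy as sp

def vanishing_line_from_two_conics(C1, C2):
    """Intersect two imaged coplanar circles; the complex-conjugate intersections
    are the imaged circular points (eq. (3)), whose cross product gives the
    vanishing line (eq. (4))."""
    x, y = sp.symbols('x y')
    X = sp.Matrix([x, y, 1])
    eqs = [(X.T * sp.Matrix(C) * X)[0] for C in (C1, C2)]
    circular_pts = []
    for sol in sp.solve(eqs, [x, y]):
        xv, yv = (complex(sp.N(v)) for v in sol)
        if abs(xv.imag) > 1e-9 or abs(yv.imag) > 1e-9:   # keep the complex pair
            circular_pts.append(np.array([xv, yv, 1.0], dtype=complex))
    I_img, J_img = circular_pts[:2]
    l_inf = np.imag(np.cross(I_img, J_img))   # purely imaginary up to scale -> real line
    return l_inf / np.linalg.norm(l_inf)
```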
3 Statistical Camera Calibration

In this section, we present a step-by-step framework that benefits from conjugate direction computation and fully calibrates the camera. We start from some preliminary theory and give an orthogonal direction identification algorithm based on coplanar circles [17].

3.1 Orthogonal Vanishing Point Pairs Identification

To make full use of the geometric cues in the image we turn to the following fact: the line at infinity, l_\infty, is the polar line of the circle center, o_i, of an arbitrary circle C on the plane.
Fig. 1. Vanishing line and orthogonal point pair computation under original and perspective view. v1 and v2 are orthogonal vanishing points.
In other words, o_i and l_\infty satisfy the pole-polar relation described in equation (5):

l_\infty = (l_1, l_2, l_3)^T = C o_i = \begin{bmatrix} a & b/2 & d/2 \\ b/2 & c & e/2 \\ d/2 & e/2 & f \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} .   (5)

A corollary of the above fact is that two orthogonal directions are conjugate to each other with respect to any circle on the plane. Note that a planar direction can be represented by the corresponding point at infinity. Accordingly, given a circle C and the line at infinity l_\infty on the plane, we can freely choose one point at infinity, v, on l_\infty and determine another point at infinity v' in the orthogonal direction of v using the conjugate property of the two directions. The calculation is formulated with the following equations under homogeneous representation:

l = C v ,   v' = l \times l_\infty .   (6)
That is, the orthogonal direction of a given point at infinity can be computed by solving the intersection of its polar line, l, and the line at infinity, l_\infty. All the above computations are based on the pole-polar relationship, which is invariant under projective transformation. Consequently, the process can easily be transported to determine as many conjugate vanishing point pairs as we want in a perspective view. An illustration of these computations is given in Figure 1. With an image of two coplanar circles, the vanishing line can be computed using equations (2)-(4). Then many orthogonal directions can be computed using equation (6). This paves the way for our statistical calibration framework, which will be detailed in the next section.

3.2 Statistical Calibration by Maximum Likelihood Estimate

For convenience we first consider the camera model with unit aspect ratio. In the Cartesian image coordinate, if the position of the principal point p(x_0, y_0) is given, then for each freely chosen vanishing point v(x, y), a corresponding vanishing point v'(x', y'), which represents the orthogonal direction of v, can be identified on the vanishing line using equation (6). Moreover, according to the orthogonal property of v and v', there is
a unique focal length corresponding to the specified p, v, and v', which can be computed from the following equation [3]:

f = \sqrt{-(x - x_0)(x' - x_0) - (y - y_0)(y' - y_0)} .   (7)

Different orthogonal vanishing point pairs lead to different estimated focal lengths. Therefore, for each guessed principal point p and a set of orthogonal vanishing points V = \{\{v_1, v_1'\}, \{v_2, v_2'\}, \ldots, \{v_n, v_n'\}\}, we can estimate a corresponding set of focal lengths F = \{f_1, f_2, \ldots, f_n\}. Our basic idea is to employ the set F, containing a large number of estimated f values, to statistically constrain the principal point. We reasonably expect that, if the principal point is correctly estimated, then the f values in the F set form a densely distributed cluster. On the contrary, if the guessed principal point is far from the correct position, then the distribution of the focal lengths computed by equation (7) is likely to be quite sparse. Therefore, the distribution of the entries of the F set provides a confidence measure of the guess about the principal point (x_0, y_0). Naturally, the variance of the distribution, D(F), is a good candidate to measure such confidence and evaluate the goodness of the guess. In other words, although the probability density function P(x_0, y_0) is hidden from us, it can be measured through the observable focal length distribution D(F). A smaller D(F) corresponds to a higher confidence of (x_0, y_0). Under this formulation we can use D(F) to characterize the probability density function of the principal point and perform calibration through maximum likelihood estimation. Note that F is determined by (x_0, y_0) and should be more strictly written as F(x_0, y_0). We take D(F) as the cost function and try to solve the following optimization problem:

\mathrm{Minimize}_{(x_0, y_0)}\; D(F(x_0, y_0)) .   (8)
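A compact sketch of equations (6)-(8) under the unit-aspect-ratio model is given below. The helper names and the commented usage line are ours, and SciPy's Nelder-Mead implementation stands in for the downhill simplex optimizer of [16].

```python
import numpy as np
from scipy.optimize import minimize

def orthogonal_pair(v, C, l_inf):
    """Equation (6): intersect the polar line of v (w.r.t. the imaged circle C)
    with the vanishing line to get the orthogonal vanishing point."""
    l = C @ v                      # polar line of v
    vp = np.cross(l, l_inf)        # intersection with the vanishing line
    return vp / vp[2]              # dehomogenize

def focal_from_pair(v, vp, x0, y0):
    """Equation (7): focal length implied by one orthogonal pair for a guessed
    principal point (x0, y0); v, vp have third coordinate 1."""
    val = -(v[0] - x0) * (vp[0] - x0) - (v[1] - y0) * (vp[1] - y0)
    return np.sqrt(val) if val > 0 else np.nan

def cost(p, pairs):
    """Equation (8): variance of the implied focal lengths; a consistent guess
    of the principal point makes the focal lengths cluster tightly."""
    F = np.array([focal_from_pair(v, vp, *p) for v, vp in pairs])
    F = F[np.isfinite(F)]
    return np.var(F) if F.size else np.inf

# pairs = [(v, orthogonal_pair(v, C, l_inf)) for v in sampled_points_on_line]  # hypothetical sampling
# best = min((minimize(cost, s, args=(pairs,), method='Nelder-Mead') for s in random_starts),
#            key=lambda r: r.fun)
```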
Under this formulation, from every guess about the principal point a confidence value can be estimated and the corresponding focal length can be computed. An optimization routine is required to seek the minimum of equation (8), which corresponds to the maximum likelihood estimate of the intrinsic parameters (x_0, y_0, f). In our study the above statistical function is not easily differentiated analytically, so a derivative-free optimizer is preferred. The downhill simplex method is a good candidate for this type of optimization [16]. In addition, experiments show that many local minima exist in the solution space. We solve this problem by employing multiple initial points. Namely, the optimization is repeated several times with multiple randomly chosen starting points, and the best result produced is adopted as the final solution. This strategy ensures reliability and robustness. After the principal point (u, v) and the focal length f are determined, the conic based pose estimation algorithm in [4,8] is employed to calculate the extrinsic parameters. This completes a full single-view based calibration framework.

3.3 Taking Aspect Ratio into Account

Having made the above calibration algorithm work, adding an extra intrinsic parameter, i.e., the scale factor α, becomes straightforward. All we need to do is introduce α as a fourth unknown variable into the optimization routine. During optimization, for each guessed value of the scale factor, the image is first scaled horizontally to give a corrected
'ideal image'. Afterwards the 'ideal image' is taken as the input of the optimization program. Due to the fact that the 4 intrinsic parameters (u, v, f, α) are somewhat tightly coupled and are not easy to estimate separately with high precision [2], the risk of instability rises slightly compared with the algorithm considering no scale factor. Nonetheless, the power of the downhill simplex method with multiple initializations effectively prevents such degradation and a good solution can always be reached. Table 1 briefly describes the major steps of the calibration scheme addressed in this section.

Table 1. The pipeline of our calibration scheme

Step 1. Approach: solve the vanishing line and a group of orthogonal vanishing point pairs. Techniques: ellipse detecting and fitting, algebraic equation solving, pole-polar computation. Equations: (2)-(6).
Step 2. Approach: compute an optimal focal length f for each given guess of (u, v, α). Techniques: vanishing point and vanishing line based intrinsic parameter calibration. Equation: (7).
Step 3. Approach: use the distribution of the focal lengths to build an error function based on MLE. Techniques: probability density function based maximum-likelihood estimate. Equation: (8).
Step 4. Approach: nonlinear optimization by iteratively executing steps 2-4, until convergence. Techniques: downhill simplex method with multiple initialization. Ref. [16].
Step 5. Approach: extrinsic parameter calibration from optimal (u, v, α, f). Techniques: conic based camera pose estimation. Ref. [4,8].
3.4 More Discussions

Finally, it is worthwhile to make an insightful comparison between our work and other conic based calibration approaches. Although conics are widely used for calibration purposes, to date most conic based calibration algorithms originate from the plane-based calibration framework stated in [18]. In such a framework, the images of the circular points computed from conics of multiple views are used to put constraints on, and solve for, the image of the absolute conic ω (IAC). Then the calibration matrix K is obtained by factorizing it with the well known equation ω = K^{-T} K^{-1}. In contrast, by treating the problem from a different point of view, our work does not involve the notion of the IAC. The calibration is achieved from statistical information provided by a large set of orthogonal vanishing points in a single view. The advantage of our work is obvious: it does not rely on an a priori known principal point or aspect ratio as in [4], has no restrictions on circle positions as in [5,13], and, above all, utilizes only a single view to achieve reliable calibration. We believe that it is an attractive solution and may represent a promising direction. Note that in the context of our study, many orthogonal vanishing point pairs \{v_1, v_1'\}, \{v_2, v_2'\}, \ldots, \{v_n, v_n'\} can be identified. So one might wonder whether we could use the over-determined constraint v^T \omega v' = 0 to solve for the image of the absolute conic, ω, and then compute the calibration matrix by employing the methods of [18]. Unfortunately this idea does not work in practice. The reason is that all vanishing points come from a common line, which is a degenerate case for this formulation.
Fig. 2. The original Olympic logo scene and two augmented frames

Table 2. Camera calibration results for the synthetic scene

       Ground truth (u, v, α, f)        Estimated (u, v, α, f)
Case1  (-0.005, -0.005, 1.00, 2.56)     (-0.00983, -0.00278, 1.00068, 2.57648)
Case2  ( 0.275, -0.360, 0.89, 2.56)     ( 0.26896, -0.38662, 0.89303, 2.55377)
Case3  ( 0.500,  0.450, 0.97, 2.56)     ( 0.49496,  0.45749, 0.96346, 2.49293)
Case4  (-0.200,  0.125, 1.05, 2.56)     (-0.20441,  0.17573, 1.05164, 2.68032)
Case5  (-0.075, -0.045, 1.16, 2.56)     (-0.07842, -0.01975, 1.15125, 2.51853)
The solution obtained in this way bears a high risk of inaccuracy and instability, and it often occurs that the ω computed in this manner fails to be decomposed reliably. This problem is avoided in our MLE based optimization scheme.
4 Experimental Results

Both synthetic and real world scenes are tested in our experiments. Ellipse detection and fitting is performed by robust feature extraction and regression algorithms [7]. One point to stress is that our experiments are done in normalized image coordinates to ensure a steady order of magnitude during the solving process and to achieve better precision [9].

4.1 Synthetic Scene

The image in our first experiment is a 512*512 synthetic planar scene with an Olympic logo on it (see Figure 2). Different circle combinations can be selected to compute the vanishing line. After the vanishing line is computed by equation (4), the method addressed in Section 3 is employed to calibrate the camera and test the performance of the algorithm under different parameter configurations. In the normalized image coordinate the size of one pixel is 0.01. Some results are given in Table 2. The data in the table show that the algorithm adapts stably to a wide range of principal points and aspect ratios. All parameters can be estimated with high accuracy. The maximal deviations of the principal point, the focal length, and the aspect ratio are, respectively, 5 pixels, 12 pixels, and 0.009.
Fig. 3. The original and the augmented plaza scenes

Table 3. Camera calibration results for the real scene

Key circle    u         v          α         f         N
Circle1       0.00757   -0.02641   1.07045   6.86312   (-0.2804, 0.7824, 0.5561)
Circle2       0.01425    0.03174   1.06088   6.93255   (-0.2846, 0.7869, 0.5475)
Circle3       0.01577   -0.02191   1.07829   6.92373   (-0.2774, 0.7857, 0.5529)
The plane normal with respect to the camera coordinate, N, can be computed using the method in [8]. In all cases we get the consistent result N ≈ (−0.0005, 0.4480, 0.8940). This normal vector helps us compute the extrinsic parameters, which define the camera pose. With all intrinsic and extrinsic parameters figured out, we easily rebuild the transform matrix between the world coordinate and the camera coordinate. To validate the correctness of the result we perform an augmented reality experiment by producing a short movie of a 3D soccer model rolling on the plane in the scene. The augmented scene looks quite realistic. Two frames extracted from the movie are illustrated in Figure 2.

4.2 Real Scene

In our second experiment a challenging 640*480 plaza scene is adopted to test the interesting case of concentric circles (see Figure 3). Despite the relatively noisy scene structures in the image, the distinct magenta color of the two rings in the scene allows robust detection of both edges of the inner ring (circle1 and circle2) and the inner edge of the outer ring (circle3). Circle1 and circle2 are too close to each other and represent a typical degenerate case, so this pair is not appropriate and should be discarded for calibration purposes. By contrast, the (circle1, circle3) pair and the (circle2, circle3) pair are sufficiently distant although the two circles are concentric. From either combination the images of the circular points and the circle centers can be successfully estimated. By employing equation (6) we compute an individual orthogonal vanishing point set from each of the 3 circles and try one calibration for each. The calibration results of the 4 intrinsic parameters and the plane normal N for all 3 cases are given in Table 3. The data in the table show that the results are quite consistent. This strongly justifies the stability of our approach. Two reference points are selected on the plane to help build a world coordinate system with the ground plane as the x-y plane and its normal as the z direction. Then the transform between the camera coordinate and the world coordinate is established.
We additionally validate our algorithm by adding a 3D moving car model into the scene. Metric rectification techniques help tailor the size of the virtual car and fit it seamlessly into the scene. Two frames of the animation are given in Figure 3.
5 Conclusion

We have presented a framework that fully calibrates the camera from only a single perspective view of two coplanar circles. Metric planar rectification and conic based pose estimation are combined in a statistical manner to achieve robust and reliable calibration. The advantage of our work is twofold: first, it is superior to many previous calibration algorithms in that it uses only a single view of a planar scene; second, it is more practical than many calibration-free approaches because it supports 3D augmented reality applications in addition to simple 2D Euclidean measurement. Future work includes making a more accurate error analysis and treating more complex camera models.
Acknowledgement

We would like to thank all anonymous reviewers for their comments and advice, which greatly helped improve the quality of this manuscript.
References 1. Agrawal, M., Davis, L.S.: Camera calibration using spheres: a semi-definite programming approach. In: Proc. of IEEE ICCV, pp. 782–789 (2003) 2. Bougnoux, S.: From projective to Euclidean space under any practical situation, a criticism of selfcalibration. In: Proc. 6th ICCV, pp. 790–796 (January 1998) 3. Caprile, B., Torre, V.: Using vanishing points for camera calibration. International journal of computer vision 4(2), 127–140 (1990) 4. Chen, Q., Hu, H., Wada, T.: Camera calibration with two arbitrary coplanar circles. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3023, pp. 521–532. Springer, Heidelberg (2004) 5. Colombo, C., Comanducci, D., Del Bimbo, A.: Camera calibration with two arbitrary coaxial circles. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 265–276. Springer, Heidelberg (2006) 6. Dhome, M., Lapreste, J., Rives, G., Richetin, M.: Spatial localization of modeled objects of revolution in monocular perspective vision. In: Faugeras, O. (ed.) ECCV 1990. LNCS, vol. 427, pp. 475–485. Springer, Heidelberg (1990) 7. Fitzgibbon, A., Pilu, M., Fisher, R.: Direct least-square fitting of ellipses. IEEE Trans. PAMI (June 1999) 8. Forsyth, D., et al.: Invariant descriptors for 3-D object recognition and pose. IEEE Trans. PAMI 13(10), 250–262 (1991) 9. Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University, Cambridge (2003) 10. Hartley, R., Kaucic, R.: Sensitivity of calibration to principal point position. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2351, pp. 433–446. Springer, Heidelberg (2002)
11. Ip, H., Chen, Y.: Planar rectification by solving the intersection of two circles under 2D homography. Pattern recognition 38(7), 1117–1120 (2005) 12. Kanatani, K.: Statistical analysis of focal-length calibration using vanishing points. IEEE Trans. Robotics and Automation 8(6), 767–775 (1992) 13. Kim, J., Gurdjos, P., Kweon, I.: Geometric and algebraic constraints of projected concentric circles and their applications to camera calibration. IEEE Trans. PAMI 27(4), 637–642 (2005) 14. Liebowitz, D., Criminisi, A., Zisserman, A.: Creating architectural models from images. In: EuroGraphics1999, vol. 18, pp. 39–50 (1999) 15. Luong, Q.T., Faugeras, O.: Self-calibration of a moving camera from point correspondences and fundamental matrices. IJCV 22(3), 261–289 (1997) 16. Press, W.H., et al.: Numerical recipes in C: the art of scientific computing, 2nd edn. Cambridge University Press, Cambridge (1997) 17. Semple, J.G., Kneebone, G.T.: Algebraic projective Geometry. Clarendon Press, Oxford (1998) 18. Sturm, P.F., Maybank, S.J.: On plane-based camera calibration: a general algorithm, singularities, applications. In: Proc. CVPR, pp. 432–437 (1999) 19. Zhang, Z.: Camera calibration with one-dimensional objects. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 161–174. Springer, Heidelberg (2002)
Robust Two-View External Calibration by Combining Lines and Scale Invariant Point Features
Xiaolong Zhang, Jin Zhou, and Baoxin Li
Department of Computer Science and Engineering, Arizona State University, Tempe, USA
{xiaolong.zhang.1,jinzhou,baoxin.li}@asu.edu
Abstract. In this paper we present a new approach for automatic external calibration for two camera views under general motion based on both line and point features. Detected lines are classified into two classes: either vertical or horizontal. We make use of these lines extensively to determine the camera pose. First, the rotation is estimated directly from line features using a novel algorithm. Then normalized point features are used to compute the translation based on epipolar constraint. Compared with point-feature-based approaches, the proposed method can handle well images with little texture. Also, our method bypasses sophisticated post-processing stage that is typically employed by other line-feature-based approaches. Experiments show that, although our approach is simple to implement, the performance is reliable in practice.
1 Introduction

Multi-view-based camera external calibration is a well-studied computer vision task, which is of importance to applications such as vision-based localization in robotics. We consider the following external calibration problem: given two views under general motion, recover the relative camera pose and translation. The state-of-the-art solution for determining the geometry of two views is to first estimate the essential matrix and then compute the relative pose and translation from it. Usually, the essential matrix is computed from the fundamental matrix, which can be estimated from 8 point correspondences [2]. In practice, RANSAC is used to process many more points to gain robustness. Still, such methods suffer from two degenerate cases: 1) lack of texture in the images and thus lack of corresponding features; and 2) the existence of coplanar feature points. When the number of correctly corresponded features in general position is too small, the computation of the fundamental matrix becomes difficult; in the case that the number of good feature correspondences is less than 8, the algorithm does not work at all. To address these problems, we propose a novel method for two-view external calibration that combines line and point features. Specifically, we utilize the fact that, in images captured in most indoor or outdoor urban environments, there exist many line-like structures that are either vertical or horizontal in the original physical world. These include, for example, the vertical contours of buildings, door frames, etc. Our method employs a two-step approach by estimating the camera orientation and translation respectively,
heavily using such line-like structures from the images. We first estimate the orientation using the detected lines; then the camera can be rotated such that only the translation needs to be estimated. In theory, only two points are sufficient to determine the new essential matrix, which is determined only by the translation. Earlier work on stereo matching based on line segments has been reported, verifying the robustness of line-based approaches in general and especially when texture information is insufficient. However, such approaches often require sophisticated post-processing stages after line detection, either by iteratively reintroducing mismatches [5] or by merging and growing operations [7]. These methods yield fairly good results but rely on a large number of correct correspondences and thus require extra steps that boost the computational cost. Our approach focuses on very simple manipulations based on geometric principles to provide a closed-form solution rapidly. Compared with the approach reported in [8], our approach does not require two vanishing points in orthogonal directions, so it works with weaker assumptions and thus may be able to handle more general situations. The proposed approach has three main steps:
1. Line detection and matching.
2. Line-based estimation of the rotation.
3. Point-based estimation of the translation.
In the next section we briefly describe our line matching method, and then in Sections 3 and 4 we discuss the estimation of the rotation and the translation. In Section 5, we provide experimental results using both synthetic and real data. We conclude with a brief discussion of future work in the last section.
2 Line Detection and Matching The fundamental problem we are trying to solve is the estimation of the relationship between different views of the same scene, as shown in Fig.1, which illustrates an outdoor environment from which a fair amount of line features can be observed. The first component of our algorithm deals with reliable line detection and matching.
Fig. 1. Multiple views of the same scene taken by cameras with different pose containing rotation and translation
For line detection, we use the line detector utility HoughLines2 in OpenCV [12], which is based on Canny edge detector [11] and Hough transform [13]. We control the thresholds in the Canny edge detector so as to guarantee at least 10 lines being generated. Note that, in theory our method requires only two pairs of corresponding vertical lines and one pair of horizontal lines to carry out the estimation. In practice, experiments show that using about 10 lines is sufficient to guarantee reliable performance.
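The sketch below illustrates this detection step using OpenCV's current Python API (cv2.HoughLinesP rather than the legacy HoughLines2 utility named above). The threshold-relaxing loop and the angular tolerance for labeling near-vertical segments are our illustrative reading of the text, not the authors' exact settings.

```python
import cv2
import numpy as np

def detect_lines(gray, min_lines=10):
    """Canny + probabilistic Hough, loosening the Canny thresholds until at
    least min_lines segments are returned."""
    segs = None
    for high in (200, 150, 100, 60):             # progressively looser thresholds
        edges = cv2.Canny(gray, high // 2, high)
        segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                               minLineLength=40, maxLineGap=5)
        if segs is not None and len(segs) >= min_lines:
            break
    return [] if segs is None else [s[0] for s in segs]   # each entry is (x1, y1, x2, y2)

def is_near_vertical(seg, tol_deg=20):
    """Label steep segments as vertical (a tolerance-based stand-in for the
    slope-based labeling described in the text)."""
    x1, y1, x2, y2 = seg
    angle = np.degrees(np.arctan2(abs(x2 - x1), abs(y2 - y1)))  # 0 deg = vertical
    return angle < tol_deg
```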
Fig. 2. Building the profile vector from adjacent parallel lines for each single line feature
Fig. 3. Above: edges and lines returned by the Canny detector and the line detector. Below: vertical lines plotted in green. This office scene contains 7 vertical line matches containing 1 mismatch and 7 non vertical matches containing 2 mismatches.
We use three line matching criteria: color histogram with sidedness constraint, line slope value, and centroid distance. First, our algorithm extracts a profile vector to represent the color histogram on both sides of each line. For computing the color histogram, we use pixels that lie on lines that are parallel to, and within a reasonable distance of, the given line. The distance should not be too close (in which case the pixels may fall onto the given line) or too far (in which case the pixels may be too far from an actual edge). We experimented with a distance range of 3 to 10 pixels. Bilinear interpolation is used in obtaining the lines of pixels. This is illustrated in Fig. 2.
This color histogram is computed for both sides of the given line and then saved as a profile vector for that line. Then we match each line against the other image for potential correspondences by measuring the difference between the histograms on both sides with the cost function defined in [5]. The search region here is a window centered at the middle of the original line. After we find the potential candidates, we check the slope of each candidate and disqualify correspondences with a tilt angle difference greater than 45 degrees. Furthermore, we require mutual agreement (line a in view 1 and line b in view 2 should both choose each other as the best match) to confirm a correspondence. During this process, lines with the largest slope are labeled as vertical lines and the others as horizontal ones (as illustrated in Fig. 3).
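One possible implementation of the profile-vector computation is sketched below; it is our own reading of the text rather than the authors' code, it assumes a BGR color image, and it uses a simple L1 distance as a stand-in for the matching cost of [5].

```python
import cv2
import numpy as np

def side_profile(img, seg, offsets=range(3, 11), bins=16, samples_per_line=50):
    """Profile vector of a segment: per-channel color histograms of bilinearly
    sampled pixels on parallel lines 3-10 px away, on each side of the segment."""
    x1, y1, x2, y2 = map(float, seg)
    d = np.array([x2 - x1, y2 - y1]); d /= np.linalg.norm(d)
    n = np.array([-d[1], d[0]])                           # unit normal to the segment
    t = np.linspace(0.0, 1.0, samples_per_line)
    base = np.stack([x1 + t * (x2 - x1), y1 + t * (y2 - y1)], axis=1)
    hists = []
    for side in (+1, -1):
        pixels = []
        for off in offsets:
            pts = (base + side * off * n).astype(np.float32)
            mapx = np.ascontiguousarray(pts[None, :, 0])
            mapy = np.ascontiguousarray(pts[None, :, 1])
            patch = cv2.remap(img, mapx, mapy, cv2.INTER_LINEAR)   # bilinear sampling
            pixels.append(patch.reshape(-1, img.shape[-1]))
        pixels = np.concatenate(pixels)
        h = [np.histogram(pixels[:, c], bins=bins, range=(0, 256), density=True)[0]
             for c in range(pixels.shape[1])]
        hists.append(np.concatenate(h))
    return np.concatenate(hists)

def match_cost(p, q):
    """L1 distance between two profile vectors (stand-in for the cost of [5])."""
    return float(np.abs(p - q).sum())
```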
3 Estimating the Rotation

After we achieve reliable line matching, we start the next step of estimating the rotation matrix. We assume the camera matrix K, given in equation (1), is known. The rotation matrix can be decomposed as the product of three rotations, one about each axis of the 3D space. We estimate the rotations about the z and x axes based on the vertical lines, rectify the image coordinates based on these estimates, and then estimate the rotation about the y axis based on the horizontal lines.
Fig. 4. Illustration of the estimation of the two rotations: two vertical lines in an image define a vertical intersection point P, which is rotated twice (about the z axis and then the x axis) to obtain its rectified position
3.1 Estimating the Rotations about the z and x Axes
We now discuss our method for estimating these two rotations. With the vertical lines in each view, we calculate the vertical vanishing point from those lines. By rotating the image coordinate about the z axis, we make this vanishing point fall on the y-z plane, and then we further rotate the image about the x axis so that the vanishing point falls on the y axis. In practice, we rotate the original image about the z axis to guarantee that the vertical vanishing point falls onto the y axis of the imaging plane in the first step, and then rotate the image about the x axis to make the vertical lines parallel to each other. This is illustrated in Fig. 4. It can be proven that the two rotation angles in these steps are exactly the two sought rotation angles. Since in this step only one pair of vertical lines from each view is needed, we pick the pair from the set of corresponded vertical lines using the following criterion: we pick the pair whose intersection point is closest to the centroid of the intersections of all vertical line pairs.

3.2 Estimating the Rotation about the y Axis

So far we have focused on the vertical lines. Next, the horizontal lines play the main role in estimating the remaining rotation angle θ. With the first two rotations known, we normalize the line coordinates as in equation (2): we assume that the camera matrix is known, and the first two rotation angles have been calculated in the previous steps, so the image coordinate can be normalized as shown in (2). In theory, if two cameras have the same internal parameters as well as the same orientation, any vanishing point has the same position in both views. From this property, we can find θ by transferring a vanishing point on one image to its corresponding position on the other image. Vanishing points can be obtained by intersecting horizontal lines with the horizontal vanishing plane. In the standard view, the axis (0, 1, 0)^T is the vanishing line of the horizontal plane, and any point on it has coordinates of the form (p, 0, 1)^T. After rotating the camera around the y axis, one such point v is transferred to another point v' on the axis (illustrated in Fig. 5). So we have

v' \cong R_y(\theta)\, v ,   (3)

where

R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} .   (4)

The solution for θ can be obtained from one pair of corresponding vanishing points (v, v'), which is determined by a pair of corresponding horizontal edges (l, l'), normalized as in (2):

v = l \times (0, 1, 0)^T   and   v' = l' \times (0, 1, 0)^T .   (5)
Multiple correspondences yield an over-determined problem, for which we define a cost function of geometric distance, equation (6), to search for a solution. Here (v_j, v'_j) stands for a pair of corresponding vanishing points and θ_j is the estimate of θ based on the j-th correspondence. The cost function value represents the total cost associated with the j-th pair of correspondences, and we choose the j that minimizes this value. Up to here we have accomplished our estimation of the rotation matrix, and we can normalize the original image coordinates accordingly, as in equation (7). The normalized images will be used to estimate the camera translation.
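The sketch below is one possible formulation of the rotation estimation just described; the function names, the K-normalization, and the 1-D search used in place of equation (6) are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def Rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def Rx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rz_rx_from_vertical_vp(vp, K):
    """Angles of the z- then x-rotation that send the vertical vanishing point to
    the y direction, making imaged vertical lines parallel (Section 3.1)."""
    d = np.linalg.inv(K) @ vp            # viewing direction of the vertical vanishing point
    phi_z = np.arctan2(d[0], d[1])       # rotate about z: zero the x component
    d = Rz(phi_z) @ d
    phi_x = -np.arctan2(d[2], d[1])      # rotate about x: zero the z component
    return phi_z, phi_x

def estimate_theta(vp_pairs):
    """1-D search for the y-rotation that best transfers each normalized horizon
    vanishing point v to its correspondence v' (geometric-distance cost)."""
    def dehom(p):
        return p[:2] / p[2]
    def cost(theta):
        return sum(np.linalg.norm(dehom(Ry(theta) @ v) - dehom(vp))
                   for v, vp in vp_pairs)
    return minimize_scalar(cost, bounds=(-np.pi / 2, np.pi / 2), method='bounded').x
```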
Fig. 5. Computing θ from corresponding intersection sets between two views. Blue dots stand for intersection sets containing all point candidates, the red one is the optimal matching.
4 Estimating the Translation

Since we have already estimated the orientation of both cameras, after normalizing both image coordinates we have removed the rotation component, and the relationship between the two views becomes a pure translation t. The new camera matrices are P = [I|0] and P' = [I|t]. The essential matrix is E = [t]_×, which has only two degrees of freedom. According to the definition of the essential matrix, the following property holds true for any corresponding point pair x and x':

x'^T E\, x = 0 .   (8)
Based on this constraint, two point correspondences are enough to determine the translation t. Suppose t = (t_1, t_2, t_3)^T; then

E = [t]_\times = \begin{bmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{bmatrix} .   (9)

Substituting a pair of corresponding points in homogeneous coordinates, x = (x, y, 1)^T and x' = (x', y', 1)^T, into (8), we have

(y - y')\, t_1 + (x' - x)\, t_2 + (x y' - x' y)\, t_3 = 0 .   (10)

Given a set of correspondences with k matches, we obtain a set of linear equations of the form

A\, t = 0 ,   (11)
The solution t is the right null-space of A. For more than two point correspondences, this becomes an over-determined problem. We adopt the RANSAC [10] framework to eliminate inaccurate point measurements and correspondence outliers and to ensure the robustness of our system. Our point correspondence method is based on SIFT [4] features, which provide scale-invariant point matching across views. Matching outliers are not a big concern here because the RANSAC optimization always selects points that yield comparatively good results. Nevertheless, even if the point correspondence fails completely, we can always fall back to the baseline system using intersections of the corresponding edges we detected.
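A minimal sketch of this translation estimate under the pure-translation formulation above: for E = [t]_×, each correspondence gives the linear constraint (x × x')·t = 0, so t is the right null vector of the stacked cross products. Array names, the RANSAC threshold and the iteration count are illustrative assumptions.

```python
import numpy as np

def estimate_translation(x1, x2):
    """x1, x2: (N,3) arrays of homogeneous, rotation-rectified normalized points.
    Equations (10)-(11): rows of A are cross products x1_i x x2_i; t is the
    right null vector of A, recovered via SVD (direction only, scale unknown)."""
    A = np.cross(x1, x2)
    _, _, Vt = np.linalg.svd(A)
    t = Vt[-1]
    return t / np.linalg.norm(t)

def ransac_translation(x1, x2, iters=200, thresh=1e-3, seed=0):
    """Two-point RANSAC: score a candidate t by the residual |x2^T [t]x x1| and
    refit on the largest consensus set."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(x1), dtype=bool)
    constraints = np.cross(x1, x2)
    for _ in range(iters):
        idx = rng.choice(len(x1), size=2, replace=False)
        t = estimate_translation(x1[idx], x2[idx])
        inliers = np.abs(constraints @ t) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return estimate_translation(x1[best_inliers], x2[best_inliers]), best_inliers
```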
5 Experimental Results

We carried out two types of experiments assessing the effectiveness of the proposed method. The first category, based on synthetic data, contains three experiments, covering pure rotation about each of the three axes respectively. The second category contains four experiments based on real data from both indoor and outdoor environments, containing both rotation and translation. For the synthetic data experiments, we constructed a virtual scene using the ray tracer software Mega-Pov [14], then adjusted the camera angle and took a series of pictures. Here the camera location is fixed and we intend to separate the rotation about each of the three axes; thus three types of movement were recorded and three sequences were generated, corresponding to rotation about each of the three axes respectively. We then pick a pair of images from each sequence as our test samples (as shown in Fig. 6). We quantitatively analyze the performance by comparing the ground truth with estimates generated by a point based system using a SIFT feature detector implementation [9] as well as with the output of our system. It can be observed in Table 1 that few effective point correspondences were established due to the lack of texture in the scenes, and the accuracy is lower than 50%. In comparison, line correspondences have a higher accuracy at 77%. In addition, since we
require only a few correct correspondences for the estimation, the number of correct matches returned is sufficient for the calibration. This is further proven by the comparison of the actual estimation results: in the fourth row of Table 1, estimation by the point based method failed, whereas in the fifth row the line based method provides a fairly good estimation.
Fig. 6. Results for synthetic data. Columns 1, 2 and 3 correspond to the experiments on the three coordinate axes respectively. In each column the top two rows are the two input views and the last row is the stitched image based on the resulting estimation.
Table 1. Experimental results for synthetic data. Top two rows: correct/total correspondences detected by each method. Bottom three rows: comparison of rotation angle estimation.
                         Seq. 1    Seq. 2    Seq. 3
Point correspondences    14/32     17/40     16/35
Line correspondences     18/21     14/20     12/16
Ground truth             0.3683    0.1459    0.3579
Point based estimate     0.1291    0.5005    0.7850
Line based estimate      0.3665    0.1189    0.3142
For real data, we carried out four experiments covering different rotations and translations, shown in Fig. 7. These experiments covered both indoor and outdoor cases. Due to the lack of ground truth information in these experiments, here we only visualize the stitched images to demonstrate the performance of camera pose estimation: camera poses were determined and rectification was carried out accordingly, and then the rectified images were stitched together; a good match between the two stitched images implies that the two views have been calibrated accurately.
Fig. 7. Four experiments based on real data. In each triplet, the two smaller images are the original inputs and the larger one is the stitched image.
In Fig. 7, the top row shows the indoor examples and the bottom row the outdoor ones. The smaller image pairs containing epipolar lines are the original inputs, and the stitched results based on the estimation are shown as well. The disparity in the stitched images is caused by the camera translation across the views. As long as the scene contains more than one vertical line, which holds true in most man-made environments, our algorithm recovers the camera pose with satisfactory performance. Since our method is based mainly on lines, especially vertical lines, correspondence in poorly textured scenes such as the second one is not a problem.
6 Conclusion and Discussion

We have proposed a novel approach for two-view camera calibration based on line features. The experimental results demonstrate the advantage of our method over point based methods in poorly textured regions where good point correspondences are hard to obtain. Basic geometric constraints enable us to achieve satisfactory results while bypassing the complicated extra processing required by other line-feature-based methods, thus simplifying the process. By combining both line and point features, our system can estimate rotation and translation matrices with robust performance. Preliminary results validated our method on both natural scenes and synthetic data. We now discuss potential limitations of the proposed approach. The approach may encounter difficulties when vertical lines are occluded or broken into segments that are shorter than our threshold. Though our adaptive module guarantees that 10 lines are always detected, it cannot guarantee how many of them are actually vertical. If there turn out to be no vertical lines among them, the algorithm will fail. There are two problems yet to be solved: 1) broken vertical lines carry important structural information but cannot be used effectively; and 2) the input image could have been taken with a very large rotation angle, which would confuse the system in distinguishing vertical lines from horizontal ones. For the first problem, existing works adopt iterative processes to merge line segments as mentioned earlier, but the computational cost is high. On the other hand, human eyes are "very sensitive to spatial continuity", as summarized by David Marr [3]. One potential strategy is to look for higher level structures in noisy scenes with perceptually oriented methods such as tensor voting. As for the second problem, since it is hard to make any assumption on the basic camera pose for pictures in general, we are working on developing preprocessing techniques based on natural constraints such as shadows to determine whether the coordinate system of an input picture should be reversed before processing.
References
1. Kosecka, J., Yang, X.: Global Localization and Relative Pose Estimation Based on Scale-Invariant Feature. In: Proc. ICPR 2004 (2004)
2. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2006)
3. Marr, D.: Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, San Francisco (1982)
4. Lowe, D.G.: Object Recognition from Local Scale-Invariant Features. In: Proc. ICCV (1999)
5. Bay, H., Ferrari, V., Van Gool, L.: Wide-Baseline Stereo Matching with Line Segments. In: Proc. CVPR 2005 (2005)
6. Rother, C., Carlsson, S.: Linear Multi View Reconstruction and Camera Recovery Using Reference Plane. IJCV 49(2/3), 117–141 (2002)
7. Baillard, C., Schmid, C., Zisserman, A., Fitzgibbon, A.: Automatic Line Matching and 3D Reconstruction of Buildings from Multiple Views. In: Proc. ISPRS Conference on Automatic Extraction of GIS Objects from Digital Imagery (September 1999)
8. Zhou, J., Li, B.: Exploiting vertical lines in vision-based navigation for mobile robot platforms. In: ICASSP 2007 (2007)
9. Lowe, D.G.: Demo Software: U. of British Columbia, http://www.cs.ubc.ca/~lowe/keypoints/
10. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. CACM 24(6), 381–395 (1981)
11. Canny, J.: A computational approach to edge detection. IEEE PAMI 8(6) (November 1986)
12. Open Source Computer Vision Library, Copyright © 2000, Intel Corporation (2000)
13. Hough, P.V.C.: Method and means for recognizing complex patterns. U.S. Patent 3,069,654
14. MegaPov: unofficial POV-Ray extensions collection, http://megapov.inetart.net/index.html
Stabilizing Stereo Correspondence Computation Using Delaunay Triangulation and Planar Homography Chao-I Chen1 , Dusty Sargent1 , Chang-Ming Tsai1 , Yuan-Fang Wang1 , and Dan Koppel2 1
Department of Computer Science, University of California, Santa Barbara, CA 2 STI Medical Systems, 733 Bishop Street, Suite 3100, Honolulu, HI
Abstract. A method for stabilizing the computation of stereo correspondences is presented in this paper. Delaunay triangulation is employed to partition the input images into small, localized regions. Instead of simply assuming that the surface patches viewed from these small triangles are locally planar, we explicitly examine the planarity hypothesis in the 3D space. To perform the planarity test robustly, adjacent triangles are merged into larger polygonal patches first and then the planarity assumption is verified. Once piece-wise planar patches are identified, point correspondences within these patches are readily computed through planar homographies. These point correspondences established by planar homographies serve as the ground control points (GCPs) in the final dynamic programming (DP)-based correspondence matching process. Our experimental results show that the proposed method works well on real indoor, outdoor, and medical image data and is also more efficient than the traditional DP method.
1 Introduction
Inferring the visible surface depth from two or more images using light-path triangulation is an ill-posed problem due to the multiplicity of ways of establishing point correspondences in the input images. To alleviate this problem of ambiguity, different rectification algorithms [1,2] have been proposed to rearrange image pixels so that the corresponding points (that result from the projection of the same 3D point) will lie on the same image scan line. This configuration greatly reduces the search dimension (from 2D to 1D) of finding the matched points. Even with this powerful constraint in hand, identifying stereo correspondences is still a very challenging problem. A great number of stereo algorithms have been proposed in the past few decades and many of them are surveyed in [3,4]. Among all these algorithms, dynamic programming (DP)-based optimization techniques are often used due to their simplicity and efficiency. DP is an efficient algorithm that constructs globally optimal solutions by reuse and pruning. One common way to achieve global optimality and stabilize the DP-based stereo matching results is to impose the continuity (or smoothing) constraint. That is,
neighboring pixels (most likely) view 3D points that lie closely together, and hence, should have similar stereo disparity values. However, this constraint is only applicable to a single or a few neighboring scan lines, and applying this constraint often results in undesired streaking effects. Furthermore, without using any ground control points (GCPs), or reliable anchor points, as guidance, DP is very sensitive to the parameters chosen for the continuity constraint [5]. To improve the robustness of the current DP techniques, we present a new method that combines Delaunay triangulation and planar homographies to provide reliable GCPs and impose the continuity constraint across multiple image scan lines. We discuss our techniques in more detail below.
2 Continuity Constraint and Planar Homographies
The continuity constraint used in a stereo matching program assumes that the disparity changes continuously except when crossing the occluding boundaries. The implementation of this constraint often involves the use of a variable gap-penalty term that properly penalizes discontinuous disparity values. Although there are many different ways to design parameters for the gap-penalty, the majority of gap-penalty designs are based on some image content analysis. One common analysis is to calculate the image gradient for detecting edges. If there is a strong edge response at the current pixel location, the gap-penalty should become smaller since it is very likely that a jump boundary is present in the neighborhood. More sophisticated techniques may be applied to further exclude cases where edges are just part of the surface texture and discontinuity in stereo disparity values should still not be allowed. This type of analysis is usually complicated and data dependent. Therefore, it is very difficult, if not impossible, to design a universal set of discontinuity parameters for all kinds of input images. It is also believed that even within a single image, parameters should be adjusted for different regions due to varying signal-to-noise ratios inside the image [6]. Hence, we propose a new method that uses planar homographies to extend the continuity constraint across multiple scan lines without complicated content analysis. Without loss of generality, simple two-view cases are considered. It is well known that images of 3D points lying on a plane are related to each other in different views by a planar homography [7]. That is, if points in a 3D scene actually share a plane, we can simply transfer the image projections from one view to another by applying equation (1), where x and x' are the homogeneous representations of the corresponding points in the two views, while H is a 3 by 3 homography matrix.

x' = Hx        (1)

Unlike a traditional DP-based stereo method, where searching is always performed for every single pixel to identify its match on the corresponding image scan line of the other image, our method can identify many such point correspondences in a more efficient and reliable way. In our method, any points within a planar region can be transferred directly to another image through
the computed planar homography. This computationally efficient transformation provides a very simple way to impose the continuity constraint across multiple scan lines. Furthermore, as the planar homography is computed based on multiple, highly reliable GCP matches, we can better guarantee the robustness and accuracy of the stereo correspondences. Computing the homography matrix H is straightforward: given at least four point correspondences, the singular value decomposition (SVD) can be applied to compute the 3 by 3 homography matrix. A more challenging question remaining to be answered is this: how do we identify the "anchor" points that actually lie on the same plane? To answer this question, we need to first partition the input images into smaller patches. Intuitively speaking, pixels within a small image area are more likely to view 3D points lying on the same plane than those that are farther apart. After we extract these small patches, instead of simply assuming that every such small patch is locally planar, we explicitly examine the planarity hypothesis in the 3D space. Therefore, a measurement must be designed to determine how closely a set of 3D points fits a plane. We will describe these tasks in detail in the following sections.
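For concreteness, a minimal sketch of this SVD-based homography computation and the transfer in equation (1), assuming NumPy (a standard DLT formulation, not code taken from the paper):

import numpy as np

def fit_homography(src, dst):
    # src, dst: (N, 2) arrays of corresponding points, N >= 4.
    rows = []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]            # fix the scale (assumes H[2, 2] != 0)

def transfer(H, pts):
    # Apply x' = Hx (equation (1)) to an (N, 2) array of points.
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]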
3 Plane Detection

3.1 Feature Extraction
To partition the input images, a set of reliable point correspondences needs to be identified first. We rely on a promising local feature detection technique – Speeded-Up Robust Features (SURF) [8] – to detect these anchor point features. Feature correspondences detected by SURF are either very accurate (up to subpixel precision) or totally chaotic. This is actually a good property in the sense that we can easily identify and eliminate the chaotic matches by applying the widely used random sample consensus (RANSAC) algorithm. We note that these point correspondences can serve more than one purpose. Since we assume a general scenario in which the input images are taken by a single camera, calculating the camera's extrinsic parameters (rotation and translation) is essential for rectifying the images and inferring the surface depth by light-path triangulation. We can use the same set of high quality point correspondences to compute the extrinsic camera matrix with little extra computational cost.
3.2 Delaunay Triangulation
Given a set of reliable point correspondences, Delaunay triangulation is employed to partition only the first image. Since we know the feature correspondences, the partition of the second image is automatically determined by the feature correspondence relationships between the two input images. Delaunay triangulation produces a partition of the input image such that no point lies inside the circumcircle of any triangle. Furthermore, the Delaunay scheme tends to avoid skinny triangles because it maximizes the minimum angle over all the angles of all the triangles in the triangulation.
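A minimal sketch of this induced partition, assuming SciPy and two arrays pts1, pts2 of matched feature locations (names are illustrative, not from the paper):

import numpy as np
from scipy.spatial import Delaunay

def induced_triangulation(pts1, pts2):
    # Triangulate only the first image's matched features; reusing the same
    # vertex indices defines the corresponding (induced) triangles in image two.
    tri = Delaunay(np.asarray(pts1))
    triangles1 = np.asarray(pts1)[tri.simplices]   # (M, 3, 2) vertex coordinates
    triangles2 = np.asarray(pts2)[tri.simplices]   # same indices, second image
    return tri.simplices, triangles1, triangles2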
Fig. 1. Delaunay triangulation examples
These nice properties, however, may not hold in the second image where no Delaunay triangulation is performed and all triangles are determined by their correspondences with those in the first image. Figure 1 shows one such case. One big circle and a rectangle are in the scene. For our discussion, assume that each of these objects has three SURF features represented by the small dots in figure 1(a). Furthermore, we assume that the rectangle is closer to the view point and the big circle is occluded by the rectangle in the second image (figure 1(b)). Figure 1(c) shows the Delaunay triangulation in the first image and figure 1(d) shows its corresponding partition in the second image. As can be seen in figure 1(d), in the second image the largest triangle contains some other smaller ones, which violates the Delaunay property. This simple example may raise concerns about imposing wrong planar constraints. In figure 1(d), it is clearly wrong if we simply assume that the largest triangle lies on a plane and then use a planar homography to transfer points within this triangle from image one to image two. Fortunately, there are ways to detect and exclude this kind of abnormality. In the second image, we can eliminate triangles that contain other triangles from the planarity test. In other words, we exclude any triangle that contains feature points inside. Note that after this filtering, valid triangles on the front object will remain; in our figure 1(d) example, the small triangle inside the rectangle is still valid. More extreme cases like figure 1(e) and 1(f) may happen. In this scenario, no features are detected on the front object (the rectangle in our example), maybe due to insufficient lighting. The big triangle therefore contains no feature points inside, and hence, the method described above will fail to exclude it and may assume that pixels within this triangle share a plane. To solve this problem, intensity consistency is checked:

I(x) ≈ I'(Hx)        (2)
Equation (2) describes this image content-based constraint. It says that a point in the first image and its corresponding point calculated by homography in the second image should have similar intensity values. With this extra condition, we can eliminate figure 1(f) cases, where a large intensity discrepancy between a point and its transfer by planar homography is to be expected.
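One way to realize this check is sketched below, assuming grayscale images stored as NumPy arrays, nearest-pixel sampling, and an intensity tolerance we chose ourselves (the paper does not give these details):

import numpy as np

def intensity_consistent(img1, img2, H, pts, tol=20):
    # Equation (2): a point in image one and its homography transfer into
    # image two should have similar intensity values. pts are assumed to be
    # valid pixel coordinates (x, y) in image one.
    pts = np.asarray(pts, float)
    mapped = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    mapped = mapped[:, :2] / mapped[:, 2:3]
    ok = []
    h, w = img2.shape
    for (x1, y1), (x2, y2) in zip(pts, mapped):
        r2, c2 = int(round(y2)), int(round(x2))
        if not (0 <= r2 < h and 0 <= c2 < w):
            ok.append(False)                     # transfer leaves image two
            continue
        i1 = float(img1[int(round(y1)), int(round(x1))])
        ok.append(abs(i1 - float(img2[r2, c2])) <= tol)
    return np.array(ok)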
3.3 Planarity Testing
The hypothesis we use is that the scene we want to reconstruct contains piecewise planar surfaces. When we view the scene through small triangles in the image, it is highly likely that we are observing parts of a planar 3D surface. This assumption in general works fine if the triangle size is small and all three vertices lie on the same object. However, it may cause problems when the three vertices do not lie on the same object or are far apart. To address this issue, [9] proposes a geometry fitting method which involves complex analysis of color, texture, and edge information to detect object geometry. We, on the contrary, propose a simpler method using multiple triangles and planar homographies. There are two main reasons why we use multiple triangles. First, to compute a 3 by 3 homography matrix H, we need at least four point correspondences. Second, any three random points in 3D space form a plane. Therefore, extra information is required to help identify cases in which points actually lie on the same object. A reasonable way of selecting multiple triangles is to pick one triangle and then enlarge it to include its neighbors. The assumption behind combining adjacent triangles is that if all three vertices in a triangle lie on the same object that is planar, it is very likely that the vertices of the immediately adjacent triangles lie on the same plane as well. To examine this hypothesis, theoretically we can calculate the residual error of equation (3), where F is the fundamental matrix between the two input views and H is the 3 by 3 homography matrix calculated from all vertices that are considered to lie on a plane.

H T F + F T H = 0
(3)
Equation (3) says that the homography matrix H should be consistent, or compatible, with the fundamental matrix F. This is because a homography matrix H describes constraints only on a local planar area, while the fundamental matrix F describes the epipolar geometry, which all points in the scene must obey no matter what. The proof of equation (3) can be found in [7]. Although this equation is very elegant, unfortunately we found that it is very sensitive to noise and hard to apply in real world applications. We performed computer simulations to calculate the value of H T F + F T H for different 3D planar and non-planar configurations (different 3D locations and orientations). To be realistic, we add noise to perturb both the projected 2D feature locations and the 3D depths of these features. The noise simulates the fact that 2D features may not be accurately localized in images, and hence, 3D depths may not be recovered very accurately either. As can be seen in figure 2, the simulation result
Fig. 2. The Frobenius norm of H T F + F T H
shows a relatively flat landscape. When there is no error in 2D positions and 3D depths, it is indeed true that planar surfaces give zero H T F + F T H. But the value of H T F + F T H quickly jumps up and flattens out even with a small amount of noise added. Worse still, with large noise (or when the object is not planar), the error is about the same as in the small-noise cases. This means that planar surfaces can easily be identified as non-planar and vice versa. This is a disappointing observation because with noise perturbation we cannot distinguish non-planar cases from planar ones. Therefore, we directly compute the depth information of all these vertices through light-path triangulation and evaluate how closely they fit a plane in the 3D space. In figure 3, we illustrate this concept using one example that involves only the minimum number of points. A, B, C, and D are four points in the 3D space and A', B', C', and D' are their projections. Let us assume A'B'C' and B'C'D' are two adjacent triangles determined by Delaunay triangulation. The solid triangle ABC represents the plane defined by points A, B and C in the
3D space. Its plane normal is NABC . To compute plane homography, we need at least four points, which means at least two triangles should be merged into a larger polygon. What we would like to know in this example is whether point D also lies on the plane ABC or not. Equation (4) describes how we test planarity in the 3D space. What this equation says is that if point D is very close to the
plane defined by points A, B and C, vectors AD, BD and CD will be almost
perpendicular to plane normal NABC .
max(| NABC · AD |, | NABC · BD |, | NABC · CD |) < threshold        (4)
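A minimal sketch of this test for one extra vertex D, assuming NumPy and triangulated 3D coordinates (the threshold value is our own placeholder):

import numpy as np

def satisfies_planarity(A, B, C, D, threshold=0.01):
    # Equation (4): D lies (approximately) on the plane through A, B, C when
    # AD, BD and CD are all nearly perpendicular to the unit normal N_ABC.
    A, B, C, D = (np.asarray(p, float) for p in (A, B, C, D))
    n = np.cross(B - A, C - A)
    n = n / np.linalg.norm(n)                     # unit plane normal N_ABC
    return max(abs(np.dot(n, D - P)) for P in (A, B, C)) < threshold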
Fig. 3. Example of planarity testing
Once the planes are detected and the homography matrices are calculated, point correspondences in the planar regions can be easily determined by equation (1).
4 Optimization
The final step of our method is to optimize the stereo matching results with the help of dynamic programming (DP). The difference between conventional DP approaches and our method is that we have already established many reliable point correspondences through the planar homography constraints. These correspondences serve as the ground control points (GCPs) and help stabilize the stereo matching results. It has also been proven that GCPs can significantly reduce the computational cost of DP [5]. In order to tolerate some noise and imprecision in localizing GCPs, the GCPs are allowed to undergo a small amount of movement during the optimization process. Originally, the total disparity range when processing images taken by a single camera is [−W, W], where W is the width of the rectified images. Unlike images shot by a pre-calibrated stereo rig, which have only positive disparity values, images produced by a rectification program often have both positive and negative disparity values [1,2]. Searching for point correspondences over a large range is computationally expensive and error-prone. Therefore, we use the GCPs to further improve the speed by determining the disparity search range.
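A small sketch of how the GCP disparities might bound the DP search (one plausible reading of the above, with an assumed safety margin and sign convention; NumPy):

import numpy as np

def disparity_range_from_gcps(gcp_x_left, gcp_x_right, margin=5):
    # Disparities of the reliable GCPs (after rectification) bound the search
    # to [min - margin, max + margin] instead of the full [-W, W].
    d = np.asarray(gcp_x_right, float) - np.asarray(gcp_x_left, float)
    return int(np.floor(d.min())) - margin, int(np.ceil(d.max())) + margin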
5 Experimental Results
Two-view 3D reconstruction results of three different data sets are presented in this section. In each data set, one gray scale input image, the computed depth map, and 2D views of the constructed 3D models are shown in figure 4.
Fig. 4. Experimental results. (a) Indoor data (b) Outdoor data (c) Medical data.
Our method works well on both indoor and outdoor scenes. We also apply the proposed method in our Colon-CAD project [10] to improve the accuracy of the model building process. Figure 4(c) depicts images of a colon acquired by an endoscope with a small camera mounted at the tip. These images depict a diverticulum, an abnormal air-filled outpouching of the colonic wall, which may lead to infection, and surgery may be needed. Using the structure-from-motion framework, we were able to successfully identify this type of structure. Medical images in general are much more challenging than indoor or outdoor data because tissue colors are very similar. Hence, the depth map in figure 4(c) is noisier than the other two depth maps. However, with the help of all the anchor patches computed by our method, the depth map of the medical data does capture the important structure of the diverticulum (the circled area). Table 1 presents the timing comparison between the proposed method and the traditional DP method. Both original and rectified image sizes are provided. These two image sizes will be similar when the camera movement is roughly a sideways motion. The third column in this table shows the estimated disparity ranges. The importance of this estimation is especially clear in the medical data set, where the rectified images are large while the actual disparity range is very
Table 1. Timing comparison

Data      Image Size (Orig./Rect.)     Disp.       Feature#   Tri.+DP (sec.)   DP (sec.)
Indoor    308 × 231 / 322 × 265        [-7, 51]    208        0.04+0.45        15.56
Outdoor   426 × 568 / 435 × 581        [-97, 4]    1836       0.25+2.74        58.98
Medical   660 × 560 / 853 × 890        [-3, 26]    74         0.02+1.67        424.37
small due to a relatively flat 3D structure. By limiting the search range, a significant amount of computation can be saved. The last two columns give the number of seconds required to perform stereo matching using our method and the traditional DP method. In the column Tri.+DP, we divide the computational time into two parts: the preprocessing time, where both the Delaunay triangulation and the homographies are computed, and the DP optimization time. The listed DP optimization times in the last two columns include the running time of computing pair-wise pixel similarity, which is essential in real applications but not considered in many stereo papers when reporting the time required. As can be seen in Table 1, the computational cost spent on preprocessing, which depends on the number of features (Feature# in the fourth column) we detected, is almost negligible compared to the time saved in the DP procedure. This experiment was performed on a laptop computer with an Intel Core2 Duo T7200 processor at 2.00 GHz.
6 Conclusion
We propose a simple method to stabilize stereo correspondence computation. Scenarios where the two input images are acquired by a single camera are considered. Reliable SURF features are first detected for camera motion inference and for Delaunay triangulation. Instead of using individual triangles in the Delaunay partition and assuming that the 3D surfaces viewed through these triangles are locally planar, we combine adjacent triangles and test the planarity hypothesis in the 3D space explicitly. Once planes are identified, point correspondences within these planar areas can be easily determined through homographies. All these point correspondences serve as anchor points in the final stereo optimization procedure. To the best of our knowledge, we are the first group to impose the continuity constraint on stereo matching by examining the planarity hypothesis in 3D space with the help of Delaunay triangulation and planar homographies. Three different types of data, including medical data, are presented in our experimental results. These results show that our method not only stabilizes the reconstruction results but also significantly speeds up the stereo matching process. The speed gain over the traditional DP method is obtained by automated disparity range determination as well as by the pre-assigned anchor points from the planar homography constraints.
References 1. Pollefeys, M., Koch, R., Van Gool, L.: A simple and efficient rectification method for general motion. In: Proceedings of International Conference on Computer Vision, pp. 496–501 (1999) 2. Hartley, R.I.: Theory and practice of projective rectification. International Journal of Computer Vision 35, 115–127 (1999) 3. Brown, M.Z., Burschka, D., Hager, G.D.: Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 993–1008 (2003) 4. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47, 7–42 (2002) 5. Bobick, A.F., Intille, S.S.: Large occlusion stereo. International Journal of Computer Vision 33, 181–200 (1999) 6. Gong, M., Yang, Y.H.: Fast stereo matching using reliability-based dynamic programming and consistency constraints. In: Proceedings of the 9th IEEE International Conference on Computer Vision, vol. 1, pp. 610–617 (2003) 7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) 8. Bay, H., Tuytelaars, T., Van Gool, L.: Surf: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404– 417. Springer, Heidelberg (2006) 9. Li, P., Farin, D., Gunnewiek, R.K., de With, P.: On creating depth maps from monoscopic video using structure from motion. In: Proceedings of 27th Symposium on Information Theory in the Benelux, pp. 508–515 (2006) 10. Koppel, D., Chen, C.I., Wang, Y.F., Lee, H., Gu, J., Poirson, A., Wolters, R.: Toward automated model building from video in computer assisted diagnoses in colonoscopy. In: Proceedings of the SPIE Medical Imaging Conference (2007)
Immersive Visualization and Analysis of LiDAR Data Oliver Kreylos1, Gerald W. Bawden2 , and Louise H. Kellogg1,3 1
W. M. Keck Center for Active Visualization in the Earth Sciences (KeckCAVES) 2 United States Geological Survey 3 Department of Geology, University of California, Davis
Abstract. We describe an immersive visualization application for point cloud data gathered by LiDAR (Light Detection And Ranging) scanners. LiDAR is used by geophysicists and engineers to make highly accurate measurements of the landscape for study of natural hazards such as floods and earthquakes. The large point cloud data sets provided by LiDAR scans create a significant technical challenge for visualizing, assessing, and interpreting these data. Our system uses an out-of-core view-dependent multiresolution rendering scheme that supports rendering of data sets containing billions of 3D points at the frame rates required for immersion (48–60 fps). The visualization system is the foundation for several interactive analysis tools for quality control, extraction of survey measurements, and the extraction of isolated point cloud features. The software is used extensively by researchers at the UC Davis Department of Geology and the U.S. Geological Survey, who report that it offers several significant advantages over other analysis methods for the same type of data, especially when used in an immersive visualization environment such as a CAVE.
1 Introduction
LiDAR (Light Detection And Ranging) is emerging as a powerful tool for rapid investigation of the response of the landscape, including engineered structures like buildings, to geologic events including earthquakes and landslides. A LiDAR data set is represented as a set of 3D points – sometimes attributed with RGB color or intensity values – without connectivity. LiDAR data can be categorized into two main classes: airborne LiDAR (also known as airborne laser swath mapping, ALSM) is gathered by mounting a downward-scanning laser on a low-flying aircraft, and ground-based or terrestrial LiDAR is gathered by mounting a portable scanner on a tripod and scanning horizontally. In both classes, LiDAR data sets typically consist of sets of individual scans, or airborne flight swaths, that were co-registered and merged after gathering. Airborne LiDAR is often used to generate digital elevation models (DEMs), and can be thought of as representing a height field, such that zi = f (xi , yi ) for each point pi = (xi , yi , zi ) and some underlying function f . This representation is often valid since the laser scans down almost vertically and the DEMs are typically constructed using only
Fig. 1. Left: a ground-based LiDAR data set containing 4.7 million points reflected off of both engineered (buildings, water tower) and natural (trees) features. Center: ground-based LiDAR scan containing 23 million points of a house whose foundations (highlighted in green) were exposed by a landslide. Right: a user in a 4-sided CAVE interacting with an airborne LiDAR data set containing 308 million points.
the lowermost or “ground” laser return from each laser pulse. This process creates a 2.5D product in which the true 3D nature of the data has been traded for ease of processing. This approach does not work for ground-based LiDAR where the data cannot be assumed to represent a height field; and even for airborne LiDAR, a lot of detail can be lost in the conversion process due to the non-uniform spacing of the raw points and the ability of the scanner to record multiple light returns for each laser pulse. However, the straightforward alternative, directly rendering the raw 3D points, does not work well for conventional visualization. The lack of depth cues in the point clouds, especially shading, makes it difficult to perceive the 3D shapes represented in the data. An additional challenge lies in data analysis. An important task is to (visually) identify man-made or natural features in the data, and then fit simple geometric primitives such as planes, spheres, or cylinders, or complex free-form surfaces, to the subset of points defining those features. Derived quantities such as plane equations are essential to measure and compare LiDAR data sets. The problem is that features of interest are often small, i. e., sampled with only relatively few points, and surrounded by “noise,” such as sampled vegetation. Isolating targeted features primarily relies on a user’s ability to visually distinguish feature points from noise, and is especially tedious and time-consuming in desktop environments due to the lack of depth perception and the limitation to 2D (mouse-based) interaction. Immersive, i. e., head-tracked and stereoscopic, visualization solves the perceptual problem of visualizing scattered, non-shaded point cloud data. The direct interaction provided by 3D input devices enables the isolation and extraction of features. In combination, immersive visualization and 3D interaction allow users to work directly with the “raw” LiDAR point cloud data, thereby expanding the quality control capabilities, increasing the accuracy and ease of use of point selection for feature extraction, and enhancing the overall analytical potential of the 3D data. Immersive visualization exposes another LiDAR challenge: it requires fast rendering of (optimally) 48–60 stereoscopic frames per second to maintain the
immersion that is essential for 3D perception and intuitive interaction. LiDAR data sets typically contain between several million and several billion 3D points, and current graphics hardware is unable to directly render such large sets at the required rates. We therefore implemented a multiresolution, view-dependent, out-of-core rendering scheme based on an octree data structure to maintain high frame rates even for multi-billion point data sets, described in Section 3. Tied into the rendering scheme are interaction methods that allow users to isolate and extract features in real time from inside the immersive visualization. These methods are described in Section 4.
2 Related Work
LiDAR data has previously mostly been visualized as triangulated DEMs using geographical information system software such as ArcGIS [1], with the concomitant problems discussed above. Some recently developed packages such as LViz [2] also allow point cloud visualization, but, to our knowledge, use nonhierarchical methods and cannot handle large data sets. Research into (multiresolution) point cloud rendering has mostly focused on applications where surfaces were assumed to be sampled by very dense point sets, and achieved continuous shaded rendering of the sampled surfaces using splatting [3,4]. In LiDAR applications, on the other hand, features of interest are typically not densely sampled, and the interpolation inherent in splatting interferes with data analysis. A more relevant recent work is the octree-based point renderer described by Wimmer and Scheiblauer [5]. It uses a similar preprocessing step, but a different data representation and view-dependent rendering algorithms.
3 Multiresolution Point Cloud Visualization
To maintain sufficient frame rates for immersive visualization even for very large point sets, we use an octree-based multiresolution data representation. The octree's domain is aligned with the principal axes and covers the bounding box of the source point cloud. Each leaf node stores the set of source points contained within the node's domain, and each interior node contains a subset of the union of point sets stored in its children. When constructing the tree, we recursively split nodes starting at the root until no leaf contains more than a user-defined number of points, and then downsample the point sets in each node's children while moving back up the tree to generate the lower-resolution approximations used during rendering. Downsampling collapses clusters of nearby points, but chooses an unmodified source point as cluster representative, in order not to introduce artificial points. The point octree is created in a preprocessing step and saved as a file which serves as input data for the visualization application itself. During rendering, we recursively traverse the octree from the root and check whether each entered node's domain intersects the current view frustum, and whether the node contains enough point detail for the current view. The
nodes required for rendering are loaded from the input file on demand, to allow rendering point sets that are too large to fit into main memory.

3.1 Preprocessing
The preprocessor reads raw LiDAR data from input files in a variety of common formats, and writes the union of the loaded point sets into a multiresolution octree file. Each node N in the octree file contains a detail size d(N) and a set of points Q(N), which is either the full set of source points contained in the node's domain, or a downsampled version thereof. The preprocessing algorithm starts by computing the octree's domain to cover the point set's bounding box, and then calls procedure CreateSubTree with the root node and octree domain, the entire source point set P, and a user-specified maximum number of points to be stored in each node.

Procedure: CreateSubTree(N, D, P, maxNumNodePoints). Recursively constructs the point octree for node N with domain D and the source point set P, with at most maxNumNodePoints points stored in each node.

if |P| > maxNumNodePoints then
    Create N's child nodes N1, ..., N8
    Split D into the child nodes' domains D1, ..., D8
    Split P into Pi such that for each i = 1, ..., 8: Pi ⊆ Di
    foreach i = 1, ..., 8 do
        CreateSubTree(Ni, Di, Pi, maxNumNodePoints)
    Set Q(N) := Q(N1) ∪ ... ∪ Q(N8), the union of the child nodes' point sets
    Set d(N) := 0
    while |Q(N)| > maxNumNodePoints do
        Find (pi, pj) ∈ Q(N), the pair of closest points in Q(N)
        Set d(N) := 2 · |pi − pj|
        foreach pk ∈ Q(N) with k ≠ i ∧ |pk − pi| < d(N) do
            Remove pk from Q(N)
    Write N to the octree file as an interior node
else
    Set Q(N) := P and d(N) := 0
    Write N to the octree file as a leaf node
This algorithm creates an octree whose nodes contain close to, but never more than, the given maximum number of points, and whose interior nodes approach uniform local point density even if the source data exhibits highly non-uniform sampling (which is typical for ground-based LiDAR data, where the local point density is inversely proportional to distance from the scanner). The actual algorithm uses additional spatial data structures for closest-point location and point removal during downsampling and an out-of-core source data management scheme to work for very large source point sets; both aspects are beyond the scope of this paper. The created octree contains the full unmodified source data and supports fast extraction of point subsets due to its hierarchical structure. This makes it a
Fig. 2. Graph of preprocessing time vs source data size. Measurements were taken on a Linux PC with a 2.4 GHz Intel Core2 Quad CPU, 4 GB of RAM, and a 1.4 TB 3-disk striped RAID.
convenient format for data archiving, and it is typically generated shortly after data gathering. Preprocessing is reasonably fast: on a Linux PC with a 2.4 GHz Intel Core2 Quad CPU and using at most 2 GB of RAM and a single core, it took 18 seconds to process a small data set containing 4.7 million points; the same machine took 119 minutes for a large data set containing 1.5 billion points (see Figure 2). In the latter example, total source data size was around 24 GB, and the program utilized a 1.4 TB 3-disk striped RAID for the source data, the temporary files required by the out-of-core data management scheme, and the produced octree file. To put these numbers in perspective: gathering the source data took about an afternoon for the small data set, and several weeks for the large data set.
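The closest-pair search and cluster collapse inside CreateSubTree can be sketched with a k-d tree; SciPy is shown here purely as one possible spatial structure, since the paper does not specify its out-of-core implementation:

import numpy as np
from scipy.spatial import cKDTree

def downsample_node(points, max_points):
    # Collapse clusters of nearby points until at most max_points remain,
    # always keeping an unmodified source point as the cluster representative.
    # Returns (kept points, detail size d(N)).
    pts = np.asarray(points, dtype=float)
    detail = 0.0
    while len(pts) > max_points:
        dist, _ = cKDTree(pts).query(pts, k=2)        # nearest neighbour of each point
        j = int(np.argmin(dist[:, 1]))                # one member of the closest pair
        detail = max(2.0 * float(dist[j, 1]), 1e-12)  # guard against exact duplicates
        keep = np.linalg.norm(pts - pts[j], axis=1) >= detail
        keep[j] = True                                # keep the representative itself
        pts = pts[keep]
    return pts, detail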
Fig. 3. Three renderings of an airborne LiDAR data set containing 308 million points, at different quality settings in a 1920 × 1200 window. Left: Low focus+context strength, full detail level, 5.2M points rendered, 57.2 frames/s; center: normal focus+context strength, full detail level, 2.4M points rendered, 62.5 frames/s; right: high focus+context strength, low detail level, 1.1M points rendered, 232.9 frames/s. The quality settings shown in the right image are sufficient for interactive analysis, because objects in the user’s workspace are still rendered at high-enough detail. Reducing focus+context strength in the left image to 0.0 will reduce the frame rate to about 12 frames/s, at no visible quality increase.
3.2 Rendering
The LiDAR visualization application renders a single point set stored in a point octree file created during preprocessing. To maintain the frame rates required
for immersive visualization even for very large point sets, it uses an out-of-core view-dependent multiresolution rendering scheme. The application starts instantaneously, independent of data size, and never blocks to read data from the input file. It only loads the octree's root node during startup, and all later file input is handled by a background thread to avoid interfering with the immersive visualization. At the beginning of each frame, the renderer determines the current view frustum and projection parameters, and calls procedure RenderSubTree with the root node and octree domain, the view frustum, and a level-of-detail (LOD) bias value. The result of this algorithm is that the renderer will cull invisible subtrees, and render downsampled versions of the original point set if the two farthest-apart points that were removed from a node during preprocessing would be projected to the same or neighboring pixels. In other words, it will not render points that would fall outside of the screen, or points that would be hidden by other already rendered points. The renderer will also not enter nodes whose point data have not yet been loaded from the octree file. Instead, the renderer will issue load requests to a disk cache manager running in a separate thread that will attempt to load node data in order of decreasing priority. This ensures that the rendering thread never blocks for disk access, but instead renders downsampled point sets until higher-resolution data arrives. The additional LOD bias parameter allows either the user or an automatic governing algorithm to adjust the overall approximation quality of the rendering to steer effective frame rates towards a desired range. To improve rendering performance without sacrificing rendering quality in the user's workspace, we added an additional focus+context LOD bias that reduces the detail levels of nodes outside the focus region, typically a sphere roughly encircling the used VR environment. The internals of the disk cache manager, and the secondary cache keeping currently rendered nodes in graphics card memory as vertex buffer objects, are beyond the scope of this paper. We have been using the multiresolution LiDAR visualization algorithm on a large number of LiDAR data sets of sizes ranging from several million to more than a billion 3D points. The largest data set we have worked with so far, an airborne LiDAR data set of the southern San Andreas fault, contains 1.4 billion points. In our 4-sided CAVE, which is run by a cluster of Linux PCs with dual 2.2 GHz AMD Opteron CPUs, 4 GB of RAM, an Nvidia GeForce 6600 GPU, and an NFS-mounted striped disk array on the head node and single 2.2 GHz AMD Athlon 64 CPUs, 2 GB of RAM, and Nvidia Quadro FX 4400 GPUs in 96 Hz active stereo mode on the four render nodes, and 1 Gb/s Ethernet interconnect, we consistently achieved frame rates of more than 30 fps at full rendering quality, and the desired 48 fps at slightly reduced quality.
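The level-of-detail decision can be pictured as comparing a node's stored detail size d(N) with its projected size in pixels; a rough sketch under a simple pinhole-camera assumption with positive node distance (not the authors' exact criterion):

import math

def projected_detail_pixels(detail_size, node_distance, fov_y, viewport_height):
    # Approximate pixel span of d(N) at the node's distance from the eye,
    # for a symmetric pinhole projection with vertical field of view fov_y (radians).
    pixels_per_unit = viewport_height / (2.0 * node_distance * math.tan(fov_y / 2.0))
    return detail_size * pixels_per_unit

def should_refine(detail_size, node_distance, fov_y, viewport_height, lod_bias):
    # Mirrors the dF(N) > lodBias test: descend to the children only when the
    # detail removed during downsampling would span more than lod_bias pixels.
    return projected_detail_pixels(detail_size, node_distance, fov_y, viewport_height) > lod_bias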
4 Interactive Analysis of Large Point Clouds
The three main analysis tasks addressed by our LiDAR visualization application are: quality control of the LiDAR scanning process, extraction of survey measurements from the data, and the extraction of isolated point cloud features.
Quality control, especially the detection of misalignments between individual LiDAR scans or flight swaths, is a vital component in validating the quality of any LiDAR data set and its derived products. For ground-based LiDAR this is mostly a visual process and needs few tools beyond the immersive visualization provided by the multiresolution renderer and the intuitive navigation provided by the VR environment. The quality control assessment of airborne LiDAR additionally employs a tool that colors the point cloud data according to distance from a user-defined plane. This allows the user to easily identify misaligned data as color gradients in the imagery. A user will explore the data set, focusing on the areas of overlap between adjacent scans, and look for signs of misalignment such as offset replicated surfaces (see Figure 4). If such signs are found, the user will measure their positions and the amount of displacement using a simple 3D measurement tool and adjust the source data accordingly. Similarly, the user can isolate specific landmarks or points in the 3D point cloud data and use the same measurement tool to extract survey-quality 3D positional relationships among the selected points.

Procedure: RenderSubTree(N, D, F, lodBias). Renders the subtree rooted at node N with domain D using the view specification/view frustum F and level-of-detail bias lodBias.

if |Q(N)| > 0 and D intersects the frustum F then
    if N is interior node then
        Calculate dF(N), N's projected detail size in frustum F in pixel units
        if N is outside focus region then
            Reduce dF(N) by context weight times distance from focus region
        if dF(N) > lodBias then
            if N's children are loaded then
                Split D into the child nodes' domains D1, ..., D8
                foreach i = 1, ..., 8 do
                    RenderSubTree(Ni, Di, F, lodBias)
            else
                Issue load request (N, dF(N)) to load N's children with priority dF(N)
                Render Q(N)
        else
            Render Q(N)
    else
        Render Q(N)
Feature extraction fits mathematical primitives such as planes or cylinders to selected sets of points using least-squares approximation. It reduces the positions of thousands to millions of points to a few quantities such as a plane equation, and allows the measurement of landmarks based on these derived quantities instead of the original data. The main benefits of feature extraction are that it removes noise intrinsic to the scanning technology (typically ±0.4 cm for ground-based and ±15 cm for airborne LiDAR) by averaging over large numbers of points, and that it allows the comparison of landmarks between multiple data
Fig. 4. Misalignment between two ground-based LiDAR scans containing a steam pipe. The two cylinders were extracted from separate scans and are offset by approximately 12 cm. Points sampling the pipe had to be isolated from the surrounding vegetation (highlighted in green) in order to extract well-fitting cylinders.
Fig. 5. Feature extraction from point clouds. Left: a landmark feature in a ground-based LiDAR data set. Center: the user isolated the feature by selecting the points sampling it using the 3D selection brush (yellow sphere). Right: the cylinder approximating the isolated points.
sets in a time series. The latter cannot be achieved using only the point data due to noise and the lack of a one-to-one relationship between points sampling the same surface in multiple scans. Feature extraction is a very important analysis technique; one main benefit of LiDAR over traditional surveying techniques is that any distinct objects in the scanned scene, such as tree trunks, fence posts, road signs, etc., can be measured and used as landmarks. It is, however, an involved task that requires user interaction in several steps (see Figure 5): first, a user has to detect the presence of a potential landmark, such as a fence post, by visual inspection. The user then separates the 3D points sampling the landmark from the surrounding points, which, depending on the amount of vegetation and other “noise,” can be a time-consuming process. Finally, the user extracts derived quantities, such as plane equations or cylinder axes and radii, from the isolated point set. Our LiDAR visualization application provides a simple but powerful tool for feature isolation. On the user’s request, a 3D selection brush, currently a sphere of adjustable radius, can be associated with any input device present in the VR environment. When a button on the device is pressed, the program will either add or remove all LiDAR points inside the sphere to or from the selected point set, depending on a user interface setting. The point selection/deselection algorithm uses the same multiresolution octree as the rendering algorithm (see procedure SelectPointsInSubTree); hence, point selection is instantaneous even
Procedure: SelectPointsInSubTree(N, D, c, r). Selects all points from the subtree rooted at node N with domain D that are inside the sphere of radius r centered at point c.

if |Q(N)| > 0 and D intersects sphere (c, r) then
    if N is interior node and N's children are loaded then
        Split D into the child nodes' domains D1, ..., D8
        foreach i = 1, ..., 8 do
            SelectPointsInSubTree(Ni, Di, c, r)
    else
        foreach p ∈ Q(N) do
            if |p − c| ≤ r then
                Select point p
for LiDAR data containing billions of points. In effect, the user can keep the button pressed and sweep the input device through the data for fast and intuitive selection of larger point subsets. Observations show that our users typically start with a rough sketch by quickly sweeping a large sphere through the data, and then refine the selection by using a smaller brush and adding/removing points in smaller groups or even one at a time. The pseudocode for procedure SelectPointsInSubTree shows that selection, and therefore feature extraction, only considers the points currently held in memory. This is usually not a concern since users will refine their selections while zoomed in to a level where the program already shows the desired feature at full resolution. Nonetheless, we added another bias factor to the rendering algorithm that requests to subdivide all interior nodes that intersect any current selection brushes. This additional bias is omitted from the pseudocode for procedure RenderSubTree for clarity. Once the points defining a feature of interest have been isolated with sufficient accuracy, a user can select a primitive type (plane, cylinder, etc.) from a pop-up menu and the program will automatically fit a primitive of the selected type to the selected point set using a non-linear least-squares approximation implemented as a Levenberg-Marquardt algorithm [6]. Depending on the number of selected points and the primitive type’s mathematical complexity, feature extraction can take from fractions of a second to several seconds. The extraction algorithm runs in a separate thread to avoid interfering with the immersive visualization. Once the thread finishes, the new primitive is added to the visualization, and its parameters are written to a file for further analysis.
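For the plane primitive, the least-squares fit reduces to a centroid plus smallest-singular-vector computation; a minimal NumPy sketch of that special case (the application itself uses Levenberg–Marquardt for general primitive types):

import numpy as np

def fit_plane(points):
    # Least-squares plane through the selected points.
    # Returns (unit normal n, offset d) such that n . x + d ~ 0 for points x.
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    n = vt[-1]                      # direction of least variance = plane normal
    return n, -float(np.dot(n, centroid))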
5 Results
LiDAR technology holds great promise for assessing natural hazards and resources by enabling the precise measurement of earth movement associated with faults, landslides, volcanoes, and fluid pumping [7]. As the quantity of data increases, there is a growing demand for methods such as those presented here that allow rapid, direct, and effective analysis and interpretation of these point cloud
data. Researchers in the UC Davis Department of Geology and the U.S. Geological Survey use the system extensively in a variety of environments, ranging from desktop PCs over portable stereoscopic projection systems to CAVEs, to visualize and analyze LiDAR data sets ranging from a few million to about 1.4 billion points. Their experience shows that the immersion provided by head-tracked stereoscopic environments is crucial in identifying subtle flaws in LiDAR data, and that the interactivity provided by tracked input devices supports isolating landmark features more quickly and accurately than previously possible. As a result, the system has become an integral part of the LiDAR gathering workflow, being used at all stages from scan merging over quality control assessment to data analysis.
References
1. ESRI: ArcGIS – the complete geographic information system (2008), http://www.esri.com/software/arcgis/
2. Conner, J.: LViz: 3D LiDAR visualization tool (2008), http://lidar.asu.edu/LViz.html
3. Alexa, M., Behr, J., Cohen-Or, D., Fleishman, S., Levin, D., Silva, C.T.: Point set surfaces. In: Vis 2001: Proceedings of the conference on Visualization 2001, Washington, DC, USA, pp. 21–28. IEEE Computer Society, Los Alamitos (2001)
4. Rusinkiewicz, S., Levoy, M.: QSplat: A multiresolution point rendering system for large meshes. In: SIGGRAPH 2000: Proceedings of the 27th annual conference on computer graphics and interactive techniques, pp. 343–352. ACM Press/Addison-Wesley Publishing Co., New York (2000)
5. Wimmer, M., Scheiblauer, C.: Instant points. In: Proceedings Symposium on Point-Based Graphics 2006, pp. 129–136. Eurographics Association (2006)
6. Moré, J.J.: The Levenberg-Marquardt algorithm: implementation and theory. In: Watson, G.A. (ed.) Numerical Analysis. Lecture Notes in Mathematics, vol. 630, pp. 105–116. Springer, New York (1977)
7. Crane, M., Clayton, T., Raabe, E., Storker, J., Handley, L., Bawden, G., Morgan, K., Queija, V.: Report of the U.S. Geological Survey LiDAR workshop. Technical Report Open File Report 04–106, U.S. Geological Survey (2004)
VR Visualisation as an Interdisciplinary Collaborative Data Exploration Tool for Large Eddy Simulations of Biosphere-Atmosphere Interactions Gil Bohrer1, Marcos Longo2, David J. Zielinski3, and Rachael Brady3 1
Ohio State University, department of Civil and Environmental Engineering and Geodetic Science, 470 Hitchcock Hall, 2070 Neil Ave. Columbus, OH, 43210 2 Harvard University, Department of Earth and Planetary Sciences, 20 Oxford St., Cambridge, MA 02138 3 Duke University, Visualization Technology Group, 130 Hudson Hall, Box 90291, NC, 27708
Abstract. Scientific research has become increasingly interdisciplinary, and clear communication is fundamental when bringing together specialists from different areas of knowledge. This work discusses the role of a fully immersive virtual reality experience in facilitating interdisciplinary communication by utilising the Duke Immersive Virtual Environment (DiVE), a CAVE-like system, to explore the complex and high-resolution results from the Regional Atmospheric Modelling System-based Forest Large-Eddy Simulation (RAFLES) model coupled with the Ecosystem Demography model (ED2). VR exploration provided an intuitive environment to simultaneously analyse canopy structure and atmospheric turbulence and fluxes, attracting and engaging specialists from various backgrounds during the early stages of the data analysis. The VR environment facilitated exploration of large multivariate data with complex and not fully understood non-linear interactions in an intuitive and interactive way. This proved fundamental to formulate hypotheses about tree-scale atmosphere-canopy-structure interactions and define the most meaningful ways to display the results.
1 Introduction “Interdisciplinary research is a mantra of science policy” [1]. As our knowledge in each area advances, the need for information from other areas of knowledge becomes increasingly important. More and more of the critical research issues consist of problems that involve interactions between topics that have classically been affiliated with different disciplines. For example, global economic growth, climate change, and the recent food crisis cannot be fully understood by focusing only on economic, meteorological, or agronomic aspects: indeed, all these questions are interrelated, as they are linked to several other areas of knowledge. On the other hand, the interdisciplinary character of such questions poses the challenge of bringing together specialists from different areas of knowledge, and clear communication across disciplinary boundaries is fundamental in this case.
Virtual reality visualisation can be used as a powerful tool to facilitate communication. As pointed out by Riva [2], virtual reality constitutes an efficient communication tool because it is able to bring different inputs and data into a single experience and, more importantly, allows the users to explore and immerse themselves in virtual environments. The communicative efficacy of this fully immersive experience is confirmed by its use to foster scientific discussions [3] and its increasing use as an educational tool [4, 5].
This work describes the development of a VR visualisation data viewer and explorer for the simulation output of the Regional-Atmospheric-Modelling-System-based Forest Large Eddy Simulation (RAFLES) [6, 7]. This model studies biosphere-atmosphere interactions and, in particular, the effects of sub-tree-scale (very high resolution in meteorological terms) heterogeneity of the forest canopy on wind flow, turbulence and fluxes inside the canopy and above it in the atmospheric boundary layer. Intrinsically an interdisciplinary research tool, this high-power computing model resolves the interactions between forest canopy structure, ecosystem dynamics and the biological functioning of trees, atmospheric fluid dynamics, turbulence and dispersion. Our project was the only study that simulated forest canopies as naturally realistic, 3-D heterogeneous structures that interact with atmospheric flow at high resolution (since we started our work two other research groups have used 3-D heterogeneous forest large eddy simulations [8-12]). Not much was known about the fine-scale effects of canopy structure, its interactions with turbulence, and with plant functions such as transpiration of water vapour. Atmospheric boundary layer flows, like any turbulent process, have a chaotic structure with complex, fractal-structured eddies that advect and evolve in space and time. A visualisation tool for data exploration was needed at the early stages of the study in order to detect and identify such effects, because we had no a priori knowledge of what the effects of canopy structure on the flow were, where they were concentrated, and which of the resulting turbulence features were most affected by particular canopy structures. This tool had to present both the dynamic atmospheric variables and the canopy structures in the same space, animate them in time, and do so in a way that would be intuitively interpreted by collaborating researchers from the atmosphere, ecology and remote sensing disciplines.
The main goal of this work is to present a system in which the modelled interactions between the heterogeneous structure of the canopy and the atmosphere can be visualised in an intuitive and interactive way. The data was explored inside the Duke Immersive Virtual Environment (DiVE), a 6-sided CAVE-like system [13]. We utilized the commercial scientific visualisation package AMIRA [14]. Although we used this system to analyse the interactions between canopy and atmosphere at a very fine resolution, it could easily be applied to other types of studies, such as regional and global scale simulations of biosphere-atmosphere interactions. This system is an effective tool to promote communication and facilitate collaboration at the early stages of data analysis between different scientific communities, and to present the results of complex numerical models that study non-linear interactions between physical systems from different scientific disciplines.
In this paper we briefly describe the forest model and the VR system design. We then present how shared interdisciplinary data exploration in the DiVE with collaborating researchers resulted in scientific hypotheses that we were able to verify quantitatively using statistical methods.
2 Related Work Virtual Reality in Atmospheric Sciences. Atmospheric science is among the sciences most dependent on visual tools. In fact, most of the current knowledge of atmospheric dynamics became possible only after the development of the first meteorological charts during the 19th century [15]. By plotting observations collected at different places and integrating them into a two-dimensional chart, early meteorologists were able to identify weather patterns, giving rise to what is called synoptic meteorology, which has since become the basis for most theories in atmospheric science and weather forecasting. As the knowledge of atmospheric dynamics evolved, atmospheric scientists realized that atmospheric systems have complex three-dimensional structures. The traditional two-dimensional plots and meteorological charts were limited in their ability to visualise both horizontal and vertical structures. Currently, meteorological model results, including those of RAMS, the parent model of the LES described here, are commonly visualised by specialized interfaces such as Vis5D. Virtual reality has always had strong appeal among atmospheric scientists since it can simplify the visualisation of such complex structures. In fact, many three-dimensional visual tools have been developed specifically for atmospheric analysis, for example the VIS-5D software, developed at the University of Wisconsin during the late 1980s [16, 17]. During the 1990s several virtual reality projects were developed from VIS-5D, including a CAVE- and ImmersaDesk-friendly version called CAVE-5D [18]. Recently, the Visualisation Laboratory at the National Center for Atmospheric Research (NCAR) contributed to the new VIS-5D generation, called Vis5D+ [19], by including an option for stereo/3D projection. The CAVCaM visualisation facility of the Desert Research Institute [20] included CAVE-5D capabilities for the visualisation of wild fire and of atmospheric conditions such as the dispersion of dust and other aerosols, including chemical and biological agents. These simulations are driven by atmospheric models (WRF and RAMS). VIS-5D has become extremely popular in the meteorological community and it is currently used for scientific purposes ranging from pollution dispersion [21] to transport of debris by tornadoes [22], as well as for educational purposes [23, 24]. It does not have built-in tools for visualizing forests. Surprisingly, few works have effectively used virtual reality to study atmospheric surface layer processes and their interaction with land surface heterogeneity and structure other than topography. Examples include Loth et al. [25], who developed a technique to study particle dispersion within the atmospheric boundary layer using immersive virtual reality, and Nichol and Wong [26], who used virtual reality combined with remote sensing data to depict temperature variations due to heterogeneous urban features in Hong Kong, which proved a useful tool for identifying the effects of wall shading and vegetated pockets within the city. Virtual Reality in Canopy Studies. Studies concerning canopy structure and forest landscapes are common and there are several different tools available, as reviewed by Sen and Day [27] and Pretzsch et al. [28]. Most of these tools are focused on the representation of individual trees in a highly realistic way, and some even include environmental factors such as light competition, temperature and shading obstacles to drive the tree growth and shape [29]. However, as pointed out by Pretzsch et al.
[28], fewer studies have extended to more complex plant ecosystems [30], and even fewer have the
interactive character and portability to integrate model output [31]. To generate the realistic look of a tree or a forest stand, these models demand a large set of parameters describing the stand. For explicit, real forest stands these parameters are rarely available over the large domains (several km²) needed for even a small-scale atmospheric boundary layer simulation. While many of these parameters are essential for the realistic visualisation of trees, most of them are not effective in the interaction between the forest and the atmosphere and, thus, are not represented in atmosphere-biosphere simulations. Furthermore, no study has shown the interaction between the canopy structure and the surrounding atmosphere.
3 Method Atmosphere-biosphere Large Eddy Simulation. Biosphere-atmosphere interactions couple vegetation function and structure with the atmosphere. This two-way coupling includes non-linear biophysical feedback mechanisms, primarily through the effects on the fluxes of water and heat emitted by vegetation and by modification to the soil moisture. The RAMS-based Forest Large Eddy Simulation (RAFLES) [6] explicitly resolves turbulence in realistic, observation-based, 3-D heterogeneous canopies. RAFLES' parent model, RAMS [32], is a regional atmospheric model with grid nesting capabilities that can run as an LES. It solves the Navier-Stokes equations, using a quasi-hydrostatic approach and the Boussinesq approximation, on a rectangular, vertically stretched grid mesh with a staggered grid scheme, and uses a split-time leapfrog discretization scheme. It includes radiation and dispersion parameterizations. The subgrid-scale turbulence kinetic energy (SGS-TKE) parameterization scheme is based on Deardorff [33], modified with additional turbulence backscatter terms [34]. RAFLES includes several developments beyond its parent model RAMS, specifically introduced to handle the effects of the canopy and to increase the numerical stability at the typically high spatial and temporal resolutions of the simulations. It includes a multi-layer, 3-D heterogeneous canopy. It allows for the effects of leaves on drag and fluxes to the atmospheric surface layer in the canopy air space. Tree stems are represented in the atmospheric model as restrictions to the free-air volume. Its typical grid-mesh spacing, of the order of 1 m³, allows for the simulation of many of the features that are generated by tree-crown structures. Its simulation domain, typically of the order of 1 km³, is large enough to simulate a fully dynamic boundary layer. The canopy structure in the model can be prescribed, based on remote sensing [35], or, as described here, constructed by the Virtual Canopy Generator (V-CaGe [36]). V-CaGe generates a random canopy field, λ, using a 2-D Fourier transform with a specified correlation function and a random phase, together with observed allometric relationships of the simulated forest stand. Ecosystem dynamics and the biological functioning of trees, in terms of leaf temperature, reflected and absorbed radiation, and water vapour, heat and CO2 fluxes from leaf surfaces to the atmosphere, are calculated by a dynamically coupled version of the Ecosystem Demography model (ED2RAMS [37]).
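The spectral construction used by V-CaGe can be illustrated with a short sketch. This is a minimal, illustrative implementation that assumes a Gaussian correlation function and a simple rescaling to a mean canopy height and standard deviation; the actual V-CaGe parameterisation and allometric relationships are described in [36] and are not reproduced here.

```python
import numpy as np

def generate_canopy_field(nx=128, ny=128, dx=1.0, corr_len=20.0,
                          mean_height=20.0, height_sd=4.0, seed=0):
    """Random canopy-top height field built from a 2-D Fourier transform with
    a prescribed (here Gaussian) correlation function and random phases.
    A simplified stand-in for the V-CaGe procedure described in the text."""
    rng = np.random.default_rng(seed)
    kx = np.fft.fftfreq(nx, d=dx) * 2.0 * np.pi
    ky = np.fft.fftfreq(ny, d=dx) * 2.0 * np.pi
    KX, KY = np.meshgrid(kx, ky, indexing="ij")
    k2 = KX**2 + KY**2
    # Amplitude spectrum corresponding to an assumed Gaussian correlation.
    amplitude = np.exp(-0.25 * k2 * corr_len**2)
    # Random phases give a random field with the prescribed correlation.
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(nx, ny))
    field = np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))
    # Rescale to the observed stand statistics (allometry would refine this).
    field = (field - field.mean()) / field.std()
    return mean_height + height_sd * field

canopy_top = generate_canopy_field()
print(canopy_top.shape, float(canopy_top.mean()), float(canopy_top.std()))
```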
Visualisation in the DiVE facility. The DiVE is a 3 m x 3 m x 3 m room with a rigid ceiling and floor, flexible walls and a sliding door, driven by a seven-PC workstation cluster running Windows XP with NVidia Quadro FX 3000G graphics cards as the graphics-rendering engine. Christie Digital Mirage S+2k DLP projectors are used with StereoGraphics active stereo glasses. Head and wand tracking is supported by the Intersense IS-900 system. For virtual worlds designed for this system, it is a fully immersive room in which the individual literally walks into the world, is surrounded by the display and is capable of interacting with virtual objects in the world. We have developed a post-processing package that reads the compressed RAFLES model output fields and converts a specified sub-domain (in space and time) into binary AMIRA data scripts. These scripts specify data objects according to data type. We used the native stacked-slice coordinate to fit the model's coordinates, which are horizontally regular and have a specified stretching in the vertical direction. Atmospheric scalars were written as density fields, and wind as a vector field (the vertical component of the wind was also duplicated as a density field). Leaf density of the canopy was incorporated as a density field and surface patch. The tree stems are line/cylinder objects. Dispersing particles (simulating seeds) were viewed as line objects. Once the data is loaded into the program, a rich GUI provides many techniques and approaches for visualizing and exploring the data. AMIRA was selected as an interface because its VR interface was already developed for the DiVE system, and because it provides a flexible and rich GUI for visualization. AMIRA comes with many visualisation techniques which can be easily applied to the data. Of particular use for this data exploration are: orthogonal and oblique slicing; volume and surface rendering; isolines and isosurfaces; vector fields; stream line rakes and illuminated stream lines. Head tracking allows the user to move their head and then see the proper view for the new position and orientation. This, coupled with the active stereo, allows the user to gain a better perception of the size and structure of 3D objects. A virtual wand is useful for multi-user sessions, in that the tracked user can point out regions of interest to the other users. By pressing a button on the wand the user can move and slice planes or stream line sources, and rotate and reposition the domain anywhere in the space. This allows the user to see relevant aspects of the model from other angles and other positions. Finally, a tablet-style display is mounted to a swing arm, which can be moved inside the DiVE. By using a stylus, the user is able to interact with the AMIRA GUI directly while in the DiVE. With the tablet, the user can toggle what is displayed, change visualisation settings, add new data files, or load new models. Both the wand and the stylus-tablet interfaces were routinely used throughout data exploration; they increased the flexibility of data visualization and allowed the users the freedom to immediately view new variables and angles and test ideas about the meaning of the data.
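The post-processing step can be sketched schematically. The RAFLES output format, variable names, and the exact AMIRA script syntax are specific to the authors' pipeline and are not reproduced here; the snippet below only illustrates the generic operations described in the text (cutting a space-time sub-domain and attaching a stretched vertical coordinate to each horizontal slice), assuming the fields are already available as NumPy arrays and that the stretching law shown is purely illustrative.

```python
import numpy as np

def stretched_levels(nz, dz0=0.75, stretch=1.04):
    """Vertically stretched grid levels: each layer is 'stretch' times thicker
    than the one below (an assumed stretching law, for illustration only)."""
    dz = dz0 * stretch ** np.arange(nz)
    return np.concatenate(([0.0], np.cumsum(dz)))[:-1]

def extract_subdomain(field, t_slice, x_slice, y_slice, z_slice):
    """Cut a (time, x, y, z) sub-domain out of a full model field."""
    return field[t_slice, x_slice, y_slice, z_slice]

# Hypothetical example: one scalar with 9 output times on an 85x85x33 grid.
q = np.random.rand(9, 85, 85, 33).astype(np.float32)
zlev = stretched_levels(33)
sub = extract_subdomain(q, slice(0, 9), slice(20, 60), slice(20, 60), slice(0, 20))

# Each horizontal slice keeps its physical height, so a "stacked slices"
# representation (regular in x/y, stretched in z) can be written per time step.
for k, z in enumerate(zlev[: sub.shape[3]]):
    slab = sub[0, :, :, k]  # first time step, level k; hand slab and z to the writer
print(sub.shape, zlev[:5])
```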
4 Results Visualisation and VR Experience. The major advantage of VR data exploration was initially the ability to view and communicate data of different types (canopy, wind and scalar fluxes) in the same space and time. The VR environment allowed walking through the data, tilting and moving the point of view, and observing the relationships between different variables from many angles. Conventional visualisation of LES results is performed using a 2-D vertical cross-sectional plane, with arrows for wind vectors and colour for scalars. As we suspected that one of the major effects of the canopy's leaf density structure on the wind flow would be a modification to the vertical wind, we treated the vertical component of the wind as an additional scalar (fig. 1.A).
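For reference, the "conventional" LES view described above (a vertical cross-section with wind arrows and a colour field for the vertical velocity, as in fig. 1.A) can be reproduced with a few lines of matplotlib; the synthetic fields below are placeholders standing in for the model output.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder cross-section: x along the domain, z the vertical.
x = np.linspace(0.0, 1000.0, 80)          # m
z = np.linspace(0.0, 120.0, 40)           # m
X, Z = np.meshgrid(x, z)
u = 3.0 + 0.02 * Z                        # streamwise wind (toy profile)
w = 0.5 * np.sin(2 * np.pi * X / 400.0) * np.exp(-((Z - 30.0) / 40.0) ** 2)

fig, ax = plt.subplots(figsize=(8, 3))
pc = ax.pcolormesh(X, Z, w, cmap="RdBu_r", shading="auto")   # colour = w
ax.quiver(X[::3, ::4], Z[::3, ::4], u[::3, ::4], w[::3, ::4], scale=200)
fig.colorbar(pc, ax=ax, label="w (m/s)")
ax.set_xlabel("x (m)")
ax.set_ylabel("z (m)")
plt.show()
```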
Fig. 1. A. “Conventional” visualization. Data presented on a vertical cross section through the centre of the domain. Arrows show wind vectors, colour indicates the vertical component of the wind (w). The canopy top is visualized as a green surface. B. Humidity (blue colormap, side and top walls), w (physics colormap, back wall) and streamlines through the canopy. Stems illustrated as brown lines. C. “Cloud” visualization of momentum ejections and sweeps. Red and blue “clouds” are isosurfaces where w is 0.8 and -0.8 m/s, respectively. The thin “legs” that connect the “cloud” to the canopy top tend to penetrate the canopy where it is relatively low. D. A view from under the canopy. Canopy leaf density is projected using a white (low)-green (high) colormap on a horizontal cross-section through the mean canopy height. Ejection and sweep isosurface “clouds” are semi-transparent point surfaces. In this snapshot, ejections aggregate in areas where the canopy is lower (leaf area = 0, white). E. Viewed from above, humidity shown using a white (moist)-blue (dry) colormap on a horizontal cross-section just below the highest trees in the forest (green “islands”). During ejection events humid air from inside the canopy rises and mixes with the drier air in the atmosphere above the canopy. F. Seed dispersal from an ejection hot-spot. A puff of seeds (yellow spheres) dispersing above the canopy. w is projected on a vertical cross-section (white: negative, red: positive).
The flexibility of AMIRA in specifying a frame and mesh for each variable meant that, despite the staggered-grid setup of the model and the stretched vertical coordinates, variables could be correctly collocated in space without interpolation to a common grid. By using the meteorologically conventional visualisation (vector arrows, physics colormap for vertical velocity) with intuitive colormaps for additional scalars and canopy variables (blue-white for water vapour, green for leaf density and canopy, brown for stems), the data became intuitive both for viewers accustomed to the real forest (from the ecology discipline) and for meteorologists. An unexpected, but perhaps the largest, advantage was the fact that VR is exciting and new and therefore drew many viewers who would not normally have found the time to participate in the preliminary stages of raw data analysis. Initial excitement about the possibilities of volumetric visualisations led to “loaded” visualisations showing many variables but giving little benefit in terms of additional insight into the canopy-atmosphere interactions. Of particular disappointment were streamlines and animated streamlines. These gave a very sparse sample of the volume and therefore, without initial knowledge about the preferred location for their initiation, provided very little additional information. Streamlines proved to be non-intuitive for non-meteorologist viewers, because they do not illustrate the actual path of a virtual object in the flow, but the potential path, assuming a frozen snapshot in time. The meaning of an animation of streamlines was therefore rather confusing for researchers who do not regularly work with streamlines (fig 1.B). One such “opportunistic” session with a cloud meteorologist and an empirical forest micrometeorologist led to a breakthrough in the conceptual understanding of the effects of the canopy. Cloud visualisations often use wind isosurfaces to visualise updraft and downdraft channels within a cloud system. Forest micrometeorologists often analyse time series taken from point observations in and above the forest canopy. Typical dynamic micrometeorological features of the flow are ejection and sweep events. These are rapid transitions toward a strong updraft or downdraft, correlated with a strong scalar flux and exchange of air between the canopy volume and the atmospheric surface layer above it. Using “cloud” visualisation (two isosurfaces at prescribed negative and positive thresholds for vertical velocity) to illustrate ejections and sweeps allowed us to locate these flow features and focused our attention on the fact that they tend to concentrate at locations where the canopy is relatively low and surrounded by tall trees (fig. 1.C). Subsequent sessions with remote sensing experts led to more informative, yet more abstract, views. These views emphasized the canopy-influenced bias in the distribution of ejection events (fig. 1.D-E).
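The "cloud" view of ejections and sweeps reduces to extracting two isosurfaces of the vertical velocity field. A minimal sketch using scikit-image's marching cubes is shown below; the ±0.8 m/s thresholds are those quoted for fig. 1.C, the velocity field is a synthetic placeholder, and in the actual workflow AMIRA performed the isosurfacing inside the DiVE.

```python
import numpy as np
from skimage import measure

# Placeholder vertical-velocity volume on an (x, y, z) grid.
rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.5, size=(64, 64, 32))

def ejection_sweep_surfaces(w, threshold=0.8, spacing=(1.0, 1.0, 1.0)):
    """Isosurfaces at +threshold (ejections, updrafts) and -threshold
    (sweeps, downdrafts), returned as (vertices, faces) pairs."""
    ej_verts, ej_faces, _, _ = measure.marching_cubes(w, level=threshold,
                                                      spacing=spacing)
    sw_verts, sw_faces, _, _ = measure.marching_cubes(w, level=-threshold,
                                                      spacing=spacing)
    return (ej_verts, ej_faces), (sw_verts, sw_faces)

(ej_v, ej_f), (sw_v, sw_f) = ejection_sweep_surfaces(w)
print(len(ej_v), "ejection vertices;", len(sw_v), "sweep vertices")
```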
Scientific Results from VR Experience. Thanks to comments from many collaborating researchers in several disciplines during joint data exploration sessions in the DiVE, a clear picture of the effects of canopy structure began to emerge. Given a good hypothesis, it was now possible to use statistical analysis to test these effects.
The 30-minute time averages (denoted by a hat) of perturbations from the horizontal domain averages (denoted by angle brackets) are calculated and used to evaluate the mean fluxes at each point along a horizontal cross-section through the domain, i.e.

$\widehat{w'\varphi'} = \widehat{\left( w - \langle w \rangle \right)\left( \varphi - \langle \varphi \rangle \right)}$

where $w$ is the vertical component of the wind, $\varphi$ is any scalar, and a prime marks a perturbation quantity. The vertical fluxes of momentum, heat and water vapour were calculated at each height in the model. Fluxes were split using
conditional analysis into updraft (w > 0) and downdraft (w < 0) fluxes, indicating ejection and sweep events respectively. A correlation coefficient for a second-order polynomial correlation between the fluxes and the canopy height was calculated. The significance of these correlations was tested with a randomisation test, by generating 1000 canopies with the same parameters but a random phase. This ensured that spurious correlations due to mode-locking of wave patterns between turbulence and the canopy shape would be identified as not significant. Surprisingly, correlations were significant at relatively high altitudes (up to 5 times the canopy height), indicating a strong and persistent effect of canopy structure. This effect is concentrated in “ejection-sweep” hot spots, where low drag due to low leaf density and strong surface gradients of temperature lead to increased ejection and sweep occurrence and affect the triggering of new sweep events aloft. This realisation led to the design of a virtual experiment on the effects of canopy structure on seed dispersal. Seeds dispersing from these ejection hot-spots experience higher probabilities of ejecting above the canopy and dispersing over long distances, thus affecting the population dynamics of some wind-dispersed forest species that typically grow in or near gaps in the canopy [7] (fig. 1.F).
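The statistical test described above can be sketched numerically. The snippet below computes the time-averaged perturbation flux at each column, splits it into updraft and downdraft contributions, fits a second-order polynomial against canopy height and reports an R², and turns a set of surrogate R² values (from random-phase canopies of the kind V-CaGe generates) into a one-sided p-value. All array shapes and the synthetic inputs are illustrative.

```python
import numpy as np

def conditional_fluxes(w, phi):
    """w, phi: arrays of shape (time, x, y) at one model level.
    Returns the time-averaged perturbation flux per column and its split into
    updraft (w > 0, ejection-type) and downdraft (w < 0, sweep-type) parts,
    following the decomposition given in the text."""
    wp = w - w.mean(axis=(1, 2), keepdims=True)      # perturbation from domain mean
    pp = phi - phi.mean(axis=(1, 2), keepdims=True)
    prod = wp * pp
    total = prod.mean(axis=0)
    up = np.where(w > 0, prod, 0.0).mean(axis=0)
    down = np.where(w < 0, prod, 0.0).mean(axis=0)
    return total, up, down

def poly2_r2(canopy_height, flux):
    """R^2 of a second-order polynomial fit of flux against canopy height."""
    h, f = canopy_height.ravel(), flux.ravel()
    coeffs = np.polyfit(h, f, 2)
    resid = f - np.polyval(coeffs, h)
    return 1.0 - resid.var() / f.var()

def randomisation_p_value(observed_r2, surrogate_r2s):
    """One-sided p-value from R^2 values computed with random-phase canopies."""
    surrogate_r2s = np.asarray(surrogate_r2s)
    return (np.sum(surrogate_r2s >= observed_r2) + 1) / (surrogate_r2s.size + 1)

# Toy usage with synthetic fields (real inputs come from the RAFLES output).
rng = np.random.default_rng(2)
w = rng.normal(size=(60, 32, 32))
q = rng.normal(size=(60, 32, 32))
h = 15.0 + 3.0 * rng.random((32, 32))
_, up, _ = conditional_fluxes(w, q)
print(round(poly2_r2(h, up), 3))
```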
5 Conclusions The DiVE facility has proven an invaluable resource for bringing the complex results from a dynamic, high-resolution atmosphere-biosphere large eddy simulation of forest environments into a single framework that was both intuitive and interactive. Displaying the canopy and atmospheric data in conventional and intuitive ways allowed forest ecologists and meteorologists to start the virtual tour in a “familiar” environment. While analysing the data using VR we learnt that some display schemes may prove non-intuitive even in the immersive experience. For example, streamlines were not helpful for identifying patterns and were rather confusing for those not accustomed to their physical meaning. Importantly, the fully immersive animated VR has a strong visual appeal, which attracted viewers who otherwise would not have participated in data exploration at early stages. These interactions with specialists from related fields across disciplines brought interesting insights that allowed us to improve the display and eventually led to new understandings of the simulation results. The ability to show multi-variable, time-dynamic 3-D data in an intuitive immersed environment allowed communication and exchange of ideas about effects that became apparent in the data and the mechanisms that led to these effects. The wand and stylus interfaces allowed changing the visualization and variables and focusing the viewers' attention on different scales and sub-domains. This led to highly productive work sessions for collaborative data exploration that would otherwise have been very difficult on a non-VR computer screen, even with the same AMIRA visualization; this proved the major attraction of this application in VR and the reason for repeated visits to the data in the DiVE, which progressed to an improved understanding of the data. Finally, the intuitiveness, flexibility and possibilities for collaboration in the VR interface facilitated initial data exploration, where the size and complexity of the data made it hard to determine the effective patterns of interactions between canopy and
atmosphere without a priori knowledge of these effects. The joint cross-disciplinary data exploration in the VR facility was fundamental for establishing hypotheses on the interaction between the canopy and the surrounding atmosphere, which we were able to define and test statistically. These hypotheses led us to design several new experiments, ultimately showing the bi-directional interaction between the atmosphere and the biosphere at small scales, with important outcomes for ecology and meteorology, such as the effect of canopy gaps on seed dispersal.
Acknowledgements We wish to thank Roni Avissar for supervision and support during the project, Steve Feller for technical support in the DiVE, and Gabriel Katul, Martin Otte, Mathieu Therezien, Ran Nathan, Joe Wright, Helen Muller-Landau, Robert Walko, Christoph Thomas, Robert Jackson, Anna Barros, Gregory Asner, Michael Keller, Heidi Holder, Bill Eichinger, Uwe Ohler and Sayan Mukherjee for collaborative data exploration in the DiVE. The DiVE facility is funded by NSF grant # MRI-0420632. GB was supported in part by NSF grant # DEB-0453665 and The John & Elaine French Fellowship of the Harvard University Center for the Environment. ML is supported by CNPq-Brazil, grant # 200686/2005-4.
References
1. Metzger, N., Zare, R.N.: Science policy - Interdisciplinary research: From belief to reality. Science 283(5402), 642–643 (1999)
2. Riva, G.: Applications of virtual environments in medicine. Methods Inf. Med. 42(5), 524–534 (2003)
3. Desmeulles, G., et al.: The virtual reality applied to biology understanding: The in virtuo experimentation. Expert Syst. Appl. 30(1), 82–92 (2006)
4. Ausburn, L.J., Ausburn, F.B.: Desktop virtual reality: A powerful new technology for teaching and research in industrial teacher education. Journal of Industrial Teacher Education 41(4), 33–58 (2004)
5. Smith, S.S., et al.: Experiences in using virtual reality in design and graphics classrooms. Int. J. Eng. Educ. 23(6), 1192–1198 (2007)
6. Bohrer, G.: Large eddy simulations of forest canopies for determination of biological dispersal by wind. Department of Civil and Environmental Engineering, p. 150, Duke University, Durham, NC (2007)
7. Bohrer, G., et al.: Effects of canopy heterogeneity, seed abscission, and inertia on wind-driven dispersal kernels of tree seeds. J. Ecol. 96, 569–580 (2008)
8. Dupont, S., Brunet, Y.: Simulation of turbulent flow in an urban forested park damaged by a windstorm. Bound. Layer Meteor. 120(1), 133–161 (2006)
9. Dupont, S., Brunet, Y.: Edge flow and canopy structure: A large-eddy simulation study. Bound. Layer Meteor. 126(1), 51–71 (2008)
10. Yue, W., et al.: Large-eddy simulations of plant canopy flows using plant-scale representation. Bound. Layer Meteor. 124(2), 183–203 (2007)
11. Yue, W.S., et al.: A comparative quadrant analysis of turbulence in a plant canopy. Water Resour. Res. 43(5) (2007)
12. Yue, W., et al.: Turbulent kinetic energy budgets in a model canopy: comparisons between LES and wind-tunnel experiments. Environ. Fluid Mech. 8, 73–95 (2008)
13. Cruz-Neira, C., et al.: The CAVE - audio-visual experience automatic virtual environment. Commun. ACM 35(6), 64–72 (1992)
14. Visage Imaging: AMIRA (2008) (cited July 7, 2008), http://www.amiravis.com
15. Bergeron, T.: Synoptic meteorology - an historical review. Pure Appl. Geophys. 119(3), 443–473 (1981)
16. Hibbard, W.L., Santek, D.A.: The VIS-5D system for easy interactive visualization. In: Proceedings of Visualization 1990. IEEE CS Press, Los Alamitos (1990)
17. Hibbard, W.L., et al.: Interactive visualization of earth and space science computations. Computer 27(7), 65–72 (1994)
18. Mathematics and Computer Science Division, Argonne National Laboratory: Cave5D Release 2.0 (2007) (cited June 30, 2008), http://www-unix.mcs.anl.gov/~mickelso/CAVE2.0.html
19. Johnson, S.G., Edwards, J.: Vis5d+ (2001) (cited June 30, 2008), http://vis5d.sourceforge.net/
20. Desert Research Institute: CAVCaM (2008) (cited July 9, 2008), http://www.cavcam.dri.edu/
21. Roswintiarti, O., Raman, S.: Three-dimensional simulations of the mean air transport during the 1997 forest fires in Kalimantan, Indonesia using a mesoscale numerical model. Pure Appl. Geophys. 160(1-2), 429–438 (2003)
22. Magsig, M.A., Snow, J.T.: Long-distance debris transport by tornadic thunderstorms. Part I: The 7 May 1995 supercell thunderstorm. Mon. Weather Rev. 126(6), 1430–1449 (1998)
23. Bramer, D.J., et al.: Linking interactive concept models into the Visual Geophysical Exploration Environment (VGEE). In: 13th Symposium on Education. American Meteorological Society, Seattle (2004)
24. Semeraro, D., et al.: Collaboration, analysis, and visualization of the future. In: 20th International Conference on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology. American Meteorological Society, Seattle (2004)
25. Loth, E., et al.: A virtual reality technique for multi-phase flows. Int. J. Comput. Fluid Dyn. 18(3), 265–275 (2004)
26. Nichol, J., Wong, M.S.: Modeling urban environmental quality in a tropical city. Landsc. Urban Plan. 73(1), 49–58 (2005)
27. Sen, S.I., Day, A.M.: Modelling trees and their interaction with the environment: A survey. Comput. Graph.-UK 29(5), 805–817 (2005)
28. Pretzsch, H., et al.: Models for forest ecosystem management: A European perspective. Ann. Bot. 101(8), 1065–1087 (2008)
29. Van Haevre, W., Bekaert, P.: A simple but effective algorithm to model the competition of virtual plants for light and space. J. WSCG 11(3), 464–471 (2003)
30. Deussen, O., et al.: Interactive visualization of complex plant ecosystems. In: Proceedings of Visualization. IEEE, Los Alamitos (2002)
31. Seifert, S.: Visualisierung von Waldlandschaften. Allgemeine Forstzeitschrift/Der Wald 61, 1170–1171 (2006)
32. Pielke, R.A., et al.: A comprehensive meteorological modeling system - RAMS. Meteorol. Atmos. Phys. 49(1-4), 69–91 (1992)
33. Deardorff, J.W.: Stratocumulus-capped mixed layers derived from a 3-dimensional model. Bound. Layer Meteor. 18(4), 495–527 (1980)
34. Bhushan, S., Warsi, Z.U.A.: Large eddy simulation of turbulent channel flow using an algebraic model. Int. J. Numer. Methods Fluids 49(5), 489–519 (2005)
35. Palace, M., et al.: Amazon forest structure from IKONOS satellite data and the automated characterization of forest canopy properties. Biotropica 40(2), 141–150 (2008)
36. Bohrer, G., et al.: A Virtual Canopy Generator (V-CaGe) for modeling complex heterogeneous forest canopies at high resolution. Tellus 59(3), 566–576 (2007)
37. Medvigy, D., et al.: Mass conservation and atmospheric dynamics in the Regional Atmospheric Modeling System (RAMS). Environ. Fluid Mech. 5(1-2), 109–134 (2005)
User Experience of Hurricane Visualization in an Immersive 3D Environment
J. Sanyal, P. Amburn, S. Zhang, J. Dyer, P.J. Fitzpatrick, and R.J. Moorhead
GeoResources Institute, Mississippi State University
{jibo,amburn,szhang}@gri.msstate.edu, [email protected], {fitz,rjm}@gri.msstate.edu
Abstract. Numerical models such as the Mesoscale Model 5 (MM5) or the Weather Research and Forecasting Model (WRF) are used by meteorologists in the prediction and the study of hurricanes. The outputs from such models vary greatly depending on the model, the initialization conditions, the simulation resolution and the computational resources available. The overwhelming amount of data that is generated can become very difficult to comprehend using traditional 2D visualization techniques. We studied the presentation of such data as well as methods to compare multiple model run outputs using 3D visualization techniques in an immersive virtual environment. We also relate the experiences and opinions of two meteorologists using our system. The datasets used in our study are outputs from two separate MM5 simulation runs of Hurricane Lili (2002) and a WRF simulation run of Hurricane Isabel (2003).
1 Introduction Over the last decade, simulation studies of weather phenomena have become more common due to the availability of large and powerful computers. These simulations rely on models that solve complicated numerical equations to mimic the atmospheric dynamics. The accuracy of these computationally intensive models varies depending on the choice of the geographic extent and the resolution of the simulation grid. The initial conditions also play a crucial role in the accuracy of the results. In spite of these constraints, the amount of output data generated is extremely large and the post-processing and analysis tasks take up a substantial amount of time. Most meteorologists resort to statistical and visual analysis tools to gain insight into the data. The visualization tools that are commonly used typically generate 2D slices through the 4D (latitude, longitude, altitude, time) simulation grid [10, 12]. An ensemble of such 2D slices, generated over a set of time-steps, can be very difficult to comprehend and requires a lot of time, expertise, and experience to interpret. This holds true for observed data from satellite products as well. We believe that 3D immersive visualization can substantially help in the perception, evaluation and understanding of such datasets. The physical size of such environments, the interaction mechanisms, and the use of time-dependent immersive stereo may equip a researcher with much better tools to investigate these datasets. The large-scale destruction brought about by hurricanes in recent years has sparked an increasing interest in understanding them and predicting their behavior. In this
paper, we present our study of Hurricane Lili (2002) [1] and Hurricane Isabel (2003) [2]. Hurricane Lili was unique in that it dramatically weakened from a category 4 hurricane to a category 1 in a period of just 13 hours. This weakening has been attributed to a shaft of dry air moving into the hurricane core from the south-west [3]. Hurricane Isabel was chosen as it was one of the most closely observed hurricanes and a number of simulation and visualization experiments have been performed on it. Authors Sanyal, Amburn, Zhang and Moorhead developed 3D immersive visualizations which depict the shaft of dry air entering Hurricane Lili and weakening it. We also created visualizations of other variables of interest and explored techniques that helped hurricane experts compare two model run outputs. Visualizations of the structure and progression of Hurricane Isabel were also created. Authors Fitzpatrick and Dyer, who are meteorologists, analyzed the visualizations and evaluated the hypothesis about the effectiveness of virtual environments in the study of hurricanes. Dr. Fitzpatrick, a hurricane expert, participated in two one-hour sessions. Dr. Dyer, a precipitation meteorologist, participated in a 1½-hour session. Both of them found the immersive virtual environment more effective in the study of hurricanes.
2 Background 2.1 Synoptic History of Hurricane Lili Lili originated on 16th September 2002, and crossed mainland Cuba on October 1st with wind speeds of over 90 knots, gradually turning northward [1]. On the 3rd, it exhibited sustained wind speeds of 125 knots (category 4) over the Gulf of Mexico, then dramatically weakened to maximum wind speeds of 80 knots (category 1) before making landfall in Louisiana. The dramatic weakening in the 13-hour period on the 3rd of October has been of considerable research interest. The best track for Hurricane Lili is illustrated in Figure 1.
Fig. 1. Best track positions for Hurricane Lili (left) and Hurricane Isabel (right), as provided by the National Hurricane Center [1, 2]
2.2 Synoptic History of Hurricane Isabel Isabel formed on September 1, 2003 [2] and became a category 5 hurricane on the 11th with sustained wind speeds of over 145 knots. These winds stayed in the 130–140 knot range until September 15th. During this time the eye exhibited a diameter of 35–45 nautical miles. Vertical wind shear caused Isabel to gradually weaken and it made landfall near Drum Inlet, North Carolina, USA on the 18th as a category 2 hurricane. Hurricane Isabel was the most intense hurricane during the 2003 Atlantic hurricane season. The best track of Hurricane Isabel is illustrated in Figure 1. 2.3 Visualization of Hurricanes A number of visualization techniques have been used in the study of hurricanes. The Software Integration and Visualization Office (SIVO) at NASA’s Goddard Space Flight Center (GSFC) specializes in creating animations and educational videos of hurricanes [4]. SIVO has created high-resolution visualizations that show the formation and evolution of ‘hot-towers’ within hurricanes [5]. The IEEE 2004 Visualization Conference Contest [6] aimed to foster innovative visualization and interaction techniques to study multiple characteristics of data. User interaction with the dataset was a key component in the contest. A Weather Research and Forecasting Model (WRF) [12] simulation of Hurricane Isabel was used. Doleisch, Muigg and Hauser [6] used interactive brushing of scatter plots and histograms to select regions of interest which were then used to selectively visualize specific features in the dataset. Johnson and Burns [6] used techniques to visualize multiple variables together. They used a particle system to represent the cloud structure while modifying the saturation and transparency to reflect properties of the data. Jiang et al. [6] successfully integrated existing techniques to create informative visualizations of the Isabel dataset. They experimented with several vortex detection techniques and found them to be inadequate in identifying the eye of the hurricane. Schulze and Fosberg [6] have illustrated how multiple visualizations, potentially from multiple model runs, can be placed side-by-side for comparative analysis. More recently, Elboth and Helgeland [7] have shown how anisotropic diffusion and volume rendering can be used in the study of hurricanes. Hurricanes often cause severe storm surges and flooding in coastal areas. Zhang et al. [8] developed a 3D rendering and animation system to visualize such effects of hurricanes. Comparing and contrasting between simulation runs and actual observations is an important and challenging problem. Benger et al. [9] created visualizations that highlight differences between computer simulations and observed data from satellites using streamlines, color and transparency. Our focus is on helping experts in the interpretation of their simulation results. This necessitated experimenting with various visualization techniques to display the differences between multiple model runs and to present the outputs in the same display space. We also focused on visualizing variables of interest that might help researchers gain a better insight.
3 Simulation Models and Data Sources We used data from simulation runs of both Lili and Isabel. The following is a brief description of the models used to generate the data and the datasets themselves. 3.1 Data for Hurricane Lili We used data from the Four-Dimensional Variational Analysis (4DVAR) sensitivity run outputs of Zhang, Xiao and Fitzpatrick [3]. They used the fifth-generation Penn State University – National Center for Atmospheric Research Mesoscale Model (MM5) [10] to simulate the weakening of Hurricane Lili. Two of their simulation runs used the Bogus Data Assimilation (BDA) [11] technique to specify a vortex in the simulation model. While the first simulation ran without modifications and was called the BDAC run, the second simulation used satellite data (from NASA's Terra, Aqua, QuickSCAT and TRMM, as well as NOAA's AVHRR sensor aboard GOES) for model initialization, and was called the SATC run. We used data from both simulation runs. The numerical model used by Zhang, Xiao and Fitzpatrick [3] used a triple-nested grid with Domain A, Domain B and Domain C. We used outputs from Domain B, which had a resolution of 27 km, dimensions of 85x85x33, and consisted of 9 time-steps in the 13-hour window starting at 2002:10:03 00:00 UTC. Domain A was a coarser grid with a resolution of 81 km and dimensions of 48x50x33, while Domain C was a much finer, moving grid with a resolution of 9 km and dimensions of 142x142x33. The variables of interest were water-vapor mixing ratio (Q), temperature (T), pressure perturbation (pp), cloud-water mixing ratio (clw), rain-water mixing ratio (rnw) and the wind velocity components (u, v and w). 3.2 Data for Hurricane Isabel For Isabel, we used the dataset provided for the 2004 IEEE Visualization Conference Contest. The dataset was a grid of 500x500x100 across 48 time-steps. The variables of interest were the moisture mixing ratios of cloud (qcloud), graupel (qgraup), which consists of solid ice precipitation in the form of hail, sleet and snow pellets, ice (qice), snow (qsnow), and water vapor (qvapor). There were two more variables, one accounting for the total cloud moisture mixing ratio (cloud = qcloud + qice) and another for the total precipitation (precip = qgraup + qrain + qsnow). The pressure (p), temperature (tc) and wind velocity components (u, v and w) were also of considerable interest.
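The derived quantities listed above are simple sums of the stored mixing ratios, as the quick sketch below shows. The array names follow the variable names in the text; loading of the raw contest files is not shown, and the placeholder grid is deliberately small.

```python
import numpy as np

# Placeholder arrays standing in for one Isabel time step; the real grid is
# 500 x 500 x 100, reduced here only to keep the example light.
shape = (50, 50, 10)
qcloud, qice, qgraup, qrain, qsnow, u, v = (
    np.zeros(shape, dtype=np.float32) for _ in range(7))

cloud = qcloud + qice                 # total cloud moisture mixing ratio
precip = qgraup + qrain + qsnow       # total precipitation mixing ratio
horizontal_speed = np.hypot(u, v)     # magnitude later used to scale wind glyphs
```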
4 The Visualization Environment Our room-size visualization environment at Mississippi State University is called the Virtual Environment for Real-Time EXploration (VERTEX). It is a 3D immersive environment composed of three back-projected walls and a front-projected floor and is located at the university's High Performance Computing facility (Fig 2). The dimensions of the enclosed space are 10'x10'x7.5'. The right wall can be opened to align with the front wall. Each wall and the floor are driven at a resolution of 1400x1250 pixels.
Fig. 2. Hurricane Lili in the VERTEX. The floor, left wall and the center wall are shown in the photograph. (A color version of this document can be found at http://www.ngi.msstate.edu/projects/Visualization/publications.php)
The VERTEX is driven by four IBM Intellistation Z Pro workstations, each with a 3.06 GHz Intel Xeon processor, 2 GB of RAM and an NVIDIA Quadro FX3000G graphics card with 256 MB of video memory. Sixteen nodes serve as backend compute nodes. Each of these nodes is an IBM Intellistation M Pro workstation with a 3.2 GHz Intel Pentium IV processor, 2 GB of RAM and an NVIDIA GeForce FX5950 graphics card with 256 MB of video memory. The system runs SUSE Linux. Apart from these nodes, a separate dedicated workstation runs Windows XP and drives Google Earth and ESRI ArcGIS applications in the VERTEX. A separate workstation is used as the sound server. All inter-node communication is achieved by a gigabit Ethernet network using dedicated links and crossover cables. A user’s head and the wand is tracked with 6DOF using an acoustic tracking system. The wand provides five buttons and an analog joystick for user input.
5 Methods We used data from two MM5 simulation runs for Lili (BDAC and SATC runs), and one WRF simulation run for Isabel. The Isabel dataset was originally a grid of 500x500x100 across 48 time-steps. Running our visualizations at these resolutions slowed down the system dramatically resulting in almost complete loss of interactivity, so we were compelled to down-sample the data to a grid of 250x250x100 using a box filter. We did not sub-sample in time but rather used only the first 20 time-steps to create our visualizations. The Lili dataset, however, could be used without a downsampling step. We used a tool called vGeo [14] from Mechdyne Corporation in creating our visualizations. In the text that follows, “we” and “us” will usually refer to Sanyal, Amburn, Zhang and Moorhead. Dr. Fitzpatrick told us that the dry air mass influencing Hurricane Lili was lowlying and expressed his interest in visualizing the clouds and winds between 0-2km above the earth’s surface, and between 0-700mb of pressure, since this could reveal 2
Color version of the document can be found at: http://www.ngi.msstate.edu/projects/ Visualization/publications.php
872
J. Sanyal et al.
the dry air mass. We focused on the low altitude levels of the variables from the dataset to identify the air mass. We visualized the scalar variables by constructing isosurfaces and used transparency to combine multiple model outputs by modifying the alpha value for one set of isosurfaces to distinguish between the SATC and BDAC model runs (e.g. Fig 3, 4). We used individual colors as well as color ramps as a function of a variable to color the isosurfaces. Most of the scalar variables from both the datasets were visualized using these methods (e.g. Fig 3, 4). It is essential to have an understanding of the distribution of data values for constructing isosurfaces. We constructed histograms of the distribution of values for Hurricane Lili to get a better understanding of the variables (e.g. Fig 6). We also created histograms for the different variables of Isabel, which helped us to choose better isosurface values. Wind vectors were calculated from the orthogonal wind components u, v and w and plotted using arrow glyphs and a rainbow color map. The length, size and color of the arrow glyphs served as visual cues and were proportional to the magnitude of the horizontal wind speed (Fig 3a). The resulting wind fields matched up well with other variables such as the pressure perturbation and cloud-water mixing ratio (Fig 4d). We were interested in observing particle advection in the wind vector field within the hurricane system. We seeded 714 particles uniformly in the wind field and observed their advection tracks (Fig 4d, 5a). This was done for the Lili dataset only since the creation of particle tracks took over two hours which made us feel that these calculations would take substantially more time with the denser Isabel dataset. Separate colormaps were used to visualize the resulting pathlines from the two simulation runs. A mild gradient within the colormaps was used as a function of seeding altitude to distinguish between pathlines originating at various heights within the grid (Fig 5b). Geographic context was added to both datasets by using a Digital Elevation Model (DEM). While the DEM for Lili was textured with a topographical map of the region, the DEM for Isabel was colored with a topographical colormap (Fig 3a, b). Enabling geographic context was essential due to the interactive nature of the virtual environment as we easily lost our bearings without it. Based on feedback from our hurricane expert, we took the difference between scalar values of the BDAC and the SATC simulation runs of Lili and mapped them to a red-to-transparent-to-green colormap (Fig 3d, 4c). The choice of this colormap was based on the opponent process color theory [13]. Red and green are on opposite ends of the visual channel ensuring maximum visual separation between the model runs. Orthogonal slices through the resulting grid helped us create a ‘difference-volume’. Regions having the most dissimilar values would show up as the strongest red or green while regions having similar values would stay transparent. Having created the necessary framework, we invited the domain experts to evaluate our visualizations and test our hypothesis about the effectiveness of virtual environments.
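The 'difference-volume' comparison can be sketched with NumPy and matplotlib: subtract the two runs on a common grid and map the result onto a diverging red-to-transparent-to-green colormap so that agreement fades out and disagreement stands out. The colormap construction below is one possible realisation of the scheme described in the text; in practice the volume was built and rendered with vGeo rather than with this code.

```python
import numpy as np
from matplotlib.colors import LinearSegmentedColormap, Normalize

def difference_volume(field_bdac, field_satc):
    """Signed difference between the two model runs on a common grid."""
    return field_satc - field_bdac

# Red -> transparent -> green, with alpha falling to zero where the runs agree.
red_green = LinearSegmentedColormap.from_list(
    "diff", [(1, 0, 0, 1), (1, 0, 0, 0), (0, 1, 0, 0), (0, 1, 0, 1)], N=256)

def colour_slice(diff_slice, vmax=None):
    """Map one orthogonal slice of the difference volume to RGBA colours."""
    if vmax is None:
        vmax = np.abs(diff_slice).max() or 1.0
    norm = Normalize(vmin=-vmax, vmax=vmax)
    return red_green(norm(diff_slice))

# Toy example: two slightly different scalar fields.
rng = np.random.default_rng(3)
bdac = rng.random((85, 85, 33))
satc = bdac + 0.1 * rng.standard_normal(bdac.shape)
rgba = colour_slice(difference_volume(bdac, satc)[:, :, 10])
print(rgba.shape)  # (85, 85, 4): RGB plus alpha that vanishes where runs agree
```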
6 Visualization Results and Expert Evaluation Our initial goal was to visualize the shaft of dry air entering and weakening Hurricane Lili. Using the water-vapor mixing ratio (q), we were able to visualize the mass of dry air (Fig 3a). Adding geographic context by using a shapefile of the United States, a
Digital Elevation Model (DEM) and a topographic overlay of terrain of the region revealed that the mass of dry air probably originated in the Sierra-Madre Oriental Mountains of Mexico (Fig 3a).
Fig. 3. a) Shaft of dry air originating in the Sierra Madre Mountains of Mexico visualized with wind vectors. b) Isosurfaces of clouds in Isabel. c) Isosurfaces of q for SATC run combined with transparent isosurfaces of BDAC run. d) Difference volume for q from multiple planes through the grid. Red region to the bottom-left indicates the major model differences (dry air).
We showed Dr. Fitzpatrick the Lili dataset in an initial one-hour session. It took him a while to adjust to the 3D environment since he was used to seeing 2D visualizations of data. He found the immersive stereo useful in comparing the BDAC and the SATC outputs (Fig 3c) and confirmed that the mass of dry air indeed originated in the Sierra-Madre Oriental Mountains of Mexico, and that the BDAC run completely missed this feature (Fig 3d). We identified the core of the hurricane by looking at the pressure distribution (Fig 4a). The hurricane's core is characterized by an intense low-pressure region. Dr. Fitzpatrick called it a “good” method of identifying the hurricane's eye. We combined the SATC and the BDAC outputs by keeping the BDAC isosurfaces slightly transparent so that they could be easily distinguished from each other. This was an effective method to compare model runs since both sets of isosurfaces used the same colormap. Using this technique we found a difference in the predicted position of the eye (Fig 4a). A time-series animation revealed differences in their trajectories as well. We had constructed the wind field vectors colored with a rainbow map (Fig 3a). He found the visualization rather dark and somewhat difficult to perceive. On the other hand, he found the visualization of multiple 2D color-slices presented with 3D isosurfaces and the corresponding time-series animation quite useful.
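Locating the eye from the pressure field, as described above, amounts to finding the minimum of the (perturbation) pressure in each time step; applying the same search to both runs gives the offset between their predicted eye positions. The sketch below uses a synthetic field standing in for the MM5 output, and the grid shapes are only illustrative.

```python
import numpy as np

def eye_position(pp, level=0):
    """Column indices (i, j) of the minimum pressure perturbation at one
    model level for every time step; pp has shape (time, x, y, z)."""
    slab = pp[:, :, :, level]
    flat = slab.reshape(slab.shape[0], -1).argmin(axis=1)
    return np.column_stack(np.unravel_index(flat, slab.shape[1:]))

# Toy example: two runs whose pressure lows sit at slightly different places.
rng = np.random.default_rng(4)
x = np.arange(85)
X, Y = np.meshgrid(x, x, indexing="ij")

def run(cx, cy):
    low = -np.exp(-((X - cx) ** 2 + (Y - cy) ** 2) / 200.0)
    return low[None, :, :, None] + 0.01 * rng.standard_normal((9, 85, 85, 1))

satc, bdac = run(40, 42), run(44, 38)
print(eye_position(satc) - eye_position(bdac))   # offset between predicted eyes
```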
Following his feedback, we did two things: we constructed histograms of the variables of interest, and we generated a ‘difference-volume’ of the variables to help us compare differences between the two model runs. Unlike those for other variables, the histogram of the temperature (t) variable had a rough saw-tooth shape which seemed very interesting to us (Fig 6). We conducted another one-hour evaluation session with him in which he explained that the shape of the histogram of the temperature field was due to the spiral bands in the hurricane. He found the difference-volume to be quite effective as it easily revealed the shaft of dry air to the south-west from the difference values of the water-vapor mixing ratio (q) (Fig 3d). Similarly, differences of the pressure perturbation (pp) revealed differences in the position of the hurricane's eye (Fig 4c). Following the two sessions with Dr. Fitzpatrick, we applied the visualization techniques we used for Hurricane Lili to Hurricane Isabel. Since we had only one simulation run, we could apply only a subset of the techniques used with Lili. Evaluation of these results became immediately necessary and we conducted a 1½-hour session with Dr. Dyer. We used a slightly different strategy with Dr. Dyer. We started out by showing him the visualizations of Hurricane Lili on a desktop computer. We showed him the visualization of the mass of dry air, which he found pretty effective (Fig 3a). He explained that adiabatic modifications in the lower atmosphere are of research interest. The shaft of air was quite dry at the surface so as it got sucked into the low-pressure core of the
Fig. 4. a) Isosurfaces of pressure perturbation pp for SATC run combined with transparent isosurfaces from BDAC run. b) Isosurfaces of pressure p in Isabel. c) Difference volume for pp. d) SATC run isosurfaces of cloud-water mixing ratio clw shown in yellow indicating rainclouds, fused with isosurfaces of rain-water mixing ratio rnw shown in shades of purple indicating intense rain. Particle advection tracks also shown.
storm, it lifted, expanded and cooled, causing the relative humidity to come down. As it moved into the hurricane, saturated ‘parcels’ of air surrounding it tried to equalize the imbalance, thereby losing energy and weakening the storm. What fascinated him was that Hurricane Lili as a whole did not weaken, but its core was taken out and that caused its rapid dissipation. He found the time-step animation showing the weakening of the hurricane very useful. In particular, he was interested in looking at what level the dry air gets pulled into the system and how high it goes before it changes to the next iso-level. This is a challenging problem. Hurricanes, by themselves, are synoptic-scale phenomena covering 100 to 1000 kilometers. This requires a synoptic-scale analysis, which gives an overall picture of the storm; however, it misses the details of the individual thunderstorms that make up the hurricane. To study such details, a mesoscale analysis covering about 100 kilometers is necessary, but such an analysis misses the overall picture. Dr. Dyer also indicated that using the pressure perturbation (pp) to identify the eye of a hurricane was a good method (Fig 4a). We showed him comparisons of isosurfaces from the SATC model run with semi-transparent isosurfaces of the BDAC run (Fig 3c, 4a). He found comparisons of the temperature (t) fields extremely interesting. The SATC run exhibited what we thought were ‘hot-towers’ in Lili (Fig 5a). Hot-towers are created by vortices near the eye-wall of a hurricane that spin upwards, creating an updraft of warm currents that sometimes extends well into the stratosphere. According to Dr. Dyer, model simulations should exhibit relatively constant temperature. He speculated that the unusual temperature distribution in the SATC run may not be a hot-tower but rather a simulation artifact. To avoid simulation artifacts, researchers often use what is called a ‘warm-start’ (as opposed to a ‘cold-start’), where the simulation is initialized a few hours prior to the desired time window so that the model stabilizes. We switched to the VERTEX after having spent considerable time at the desktop. This was done so that Dr. Dyer could evaluate whether the virtual environment was better. We looked at the same visualizations that we demonstrated on the desktop and he felt that the stereo immersion indeed helped. He found it to be “better than isomaps” and stated that he would like to have such a tool for his analysis tasks. He found the interactivity extremely useful, especially with tasks like “put me in the
Fig. 5. a) Particle advection tracks spiraling around what was initially thought to be a ‘hot-tower’ in Lili. b) Particle advection tracks in the wind field. The SATC advection tracks are mapped to a gradient of green colormap such that higher advection tracks have darker shades. BDAC advection tracks mapped to a gradient of purple colormap such that higher advection tracks have darker shades of blue.
Fig. 6. Histogram of the distribution of temperature values in the SATC run of Lili
NNW, looking over the Gulf to the SW”. He felt that motion-tracking helped, as one could walk towards or even squat and ‘look into’ a structure. The wind vector fields appeared very interesting to him (Fig 3a, 5a). Typically one would think that the winds are stronger in the front-right quadrant of the direction of the track of the storm. However, according to him, recent evidence suggests that the winds are strongest in the north-east quadrant irrespective of the direction of propagation. A hurricane does not really move but constantly redevelops. Its individual thunderstorms fuel the propagation, which moves the low-pressure core forward. The visualization of the wind vectors seemed to agree with this logic. In fact, visualizations of the cloud-water mixing ratio (clw) and the rain-water mixing ratio (rnw) revealed that most of the intense rain occurred in front of the hurricane (Fig 4d). So for most hurricanes, at the time of landfall, the bulk of the heavy rain has already fallen. He also found the particle advection tracks “extremely useful” in the way the pathlines depicted the particles spinning around the eye and spiraling upwards (Fig 5a). It even revealed to him a general tilt in the upward propagation, which made thermodynamic sense. He was particularly interested in observing particle interactions originating from different regions in the storm (Fig 4d, 5). However, he was skeptical of the model's capacity to resolve upper-layer atmospheric dynamics. The ‘difference-volume’ appeared cloudy to him, though he thought it might help in understanding model differences (Fig 3d, 4c). On the whole he felt that 3D immersive visualization can be much better than conventional 2D techniques and can be an effective tool to aid meteorologists in the study of hurricanes. In fact, presenting 2D slices along with 3D isosurfaces can educate users about the effectiveness of 3D.
7 Conclusion Simulating hurricane behavior using the various models is a challenging problem. A hurricane front is a system of developing and weakening thunderstorms. It is a low-pressure, hot-core system and exhibits dynamics and thermodynamics that these models do not capture completely. Through our visualizations we were able to identify the shaft of dry air originating in the Sierra-Madre Mountains of Mexico. This was in agreement with the findings of
Zhang, Xiao, and Fitzpatrick [3]. Being able to compare multiple model outputs is critical for real analysis, and our comparative visualization techniques were found to be effective in highlighting the differences. Evaluation of our visualization results seemed to suggest that using a 4D representation for tasks such as hurricane visualization is better and more powerful than using 2D representations. Two meteorologists provided first-hand accounts of the advantages of using immersive environments for such studies. Their evaluation also proved very useful in validating our visualizations. We feel that a more extensive user study is necessary to establish our initial findings. It appears that visualizations such as the ones we created and the use of immersive environments like the VERTEX can help hurricane modelers and scientists to compare, validate, and refine their simulation models, as well as gain a better insight into their data.
8 Future Work Our visualizations of Hurricane Lili used only the Domain B outputs. It will be interesting to visualize the outputs of Domains A and C and to compare them; this may reveal interesting facts about the hurricane models. We were unable to conduct extensive evaluations of the Isabel dataset and plan to do so in the near future. We also intend to explore ways to isolate a ‘parcel’ and track its behavior over time. Looking at techniques to display particle interaction could also be rewarding. Formal user studies are needed to validate how useful virtual environments are for studying hurricanes. We intend to work more closely with hurricane experts and improve our understanding of hurricanes to produce better visualizations.
Acknowledgement This work was supported under NOAA award NA06OAR4320264 06111039 to the Northern Gulf Institute. Roger Smith’s help was invaluable as a system administrator for the cluster and the VERTEX hardware. Additionally, he provided excellent photographic support.
References 1. Tropical Cyclone Report, Hurricane Lili, National Weather Service - National Hurricane Center (July 2008), http://www.nhc.noaa.gov/2002lili.shtml 2. Tropical Cyclone Report, Hurricane Isabel, National Weather Service - National Hurricane Center (July 2008), http://www.nhc.noaa.gov/2003isabel.shtml 3. Zhang, X., Xiao, Q., Fitzpatrick, P.J.: The Impact of Multisatellite Data on the Initialization and Simulation of Hurricane Lili’s Rapid Weakening Phase. Monthly Weather Review 135, 526–548 (2007) 4. Software Integration and Visualization Office, NASA / Goddard Space Flight Center (July 2008), http://sivo.gsfc.nasa.gov/
5. Towers in Tempest, NASA / Goddard Space Flight Center Scientific Visualization Studio (July 2008), http://svs.gsfc.nasa.gov/vis/a000000/a003400/a003413/index.html 6. IEEE Visualization 2004 Contest (July 2008), http://vis.computer.org/vis2004contest/ 7. Elboth, T., Helgeland, A.: Hurricane Visualization Using Anisotropic Diffusion and Volume Rendering. In: 5th Annual Gathering on High Performance Computing in Norway, May 30-31 (2005) 8. Zhang, K., Chen, S., Singh, P., Saleem, K., Zhao, N.: A 3D Visualization System for Hurricane Storm-Surge Flooding. IEEE Computer Graphics and Applications 26(1), 18–25 (January/February 2006) 9. Benger, W., Venkataraman, S., Long, A., Allen, G., Beck, S.D., Brodowicz, M., MacLaren, J., Seidel, E.: Visualizing Katrina - Merging Computer Simulations with Observations. In: Kågström, B., Elmroth, E., Dongarra, J., Waśniewski, J. (eds.) PARA 2006. LNCS, vol. 4699, pp. 340–350. Springer, Heidelberg (2007) 10. The MM5 Community Model (July 2008), http://www.mmm.ucar.edu/mm5/ 11. Zou, X., Xiao, Q.: Studies on the Initialization and Simulation of a Mature Hurricane Using a Variational Bogus Data Assimilation Scheme. J. Atmos. Sci. 57, 836–860 (2000) 12. The Weather Research and Forecasting Model (July 2008), http://www.wrfmodel.org/wrfadmin/publications.php 13. Hurvich, L.M., Jameson, D.: An opponent-process theory of color vision. Psychological Review, 384–404 (November 1957) 14. vGeo 3.0, VRCO, Mechdyne Corporation (July 2008), http://www.mechdyne.com
Immersive 3d Visualizations for Software-Design Prototyping and Inspection Anthony Savidis1,2, Panagiotis Papadakos1, and George Zargianakis2 1
Institute of Computer Science, Foundation for Research and Technology – Hellas 2 Department of Computer Science, University of Crete {as,papadako,zargian}@ics.forth.gr
Abstract. In software design, the use of physical CRC cards (Classes – Responsibilities – Collaborators) is a well-known method for rapid software-design prototyping, relying heavily on visualization and metaphors. The method is commonly applied with heuristics for encoding design semantics or denoting architectural relationships, such as card coloring, size variations and spatial grouping. Existing software-design tools are very weak in terms of interactivity, immersion and visualization, focusing primarily on detailed specification and documentation. We present a tool for visual prototyping of software designs based on CRC cards offering 3d visualizations with zooming and panning, rotational inspection and 3d manipulators, with optional immersive navigation through stereoscopic views. The tool is accompanied by key encoding strategies to represent design semantics, exploiting spatial memory and visual pattern matching and emphasizing highly interactive software visualizations.
1 Introduction In software development, visualizations support various related activities, like supervising design structures, extrapolating implementation details, or observing a system’s runtime behavior. Our work mainly concerns computer-assisted software design with highly interactive 3d visualizations, enabling immersive navigation, while supporting encoding of design semantics through custom visual patterns and metaphors. Currently, there are numerous tools capable of automatically visualizing design-related aspects from the source code structure, from modeling / design data (e.g. UML), or from other forms of program meta-information. For instance, the 3d Java code visualization in [Fronk and Bruckhoff 2006] reflects the hierarchical implementation structure, while interactive 3d views allow visually querying quantitative aspects of the source code. Additionally, semantically-related groups of components may be identified (a sort of visual query) from program meta-information as highlighted areas of interest [Byelas and Telea 2006]. Sometimes visualizations are targeted at displaying aspects that enable programmers to detect ‘code bad smells’, as in [Parnin and Goerg 2006]. Apart from static properties, behavior visualizations enable reviewing dynamic characteristics, as in [Greevy et al. 2006] where traces of component instantiations and method invocations (messages) are rendered. Motivation. CRC cards [Beck and Cunningham 1989] have been extensively deployed as a visual software-design prototyping instrument apart from teaching and
process description. CRC cards emphasize the exploratory and visual nature of software-design prototyping, allowing heuristic visual encodings and symbolisms, like color or size variations (for classes and links) and post-it annotations carrying meta-information (e.g. brief documentation, implementation notes, etc.). Such heuristics, though not part of the original method as such, are crucial as they allow embodying important semantic information not otherwise expressible. While visualization tools exist to enable developers to better analyze the structure and behavior of existing systems, very little is done in supporting computer-assisted design as a visualization-centric activity offering advanced 3d features for design exploration and semantics encoding. Contribution. We present a tool for exploratory software-design prototyping, aimed to be used together with more comprehensive general-purpose design methods like UML and with other sorts of program visualizers, primarily supporting: (a) rapid visual software-design prototyping, with emphasis on effective interactive supervision and inspection; and (b) a corpus of tested visual encoding policies for software-design semantics to be applied during interactive visual design.
2 Related Work Quick CRC [Quick CRC 2001] is a window-based 2d tool resembling in style and process the construction of interactive UML class diagrams. It emphasizes textual specification and documentation, rather than rapid conduct, visual design and an exploratory process. EasyCRC [Raman and Tyszberowicz 2007] is a 2d tool offering very limited interactive facilities, with an extension regarding CRC cards to model scenarios using UML sequence diagrams. CRC Design Assistant [Roach and Vasquez 2004] is yet another graphical 2d tool, intended to support students in designing real-life applications. It allows editing class name, description, super-class, subclass, and responsibilities information. All such CRC-card tools provide functionality mimicking the practice of physical CRC cards in a 2d space, however, with no extra interactive flexibility. Practically, they are simpler forms of more comprehensive general-purpose design methods that one may adopt primarily for small-scale projects, if for some reason UML is not adopted.
3 Visual 3d Design Encoding and Exploration Typical interactive configuration features concern the control of visualization parameters like camera, planes, axes, and special feedback effects (see Fig. 1). The dialogue box for editing card attributes (bottom-right of Fig. 1) is a transparent non-modal billboard surface, allowing design actions like navigation and focus change to be freely applied while attribute updates are applied to the current focus card (class). Card visual attributes, like rotation angle, dimensions, color and positioning, are to be exploited by designers as heuristic symbolic vocabularies for encoding design semantics.
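Conceptually, each card thus carries a small record of visual attributes that the dialogue box and the 3d manipulators edit. The following sketch is purely illustrative; the field names and types are assumptions and do not reflect the tool's actual data model.

```cpp
#include <array>
#include <string>

// Illustrative record of a card's visual state; field names are assumed,
// not taken from the actual tool implementation.
struct CardVisualState {
    std::string className;            // CRC card title
    std::array<float, 3> position;    // placement in the 3d design space
    std::array<float, 3> rotation;    // Euler angles, used for rotation encoding (Sec. 3.3)
    std::array<float, 3> dimensions;  // width / height / thickness encoding (Sec. 3.1)
    std::array<float, 3> color;       // RGB, used for grouping / criticality (Sec. 3.2)
    int planeId;                      // distinct-plane grouping (Sec. 3.4)
};
```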
Fig. 1. From left-to-right: workspace / project menus (arrows denote activation), camera control, camera view options, in-card manipulators, and properties dialogue boxes
Fig. 2. Left: stereoscopic anaglyphic view; Right top: spatial rotation / repositioning / resizing via 3d manipulators; Bottom: outline views and links
The tools for 3d design are shown in Fig. 2: 3d rotation, repositioning and resizing are possible via special-purpose 3d manipulators appearing automatically when the respective in-card controls (see Fig. 1, middle) are clicked. Next we elaborate on the way interactive configuration of visual parameters for 3d cards allows designers to introduce informal visual vocabularies to express semantic properties of the software design. We recall that the emphasis is shifted from an agreed unified vocabulary of standardized semantics for detailed or exhaustive design, as in UML, to open ad-hoc policies for visual representation of semantic information with emphasis on visual, rapid, exploratory prototyping. We present a set of encoding strategies that we have applied and assessed in the course of real practice. The set is not closed; we aim to demonstrate the expressive power of visual tools in handling software design artifacts. 3.1 Encoding through Varying Dimensions With variations of dimension we may encode quantitative design attributes, such as individual source size, overall project size (in terms of files), or properties implying key programming benefits, like reusability and genericity. An important remark is that a design structure may well be annotated with information that is known or predicted prior to implementation, or that is consolidated after the implementation phase has been initiated and probably entirely completed. The encoding policies we have adopted are the following:
- Illustrate implementation size (proportional to width).
- Indicate implementation complexity (by height).
- Denote reusability potential (by thickness) – polymorphic algorithms, templates, generic classes.
- Signify comparative importance (larger size) – critical components may be given larger dimensions than others.
- Emphasize common dependencies (larger size) – when many components depend on the same single component, although this is evident from incoming links, we may also draw the target component with increased size.
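To make such policies concrete, card dimensions could be derived from simple code metrics once a design is annotated after implementation has started. The sketch below is only an illustration; the metric names, thresholds, and scale factors are assumptions and are not part of the tool.

```cpp
#include <algorithm>
#include <cmath>

// Illustrative mapping from code metrics to card dimensions (Sec. 3.1).
// Thresholds and scale factors are arbitrary choices for the sketch.
struct CardDims { float width, height, thickness; };

CardDims dimsFromMetrics(int linesOfCode, int cyclomaticComplexity,
                         bool isGenericOrTemplate)
{
    CardDims d;
    d.width = 1.0f + 0.5f * std::log10(1.0f + linesOfCode);          // implementation size
    d.height = 1.0f + 0.04f * std::min(cyclomaticComplexity, 50);    // complexity
    d.thickness = isGenericOrTemplate ? 0.5f : 0.15f;                // reusability potential
    return d;
}
```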
Fig. 3. Encoding components with varying dimensions in the software design of a typical server architecture
An example is provided in Fig. 3 for the design of a server, showing the protocol parser (left part, height denoting implementation complexity), distinct requested services (middle part, width denoting implementation size) and the service dispatcher (right part, thickness underlining potential reusability). 3.2 Encoding through Colors Color encoding is known to be amongst the most widely deployed methods to imply semantic information, capable of denoting grouping and classification. Coloring was also proposed as an extension to the visual vocabulary of UML. Suggested color encodings are:
- Denote architectural grouping for classes.
- Indicate mission criticality with high-intensity colors.
- Signify common categories, like storage, UI, etc., with distinct reserved colors.
- Highlight inheritance properties, like abstract classes, interfaces, generic derived classes (i.e., mix-in inheritance).
Practically, color encoding alone is not sufficient for effective recognition of distinct artifacts and group relationships in severely crowded software-design spaces, unless appropriately backed up with high-quality inspection / exploration facilities. For example, assume a designer working locally in a particular design context who needs to temporarily suspend current activities to quickly review another part of the design
space using color-encoding knowledge, and then return to the previous context to resume design work. In general, such a sequence of suspend, switch context, and resume design steps is common in design processes, reflecting the fact that designers frequently assess or recall interrelationships with other parts of a software system. In our tool, such switching is both fast and usable due to the immersive animated navigation facilities. In particular, the designer may quickly zoom out, spot the target context via color encoding, zoom in to focus and inspect it temporarily, and then apply zoom-out and zoom-in to return to the previous context. Alternatively, the designer may bookmark the current position so as to return automatically after such design reviewing activities. Multiple spatial bookmarks are supported, with a typical circular iteration style as in most document editors; we plan to include named bookmarks that include a displayed editable text-field. A similar sequence would also be performed using UML visual design tools. However, the offered zoom / focus control navigation facilities are not as fast and immersive, resulting in valuable time being lost to handle design context switching. Moreover, when it comes to comprehensive and crowded design spaces, all UML design elements must be arranged in a single plane, meaning the inspection area (view frustum) becomes far larger in comparison to our system. 3.3 Encoding through Rotations Spatial rotation of objects may imply emphasis, distinction, and semantic separation, and can be used to directly attract visual attention (we tested that planar rotations over 30° are very clear). Strong variations in angles or identical / similar spatial rotations on a group of items are easily and quickly recognized by human vision. Some of the semantic aspects we encoded via rotations are provided below:
- Indicate inheritance properties.
- Highlight specific functional roles or overall mission, e.g. communication, user interface, etc.
- Illustrate particular algorithmic category, such as numeric computations, search algorithms, or pattern matching.
- Emphasize reusability properties (e.g. template classes).
- Denote design volatility (e.g. in progress, incomplete design, under refactoring, under argumentation).
A few scenarios are provided in Fig. 4, showing how spatial rotations can be deployed. In some cases, an encoding policy allows the design semantics to be conveyed through an appropriate metaphor. For instance, the choice of a rotated placement for proxy classes in Fig. 4 - top right, depicts the social metaphor of proxies as intermediaries lying in between other modules while physically facing both of them. However, in most cases the choice of rotation encoding is by convention rather than metaphorical, like the encoding policy for super classes shown in Fig. 4 - bottom left. The usefulness of rotated views for classes concerns primarily large designs, where speed of artifact inspection, detection and recall is very crucial.
Fig. 4. Sample encodings via rotations: UI (top left), super classes (bottom left), proxies (top right) and library modules (bottom right)
In particular, rotations enable designers to capture key design aspects directly from the design overviews (viewing with a far camera in our system) without requiring a closer focus, where extra information, possibly unnecessary for the task at hand, clutters the display. The same benefit cannot always be gained by color encoding, since colors cannot be freely used to represent all sorts of design aspects: overuse of color may result in less usable and understandable design images. In practice, blending color encoding with alternative visual encoding methods works better than using color encoding alone. 3.4 Encoding through Distinct Planes Placement of items on the same plane is a way to emphasize grouping in a 3d world. When combined with other visual grouping methods, distinct subgroups in a master group may become easily perceived. For instance, two orthogonal planes of cards, initially encoded with the same color, can be very quickly recognized from a varying point of view. Another possibility is to use parallel planes. The geometrical placement of distinct planes is again a matter of chosen metaphoric representation: cubes, pyramids, plane layers, etc. may be designed. Overall, we have considered the following encodings related to planar placement:
- Signify (sub) grouping for a set of related classes.
- Outline architectural metaphors via planar topologies (e.g., layered, star, etc.).
- Illustrate architectural decomposition assigning distinct planes to architectural components.
- Emphasize segregation or exclusion (e.g. classes under consideration for inclusion in the final design).
Fig. 5. The ‘extension wall’ concept (left-bottom, right-left) realized as a distinct plane of the main design (partially shown to the right of the extension wall)
Fig. 6. (a) Views of a design organized in two planar groups for architectural components (top left, top right, bottom left); (b) component grouping with a pyramid metaphor (bottom right)
The ability to populate the design space with information that is not yet part of the design itself is a very important feature. In particular, it enables designers to introduce classes or packages that are still under consideration without shifting the focus of attention from the main design space, like the ‘extension wall’ shown in Fig. 5. The main design of Fig. 5, partially shown in the right part of the sub-images, is the tool design itself (outlined in Fig. 8). The last encoding shows that 2d architectural metaphors may be deployed in a 3d space through extra grouping by distinct planes. For example, in Fig. 6 we use two parallel planes to illustrate architectural decomposition: (i) components (planes); and (ii) sub-components (classes in a plane). Additionally, a
more comprehensive spatial organization metaphor like a pyramid is chosen for Fig. 6 – bottom right. Such a spatial organization, supporting real-time inspection from any point of view, allows the visual complexity of the design image to be severely reduced when compared to traditional planar topologies. Again, the organization of the design into distinct planes may also be combined with other types of encodings like planar / spatial rotations, varying dimensions, color differentiation (Fig. 6, top right), etc. 3.5 Encoding through Spatial Arrangement Free placement at distinct spatial positions allows realizing a desirable topological pattern where the specific placement of design items denotes distinct semantic roles. Such a topological pattern may be known a priori, or may be totally heuristic, derived by designers after experimenting with alternative placements. There is actually no need to encode anything in particular during such a process, since the primary objective is to derive more usable and understandable representations:
- Reflect metaphors of architectural organization, like layered structures, sequential processing, etc.
- Illustrate role categories, like the use of depth-sorted placement: e.g., placing I/O classes close to the camera.
- Emphasize work in progress, such as placing classes to be elaborated later (i.e. ‘todo’ stuff) behind others.
Fig. 7. The incremental spatial reform of the originally planar Observer design pattern
As an example illustrating the benefits of spatial topologies, we reform a view of the Observer design pattern - originally in UML (see Fig. 7 - top left). In this view, the fact that concrete observers access the subject is not shown: if we introduce the necessary links for the latter, its UML image gets a little cluttered. One possible transformation to a 3d CRC design is provided in Fig. 7 - label 1. In the 3d view we clearly illustrate the relationship between the Observer superclass and its concrete derivatives in a
distinct plane, while putting the observed Subject class at a different spatial position (below). The different associations become more evident. The deployment of the pattern in an application is provided in Fig. 7 - right part, showing the introduction of a client class (top) that encompasses concrete observers for subject rendering purposes. Alternative topologies with extra encodings for the Observer pattern are also included in Fig. 7. For instance, we may organize concrete observers as parallel cards (label 2), or place concrete observers around the subject (placed in the middle) to give a social-metaphor connotation (label 3). 3.6 Animated Immersive Inspection While the support for navigation is not related to visual encoding facilities, it is essential in enabling the effective and efficient conduct of the design activities. We put emphasis on activities such as: design exploration, reviewing of relationships and dependencies, shifting focus of attention, viewing the design context of spatially proximate classes, and effective visual control over large design structures. Automatic animated reviews are initiated through a specific mouse gesture; the animation speed is proportional to the speed of the mouse gesture, e.g., if the designer makes a gesture very fast, the animation is also fast. We support reviews for model rotation, zooming, and panning. An example is provided in Fig. 8, demonstrating the way complexity is reduced with spatial organization and visual encoding, and how spatial navigation is an effective global supervision and exploration activity.
Fig. 8. Views of the Flying Circus design in itself – miniature resident views are also supported, shown at top right
Mentally linking a design structure with a visual pattern is crucial for design memorization. In this context, deriving an appropriate visual pattern (like the one shown in Fig. 8) is far easier to accomplish with a spatial topology than with planar structures, meaning designers have far more chances to structure suitable design representations. Additionally, inspection and identification of component dependencies is an activity that is emphasized in our tool. For instance, bilateral method invocations may introduce undesirable coupling that should be rectified. Clearly, visual recognition through link arrows is far more efficient than extrapolating dependencies from the source code. The advantage over planar associations is that a dependency is more easily assimilated when seen from varying perspectives (see Fig. 9).
Fig. 9. Inspecting call-dependencies and cross-invocations from different perspectives via auto-motion flying cameras
3.7 Encoding through Spatial Labeled Links 3d connections amongst spatially distinct collections of classes are directly perceived by designers, something hardly possible in a respective 2d structure, making the identification of component or package inter-dependencies easier. Spatial links, when combined with particular arrangement policies for grouped classes, allow component cross-dependencies to be depicted in an emphatic way. In particular, our tool allows inspecting call-dependencies and cross-invocations from different perspectives via auto-motion flying cameras. Additionally, link labeling can be particularly useful since it not only allows specializing the expectations raised for the target class (with whom the source collaborates), but also allows embedding a micro-language (scripting) in the labels so as to carry extra information to be interpreted by accompanying tools. Typical labels we have used to convey important design information are:
- ISA, MIXIN: link target is a base class (normal inheritance), or the link source is a generic derived class (mixin inheritance), respectively.
- HAS, HASMANY: link target is a constituent object (i.e. link source is aggregate).
- CALLS<what>: a method of the link target is invoked by the link source; by collecting together all incoming links for a given class we may gain its public exported interface.
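Because the labels form a small, regular vocabulary, accompanying tools can interpret them mechanically. The following parsing sketch is illustrative only; the actual label grammar of the tool is not specified beyond the examples above.

```cpp
#include <optional>
#include <string>

// Illustrative link-label kinds corresponding to the vocabulary above.
enum class LinkKind { ISA, MIXIN, HAS, HASMANY, CALLS };

struct LinkLabel {
    LinkKind kind;
    std::string argument;   // e.g., the <what> part of CALLS<what>
};

// Parse a label such as "ISA", "HASMANY" or "CALLS<render>".
// Returns std::nullopt for labels outside the assumed vocabulary.
std::optional<LinkLabel> parseLinkLabel(const std::string& text)
{
    if (text == "ISA")     return LinkLabel{ LinkKind::ISA, "" };
    if (text == "MIXIN")   return LinkLabel{ LinkKind::MIXIN, "" };
    if (text == "HAS")     return LinkLabel{ LinkKind::HAS, "" };
    if (text == "HASMANY") return LinkLabel{ LinkKind::HASMANY, "" };
    if (text.rfind("CALLS<", 0) == 0 && text.back() == '>')
        return LinkLabel{ LinkKind::CALLS, text.substr(6, text.size() - 7) };
    return std::nullopt;
}
```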
Editing the visual design and topologically reorganizing classes and links, to make the design more readable, clearly presents far more possibilities in the 3d space compared to the 2d world, since overlapping of links can be eliminated with appropriate card placements. Although the 3d structure is still projected on the screen as a planar artifact, the facilities for navigation allow designers to inspect their designs from varying angles and zooming factors so that a chosen link does not overlap with others on the screen. An example of the ease of manipulating spatial links is provided in Fig. 10, showing how the left part is transformed to the middle part by repositioning classes with the aim of increasing visual comprehension, by making links cleaner and revealing a hierarchical association discipline. From the interaction point of view, such a transformation takes only a couple of minutes to accomplish. Combined with semantic grouping, the links may convey different types of dependencies at the architectural
level: (i) planar links inside the same groups represent intra-component associations; and (ii) spatial links among distinct groups denote inter-component dependencies. The encoding method to denote grouping could be coloring or planar placement.
Fig. 10. Rearranging links to make dependencies more clear (top), adoption of spatial links for invocations (bottom left), and various spatial linking topologies in different designs (bottom)
4 Conclusions and Future Work In software design, the use of visual metaphors and symbolisms is a well-known, common practice. We frequently seek architectural structures that encompass the key semantic aspects while being simultaneously more understandable and easier to memorize. The latter is crucial not only during the early phases, where initial designs are shaped, but also as systems continue to grow and evolve. In this context, visualizations are a sort of visual syntax for design semantics, affecting the way we assimilate, memorize, recall, reuse and adapt design structures. The latter relates to the human mechanism of cognition, since the perceived complexity of a system of objects is affected by the way we represent the involved artifacts and relationships. Linked to this, we consider that our capacity for spatial memory, visual pattern recognition, and spatial orientation needs to be further exploited in a software design context. Technically, the focus of our work was far from questioning the usefulness of popular and practically proven methods such as UML. Instead, our primary objective has been to identify, develop
and assess methods enabling the visual conduct of design prototyping with emphasis on exploration and semantics encoding. Our results may be incorporated in the form of specialized tool-boxes within existing design environments. We were motivated by the fact that software design, a critical activity of the software life cycle, is still carried out with instruments offering traditional graphical means of interaction that have not essentially progressed within the last decade. In the meantime, for code analysis and behavior monitoring / observation there are good tools offering very advanced visualization features. We consider this to be an imbalance that cannot be merely addressed by having more automatic post-design or post-coding visualization tools. We consider that the design activity should be genuinely supported as an exploratory, visualization-centric process. Clearly, in such a context, automatic visualizers always play a crucial role as they reveal alternative system perspectives. [The tool is available from: http://www.ics.forth.gr/hci/files/plang/FLYINGCIRCUS.ZIP]
References Quick CRC (2001), http://www.excelsoftware.com/crccards.html Beck, K., Cunningham, W.: A laboratory for teaching object-oriented thinking. In: Proceedings of the ACM OOPSLA 1989 Conference, New Orleans, Louisiana, pp. 1–6. ACM Press, New York (1989) Byelas, H., Telea, A.: Visualization of areas of interest in software architecture diagrams. In: ACM Symposium on Software Visualization (SoftVis), pp. 105–114. ACM Press, New York (2006) Fronk, A., Bruckhoff, A.: 3d visualization of code structures in Java software systems. In: ACM Symposium on Software Visualization (SoftVis), pp. 145–146. ACM Press, New York (2006) Greevy, O., Lanza, M., Wysseier, C.: Visualizing live software systems in 3d. In: ACM Symposium on Software Visualization (SoftVis), pp. 47–56. ACM Press, New York (2006) Parnin, C., Goerg, C.: Lightweight visualizations for inspecting code smells. In: ACM Symposium on Software Visualization (SoftVis), pp. 171–172. ACM Press, New York (2006) Raman, A., Tyszberowicz, S.: The EasyCRC tool. In: Proceedings of the IEEE International Conference on Software Engineering Advances ICSEA 2007, pp. 52–52. IEEE Press, Los Alamitos (2007) Roach, S., Vasquez, J.: A tool to support the CRC design method. In: Proceedings of the International Conference on Engineering Education (2004), http://www.succeed.ufl.edu/icee/Papers/339_Roach-Vasquez(1).pdf Wirfs-Brock, R., McKean, A.: Object Design-Roles, Responsibilities, and Collaborations. Addison-Wesley, Reading (2003)
Enclosed Five-Wall Immersive Cabin Feng Qiu, Bin Zhang, Kaloian Petkov, Lance Chong , Arie Kaufman, Klaus Mueller, and Xianfeng David Gu Center for Visual Computing and Department of Computer Science Stony Brook University, New York, 11794, USA {qfeng,bzhang,kpetkov,ari,mueller,gu}@cs.sunysb.edu,
[email protected]
Abstract. We present a novel custom-built 3D immersive environment, called the Immersive Cabin (IC). The IC is fully enclosed with an automatic door on the rear screen, and is thus very different from existing CAVE environments. Our IC, the construction of the projection screens and stereo projectors, as well as the calibration procedure, are explained in detail. The projectors are driven by our Visual Computing cluster for computation and rendering. Three applications that have been developed on the IC are described: 3D virtual colonoscopy, dispersion simulation for urban security, and 3D imagery and artistic creations.
1 Introduction Visual immersive exploration can be an immensely valuable tool in enhancing the comprehensibility of raw and derived data in a diverse set of applications, such as battlefield simulation, health care, engineering design, and scientific discovery. It exploits the brain’s remarkable abilities to detect patterns and features in 3D visuals and to draw inferences from these. As a result, human interaction with 3D visual support is central to the analysis and understanding of complex data, and this requirement is only escalated by the current trend toward increasingly large datasets and complexities. In light of this fact, there is an urgent need for visualization technologies more powerful than those in existence today, as was also pointed out in a seminal report entitled “Data and Visualization Corridors”, cosponsored by DOE and NSF [1]. In their executive summary, the authors state: “There is a real need for design of effective human interfaces for 3D immersive environments, where conventional keyboards and mice are ineffective”. In response to this challenge, we established a novel 3D immersive environment, called the Immersive Cabin (IC), inspired by the Fakespace Immersive Workbench with Crystal River Engineering 3D sound, SensAble Technologies haptic Phantoms and a pair of PINCH Gloves as input devices, and three Ascension trackers [2]. It embraces a visualization paradigm known as ‘Spatially Augmented Reality’ (SAR), in which any object in the physical environment can double as a passive display surface. The user is free to roam around in this environment and interact with these physical objects in a multitude of ways. We strongly believe that the concepts implemented in our IC will set new trends and will create new opportunities for a wide range
Corresponding author.
of domain applications for which the growth of current visualization technologies has started to level off. At the same time, it will also put in place a new platform and visual interface for education, training, simulation, engineering, industrial and architectural design, homeland security, and the arts.
2 Equipment and Construction The IC is an active stereo projection system that consists of three groups of components: projection screens and projection floor; active stereo projector pairs; and computers to drive the projectors. It is a fully enclosed system with a 29” wide door on the rear projection screen. The door is operated by an Otodor automatic door opener from Skylink, Inc. Augmented reality capabilities inside the IC are implemented through the use of a foldable wood table with a projection table top. 2.1 Projection Screens and Projection Floor To optimally utilize the available 21 × 28 ft lab space, we placed the IC diagonally in a 21 × 21 ft region of the lab with the four projector pairs in the four corners (see Fig. 1). To preserve the native 4:3 aspect ratio of the Hitachi CP-SX1350 SXGA+ (1400×1050) multimedia projectors, the maximum size of the front, right, left, and back projection screens is 9.67 × 7.25 ft when wide-angle lenses (0.8:1) are used with the four projector pairs. Therefore, the floor projection area is 9.67 × 9.67 ft, and zoom lenses (1.1-1.5:1) are used with the floor projector pair. The height of the four projection screens is lower than the 3 m walls of many existing CAVE systems. However, due to the relatively small size of the IC and the pure black background, users have not complained or looked
Fig. 1. Computer model of the IC with the rear door open
Fig. 2. (a) Rear projection door shown in open position, and (b) Otodor automatic door opener
over the screens. The distance from the lens of a wall projector to its corresponding projection screen is 7.72 ft. The mirror for floor projection is 55 × 45” and is hung 8.67 ft above the ground. The floor projector is placed 2 ft away from the mirror. The TekPlex 100 acrylic rear projection screen from Stewart Filmscreen Corporation is used to construct the front, right, left, and back projection walls. A major difference between our IC and existing CAVE systems is that the IC is entirely enclosed by the four projection screens. To allow access, one of the projection screens has an automatic door (see Fig. 2). The door is made of the same acrylic material and its frame has a vertical rotating spindle. The spindle is carefully installed several inches away from the corner so that the automatic door does not rotate around the corner. Therefore, the distance between the door and the two neighboring walls is minimal when the door is closed. Fig. 3 shows the projection screen with the door closed and open. The four vertical screens are suspended from the ceiling onto a wooden front projection platform to form the IC (see Fig. 4). To suspend and stabilize the four vertical projection screens, steel bars are cut and mounted to the concrete ceiling so that 15 screen hanging cables and aluminum trusses could be mounted. The aluminum trusses are used to mount cameras, lights, trackers, and speakers. Then the floor in the center of an area sized 21 × 21 ft is leveled with epoxy flooring, a wooden front projection platform is installed, and the front, right, and left projection screens are attached to the hanging cables.
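As a consistency check on these dimensions, the standard throw-ratio relation (throw distance = throw ratio × image width) applied to the wide-angle lenses and screen size of Sec. 2.1 gives approximately the stated wall-projector distance, and the screen size matches the projectors' native aspect ratio:

\[
0.8 \times 9.67\,\text{ft} \approx 7.74\,\text{ft} \approx 7.72\,\text{ft}, \qquad \frac{9.67\,\text{ft}}{7.25\,\text{ft}} \approx 1.33 = \frac{4}{3}.
\]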
Fig. 3. (a) The door of the rear screen in closed position, and (b) the door shown in open position
Fig. 4. (a) Screen hanging cables, aluminum trusses mounted to the ceiling, and (b) aluminum trusses painted in black to minimize light reflection and the floor projection mirror
Fig. 5. (a) Floor Beacon system, and (b) a rear Beacon system on the ground
2.2 Active Stereo Projector Pairs Five SX+ Beacon Active Stereo Synchronization systems from Fakespace are used to project images onto the walls and the floor. Each Beacon system contains one pair of Hitachi CP-SX1350 SXGA+ projectors installed on Fakespace 5 DOF projection alignment platform, Stereographics EXXR IR Emitter, and a control system from Crestron Electronics, Inc. Fig. 5 shows the floor beacon system and a rear beacon system for a wall. Stereographics Crystal Eyes 3 LCD stereo glasses and NuVision 60GX LCD stereo glasses are used with the Beacon systems to provide stereo. 2.3 Computers to Drive the Projector Pairs The five Beacon Systems are driven by 10 display nodes from our Visual Computing cluster (see Fig. 6), which contains 66 high-end Linux/Windows dual boot workstations, a gigabit Ethernet frontend network, and each computing/display node is connected to an InfiniBand network. Each projector is connected to the DVI port of a display node via Gefen HDTV DVI-D fiber optic cable. Each of the first 34 nodes of our Visual Computing Cluster contains dual-Intel Xeon CPUs, 2.5GB memory, 160GB hard disk, nVidia Geforce FX5800 Ultra graphics, an
Fig. 6. (a) Our Visual Computing Cluster, and (b) display nodes of our Visual Computing Cluster to drive the IC
Fig. 7. The calibration pattern
onboard Intel gigabit network interface card, a Terarecon VolumePro 1000 volume rendering board with 1GB memory, and an InfiniBand network card. Each of the other 32 nodes contains dual-Intel Xeon CPUs, 2GB memory, 160GB hard disk, nVidia Quadro FX4400 graphics, and gigabit and InfiniBand network cards. 2.4 Integration and Calibration To install and integrate the projection screens and projection floor portion of our IC, we pursued the following workflow: install mounts to the concrete ceiling; install and paint aluminum trusses; install floor projection enclosure; install screen hanging cables; level floor projection area; install wooden floor projection platform; install front, right, and left projection screens; install floor Beacon system onto floor projection enclosure; install fixed portion of rear projection screen; and install and align the rear screen door and wireless door opener. To set up the Beacon systems, we first installed the floor Beacon system onto a floor projection enclosure; set up front, right, left, and rear Beacon systems onto the projection stands and finally connected each projector to its driving computer using the Gefen HDTV DVI-D fiber optic cable. Fig. 7 shows the pattern used to calibrate the projectors. We have manually adjusted projector settings including keystone, focus, zoom, brightness and position (minor lens position) until: (1) lines in the pattern images are straight, evenly distributed, and either parallel or perpendicular; (2) pattern images of each projector pair are aligned; (3) pattern images of any two neighboring screens are aligned and continuous; and (4) the brightness of five projection screens is uniform. A foldable table is installed inside the IC to facilitate additional display and user interaction, specifically with haptic devices, as shown in Fig. 8.
3 Applications 3.1 3D Virtual Colonoscopy The 3D Virtual Colonoscopy (VC) is a non-invasive computerized medical procedure for examining the entire colon to detect polyps, the precursor of colon cancer [3,4].
Fig. 8. (a) Projector for the table, (b) folding wood table; and (c) haptic device on the table
While the patient is holding his or her breath, a helical CT scan of the patient’s abdomen is taken, capturing a sequence of 2D slices which covers the entire range of the colon. This scan takes 30-45 seconds and produces many hundreds of transaxial slices of 512×512 pixels, which are subsequently reconstructed into a 3D volume of 100-250 MB. Then, the colonic surface is extracted, and the physician virtually navigates inside the colon to examine the surface for polyps. Interactive 3D virtual colonoscopy comprises two primary components: camera control and interactive rendering. Camera control essentially defines how the physician navigates inside the colon. A desirable camera control should enable the physician to examine the surface easily and intuitively, and prevent the camera from penetrating through the colon wall. In our system, the centerline of the colon lumen is defined as the default camera path. During the navigation, the colon interior is displayed using volume rendering methods. Volume rendering provides a more accurate mucosa display and other rendering effects, such as using a translucent transfer function for electronic biopsy images. One important requirement in VC is that the physician must inspect the entire colon surface during navigation. However, due to the bends and haustral folds on the colon surface, the physician needs to navigate from the rectum to the cecum, and then in the reverse direction for full coverage. VC typically uses a perspective projection with an 89 degree field of view, which covers only about 77% of the colon surface area in the retrograde direction [5]. In comparison, our IC provides a 360 degree panoramic view inside the colon that substantially improves the colon surface coverage and investigation performance. The physician may use the five walls, especially the rear screen, to inspect the entire colon surface in a single navigation pass. 3.2 Dispersion Simulation for Urban Security Efficient response to airborne hazardous releases in urban environments requires the availability of the dispersion distribution in real time. We have developed such a system for simulation and visualization of dispersed contaminants in dense urban environments on the IC. Our simulations are based on a computational fluid dynamics model, the Lattice Boltzmann Method (LBM) [6]. The LBM simulates the fluid dynamics by microscopic collisions and propagations of fluid particle distributions on a discrete set of lattice nodes and links. These local distributions model the global behavior of the fluid and solve the Navier-Stokes equations in the incompressible limit. Lattice-based simulations have recently emerged as an efficient and easy-to-implement alternative to simulations using finite elements. One advantage is the simple handling of
Fig. 9. (a) 3D colon model displayed in the IC, (b) navigation inside a colon in the IC
boundary conditions for complex and dynamic objects inside the flow field. Also, the highly local computations allow for efficient implementations on parallel architectures. These advantages position the LBM as the ideal method for simulating dispersion in dense urban environments, with the computations accelerated on the GPU to achieve high simulation performance and interactive visualization of the evolving system. In practice, the simulation domains needed for open urban environments are too large for the local memory of a single GPU. In addition, rendering to the walls of our IC at interactive rates can be a challenge for even the fastest GPU, especially when advanced smoke rendering techniques are employed. Therefore, we have implemented our LBM simulation and visualization using 16 nodes of the Visual Computing GPU cluster. The algorithm for partitioning and distributing the simulation data is based on the work of Fan et al. [7]. Visualization is also performed on the cluster nodes, since the simulation results already reside in the local GPU memory. We render the density volumes using volumetric ray tracing with single scattering and composite the resulting images from each node. The final image is overlaid on a mesh-based rendering of the urban environment. Buildings in this rendering pass are textured with memory-efficient synthetic facades that exhibit a high level of perceived detail. We demonstrate our simulation and visualization system with the Times Square area dataset, which covers New York City from 8th Avenue to Park Avenue and from 42nd Street to 60th Street. This region is approximately 1.46 × 1.19 km and contains 75 blocks and more than 900 buildings. Fig. 10 shows the rendering of smoke at different states of the simulation. By effectively utilizing the computational power of our GPU cluster, our system can achieve highly interactive framerates, as well as 360 degree immersive visualization of complex dynamic flow in a dense urban setting. 3.3 3D Imagery Research and Artistic Creations The IC is further used to explore scientific research on 3D image generation and artistic expression in virtual space. Our effort not only assists academic work and communication in scientific research but also supports other areas, especially the humanities. We designed the stereo image gallery as an expandable forest, divided into sections according to the contents and the archival time of the virtual specimens. The gallery
Fig. 10. Smoke dispersion simulated in the Times Square area of New York City: (a) and (b) bird’s-eye views, with the wind blowing from right to left; (c) a snapshot during navigation
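The LBM update outlined in Sec. 3.2 alternates a local collision step with a streaming step that propagates distributions along the lattice links. The following sketch shows a single-relaxation-time (BGK) collision for one node; it is schematic only: a D2Q9 lattice is used here for brevity, whereas a 3D lattice would be used for an urban domain, and the actual GPU cluster implementation follows the partitioning of [7].

```cpp
#include <array>

// D2Q9 lattice constants (illustrative; an actual urban simulation is 3D).
constexpr int Q = 9;
constexpr double w[Q] = { 4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                          1.0/36, 1.0/36, 1.0/36, 1.0/36 };
constexpr int cx[Q] = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
constexpr int cy[Q] = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };

// BGK collision for one node: relax the distributions f towards the
// local equilibrium determined by density rho and velocity (ux, uy).
void collide(std::array<double, Q>& f, double tau)
{
    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int i = 0; i < Q; ++i) {
        rho += f[i];
        ux += cx[i] * f[i];
        uy += cy[i] * f[i];
    }
    ux /= rho; uy /= rho;
    double usq = ux * ux + uy * uy;
    for (int i = 0; i < Q; ++i) {
        double cu = cx[i] * ux + cy[i] * uy;
        double feq = w[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq);
        f[i] += (feq - f[i]) / tau;   // relaxation towards equilibrium
    }
    // A separate streaming step then propagates each f[i] to the
    // neighboring node in direction (cx[i], cy[i]).
}
```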
Fig. 11. (a) Conformal brain mapping, and (b) deformation of a teapot
navigation is conceived to follow a 3D map of a virtual landscape, and a 4th-dimension “timeline” is in sync with the “growth” of the gallery forest. The first stage of the development has included 3D image representations of conformal brain mapping. A human brain cortex surface features complicated geometries. In order to register and compare different brain cortex surfaces, we have developed a novel conformal brain mapping method. Following principles from differential geometry, brain cortex surfaces can be mapped to the unit sphere preserving all angles. Such angle-preserving (conformal) maps form a six-dimensional Möbius transformation group. By mapping the brain to the canonical sphere, different brains can be registered and compared directly. We use the method proposed by Gu et al. [8], which is sufficiently general for registering other deformable shapes and for fusing medical imaging data sets of different modalities. Shape registrations and comparisons can be directly visualized, evaluated and adjusted in the IC, and the manipulation is intuitive and efficient. Stereoscopic photography has been utilized to document the architecture of our campus buildings as well as its permanent art display. The IC is also an ideal medium to plan and showcase the future developments of our campus. The high-tech Center of Excellence in Wireless and Information Technology (CEWIT) building is the first building
Fig. 12. (a) User manipulation of virtual images through a wireless tablet PC, and (b) virtual representation of chariots displayed on our campus
being built in our new R&D campus. The IC is being used as the perfect medium to display 3D models and animations of this new building, to visualize and facilitate the planning of the building architecture and interior design, as well as the purchasing and installation of scientific research facilities. The immersive IC has been indispensable in visualizing and planning the extremely complicated spatial relationships and the electromagnetic and optical influences between various devices for wireless and electromagnetic technologies.
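For reference, the six-dimensional group mentioned in Sec. 3.3 can be written explicitly. Under stereographic projection of the unit sphere to the extended complex plane, its orientation-preserving conformal automorphisms are the Möbius transformations

\[
\phi(z) = \frac{az + b}{cz + d}, \qquad a, b, c, d \in \mathbb{C}, \quad ad - bc \neq 0,
\]

which depend on three complex, i.e., six real, parameters after normalizing the common scale of the coefficients; the conformal map of a cortex surface to the canonical sphere is therefore unique only up to this six-parameter family, which presumably must be accounted for when different brains are registered.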
4 Future Work The calibration procedure described in Sec. 2.4 currently relies on manual adjustments to projection parameters with the help of a set of test patterns. We are developing a system for automatic calibration using high resolution static cameras which will greatly simplify the tasks of color and brightness matching and geometry correction. On the hardware side, we are also extending our IC with a wireless tracking system, surround sound, cameras and lighting systems. With these additions, automatic calibration would become even more important. One of the challenges in programming an immersive visualization system such as our IC is that existing tools and applications may require significant code base changes. We are currently investigating the use of a unified rendering pipeline based on a scene graph model, which will ease the development of our urban modeling and simulation projects. The visualization aspect can be handled by this framework, while the developers remain oblivious to the underlying hardware architecture of the immersive visualization system. We are currently developing a pipeline based on the OpenSceneGraph library [9] with COLLADA files [10] as the main format for asset management and interfaces for advanced shading and cluster based simulations and rendering. As we continue developing both the hardware and software aspects of our IC, it is important to also study the usability of our applications. In particular, the 3D virtual colonoscopy application described in Sec. 3.1 is of interest to the medical community and usability studies of the IC implementation will allow for further refinement of our techniques.
Acknowledgements This work was partially supported by NSF grants CCF-0702699, CCF-0448399, DMS0528363, DMS-0626223, NIH grant CA082402, and the Center of Excellence in Wireless and Information Technology (CEWIT). We thank Autodesk Inc. for providing the Maya software for modeling IC.
References 1. Smith, P., Van Rosendale, J.: Keynote address: Data and visualization corridors. IEEE Visualization, 15 (1999) 2. Wan, M., Zhang, N., Kaufman, A., Qu, H.: Interactive stereoscopic rendering of voxel-based terrain. IEEE Virtual Reality, 197–206 (2000) 3. Kaufman, A.E., Lakare, S., Kreeger, K., Bitter, I.: Virtual colonoscopy. Communications of ACM 48, 37–41 (2005) 4. Pickhardt, P.J., Choi, J.R., Hwang, I., Butler, J.A., Puckett, M.L., Hildebrandt, H.A., Wong, R.K., Nugent, P.A., Mysliwiec, P.A., Schindler, W.R.: Computed tomographic virtual colonoscopy to screen for colorectal neoplasia in asymptomatic adults. The New England Journal of Medicine (NEJM) 349, 2191–2200 (2003) 5. Hong, W., Wang, J., Qiu, F., Kaufman, A., Anderson, J.: Colonoscopy simulation. SPIE Medical Imaging 6511, 0R (2007) 6. Succi, S.: The lattice Boltzmann equation for fluid dynamics and beyond. Numerical Mathematics and Scientific Computation. Oxford University Press, Oxford (2001) 7. Fan, Z., Qiu, F., Kaufman, A., Yoakum-Stover, S.: GPU cluster for high performance computing. In: ACM/IEEE Supercomputing Conference (2004) 8. Gu, X., Wang, Y., Chan, T.F., Thompson, P.M., Yau, S.T.: Genus zero surface conformal mapping and its application to brain surface mapping. IEEE Transactions on Medical Imaging 23, 949–958 (2004) 9. Burns, D., Osfield, R.: Open scene graph a: Introduction, b: Examples and applications. IEEE Virtual Reality, 265 (2004) 10. Barnes, M.: Collada. ACM SIGGRAPH Courses, 8 (2006)
Environment-Independent VR Development Oliver Kreylos W. M. Keck Center for Active Visualization in the Earth Sciences (KeckCAVES), University of California, Davis
Abstract. Vrui (Virtual Reality User Interface) is a C++ development toolkit for highly interactive and high-performance VR applications, aimed at producing completely environment-independent software. Vrui not only hides differences between display systems and multi-pipe rendering approaches, but also separates applications from the input devices available at any environment. Instead of directly referencing input devices, e. g., by name, Vrui applications work with an intermediate tool layer that expresses interaction with input devices at a higher semantic level. This allows environment integrators to provide tools to map the available input devices to semantic events such as selection, location, dragging, navigation, menu selection, etc., in the most efficient and intuitive way possible. As a result, Vrui applications run effectively on widely different VR environments, ranging from desktop systems with only keyboard and mouse to fully-immersive multi-screen systems with multiple 6-DOF input devices. Vrui applications on a desktop are not run in a “simulator” mode mostly useful for debugging, but are fully usable and look and feel similar to native desktop applications.
1 Introduction Let us address the most important question immediately: there are already a large number of (mutually incompatible) VR development toolkits [1,2]; why create yet another one? The main rationale behind the development of Vrui was a perceived lack of portability of existing VR software. For example, an application written for a responsive workbench typically does not run in a CAVE, and a CAVE application cannot make use of two-handed interactions provided by a workbench. Applications written for either environment will not run on a desktop, or at least not in an effective manner. Recent history has shown that VR environments are not converging towards a common set of input devices – unlike desktop systems, which have converged to standard keyboards and multi-button wheel mice – but are becoming more diverse, especially due to experimental hardware and the introduction of new, sometimes problem-specific input devices. Although all VR toolkits hide differences in the display system (projection setup, stereoscopy, head tracking), and many hide differences in distribution (single-pipe, multi-pipe, cluster-based), we found none that offer a high enough level of abstraction when dealing with input devices. Most have the notion of a virtual input device [3] consisting of a 6-DOF location and a set of buttons and/or valuators and provide device drivers to virtualize tracking and input device hardware, and some contain proxies
to support changing the association between hardware and virtual devices on-the-fly. Nonetheless, applications are still written directly for the set of virtual devices provided by their “native” environments. For example, an application written for a responsive workbench would directly reference the left and right glove input devices, by polling their locations or reacting to button/valuator events. Such an application is only portable to VR environments that offer a very similar set of devices, and would not run in, say, a CAVE with only one wand device, let alone on a desktop with only a mouse and keyboard. Most VR toolkits contain simulators [4,5] to overcome this problem, especially for desktop environments. However, a simulator has to map an environment’s real input devices to the virtual devices expected by the application without knowing how the application uses those devices. As a result, simulators usually provide very unintuitive interaction, and are hardly useful for anything but basic debugging. The Vrui toolkit follows a different approach: applications do not directly reference (virtual) input devices, but react to higher-level semantic events such as navigation, menu selection, dragging, etc. Vrui contains a separate tool layer that maps from input devices to semantic events. Each type of VR environment can define its own set of tools, which allow embedding applications as if they were native to each environment. Since tools are task-specific, each environment can provide the most intuitive and effective mappings that are supported by its set of input devices. For example, a desktop environment could provide a mouse-based navigation tool using a trackball metaphor, whereas a CAVE environment could provide a navigation tool that directly uses the wand, and a workbench could provide tools using both hands to achieve complex tasks. As a result, VR applications can be used intuitively in any environment, including the desktop. Our experience shows that Vrui applications running on the desktop are often as effective as native desktop applications. This benefits users who use a particular VR application, but do not always want to run it on a (shared) VR environment. For example, scientists might want to use a VR visualization application at their desks first to quickly assess the quality of a data set, and then use the same application in a CAVE to pinpoint problem areas. The concern about limited access to shared VR environments and the separation between desktop and VR software has been one of the main obstacles to wider use of VR in the sciences. A related concern is that most VR toolkits do not offer higher-level interactions; as a result, each application developer has to code her own style of navigation, GUI interaction, dragging, etc. This situation is similar to the early days of GUI development on the X window system, where programmers had to invent their own ways of mapping behavior to mouse buttons or widgets, leading to little consistency across programs. Since Vrui tools are application-independent, they provide a consistent “look and feel” for applications. An additional benefit of the tool abstraction is that users can add new input devices to their environments without having to change their applications. For example, if a desktop environment is augmented with, say, a spaceball input device or a Nintendo Wii controller [6], all Vrui applications will immediately be able to use it without code changes or even recompiles.
A lot of work in VR is geared towards creating new input devices; alas, many promising ones never enter the mainstream because there are very few applications that can use them. Vrui’s tool layer can overcome this problem:
hardware developers only need to provide a set of tools that map their new device to common semantic events, and any application will immediately be able to use the device.

1.1 Aim and Focus of Vrui

Although Vrui's main contribution is the support for portable applications effected by its tool layer, it is a complete VR development toolkit based on OpenGL. It is written in C++ and geared towards developers who need to write highly interactive and high-performance VR applications that run on a wide range of VR systems, from desktops through single-pipe projected environments or HMDs to multi-pipe environments using shared-memory visualization servers or commodity clusters, or combinations of any of these components. From a user's point of view, all environment types behave alike. A Vrui application is invoked the same way on all systems, and a single executable can run on any VR environment as long as the computers used are binary compatible. The Vrui runtime determines the structure of the local environment at start-up and takes any steps necessary to run the application, e. g., replicating processes across a cluster or initializing multiple pipes in a shared-memory system.
2 Related Work

As already mentioned, there is a large number of VR development packages, ranging from low-level programming toolkits to content creation systems aimed at non-programmers. In this section, we list some representative examples and compare them to Vrui with regard to display abstraction, multi-pipe abstraction, and input abstraction. Bierbaum and Just's SIGGRAPH '98 course notes [1] contain a more extensive list.

CAVELib [7] is a fairly low-level programming toolkit originally aimed at CAVEs and SGI graphics servers, but it has since been ported to PCs and commodity clusters. CAVELib hides the display setup by running an individual render process for each screen and managing the OpenGL projection and a model space transformation. Applications provide callback functions that insert their own OpenGL code. If run on a shared-memory multi-pipe system, CAVELib creates a single application process that shares data with all render processes; on a cluster, CAVELib runs identical application instances on all nodes and synchronizes them by sharing input device data. CAVELib contains drivers for a large variety of VR hardware, but does not offer input abstraction beyond virtual input devices. It also does not offer a high-level geometry library or 3D GUI components. CAVELib has a desktop simulator that allows direct control of the head tracker and a CAVE wand either via keys or a GUI [4], but in practice this simulator is unusable beyond basic debugging.

VR Juggler [5] is a flexible programming toolkit with a level of abstraction similar to CAVELib. Display abstraction is based on the notion of projection surfaces and associated graphics pipes; application rendering code is invoked via virtual methods. Commodity clusters are supported by the Cluster Juggler [8] component, which uses application replication and synchronization by input distribution. VR Juggler uses a device driver and device classes to abstract VR hardware, but applications still have to reference virtual input devices directly by name. VR Juggler contains no high-level
geometry library or 3D GUI components. VR Juggler can be configured to match a wide range of VR environments, and it has a desktop simulator similar to CAVELib’s, with similar limitations. Although VR Juggler supports clusters via Cluster Juggler, the difference between clusters and other systems is not transparent to users: running a Cluster Juggler application requires manually starting application servers on each cluster node. Syzygy [2] is a programming toolkit aimed at commodity clusters. Syzygy has a split-personality approach to display and multipipe abstraction: it either offers a scene graph architecture, or cluster-based rendering very similar to Cluster Juggler. Applications can use one or the other, but not a hybrid of both approaches. As opposed to VR Juggler, cluster support is transparent; Syzygy contains its own distribution manager that automatically replicates applications invoked on the head node and provides reliable shutdown and even dynamic addition/removal of nodes. Input abstraction is handled similarly to VR Juggler, with the addition of filter chains that can process raw input device data from outside of an application. These filters can be used to implement environment simulators, but since applications still directly reference input devices and Syzygy offers no higher-level interactions, simulators (especially on the desktop) are still severely limited. The Responsive Workbench Simulator [9] is an interesting approach enabling desktop development of VR applications without requiring a simulator. Although the authors only aim to simulate a workbench environment on a desktop, their approach can easily be generalized to other environment types. The basic idea is to capture the input device data stream generated by a user interacting with a prototype of an application, and then give developers the ability to re-run the application on the desktop while playing back the captured stream to replicate errors and debug the application. The authors claim that application errors exposed by complex user interactions cannot be replicated in simulated desktop environments due to their limited interactivity; our experience has shown that a proper embedding of a VR application into the desktop can in fact expose the majority of bugs. However, since the ability to capture interactions is very useful for other purposes as well, such as movie generation, Vrui contains a device driver that can merge input data streams from arbitrary combinations of real and previously captured input devices. ITLib [10,11] is a framework to describe interaction techniques (ITs) in virtual environments separately from VR applications, and is similar to the input abstraction component of Vrui in aim and design. ITs are comparable to Vrui’s tools; both are used to translate input device data streams into application behavior, and both can be cascaded to form a data flow graph. The main differences are that ITs directly interact with 3D objects visualized by an application, whereas Vrui itself does not enforce an object model and leaves the connection between tools and application state to the application, that Vrui’s tool graph can be reconfigured at run-time by a user from inside the VR application, and that ITLib leaves processing of the IT flow graph to the application, whereas Vrui handles it internally. 
Another difference is that Vrui’s input graph is a bipartite graph of virtual input devices and tools, which we believe helps users manage flexibility such as using unbound input devices as “shortcuts,” as described in Section 3.4,
and that ITLib is primarily meant to provide a testbed for new interaction techniques, whereas Vrui’s tool layer is primarily meant to enhance portability.
3 System Architecture

The Vrui toolkit has a microkernel architecture. It provides basic services, and each of its associated managers is responsible for a related set of more complex functionality. Vrui applications communicate with the toolkit by calling kernel functions, or by querying references to managers and invoking their methods directly. A Vrui application hands the main thread of execution over to Vrui after initialization, and is afterwards executed at well-defined times by callback invocation. Applications are typically implemented as objects derived from a common base class provided by Vrui, in which case callbacks are implemented as method invocations. The following sections describe the Vrui kernel and each manager in more detail, with a focus on those managers that are involved in implementing Vrui's input and interaction abstraction.

3.1 Vrui Kernel

The Vrui kernel is responsible for toolkit initialization, coordinating its associated managers, running an application's main loop, maintaining a navigation transformation from application model coordinates to environment physical coordinates, and offering services to applications. During start-up, the Vrui kernel determines the layout of the local VR environment by reading an external configuration file, and initializes itself and its managers accordingly. If the local environment is cluster-based, the kernel will create instances of the application on each cluster node using remote execution, by default via ssh, and synchronize with the other instances. Finally, the kernel on each node will initialize the Vrui application itself. Cluster nodes communicate using a reliable broadcast- or multicast-based pipe abstraction providing very high bandwidth from the head node to the render nodes, and a low-latency barrier primitive for synchronization. Any Vrui component, such as managers or applications, can create private pipes for data transfer if desired. For example, the 3D Visualizer application [12] extracts visualization primitives such as slices or isosurfaces only on the head node, and broadcasts them to the render nodes using pipes.

The basic processing unit of the kernel is a frame, which is executed inside its innermost loop. The kernel starts each frame by updating the states of all managers, causing polling of input devices, tool processing, GUI event handling, etc., then invokes the application's per-frame callback, afterwards executes the display manager's draw function, causing invocation of the application's draw method, and finally synchronizes buffer swaps on all display windows. While all Vrui applications proceed through this sequence of frames, applications are encouraged to create any number of secondary threads to perform time-intensive computations asynchronously and in parallel to Vrui's main thread. An application's per-frame callback, invoked once per frame, mostly serves as a synchronization point in those cases. The 3D Visualizer application [12], for example, delegates extraction of visualization primitives such as isosurfaces to one secondary thread per primitive type, and synchronizes its display state with the secondary threads during the per-frame callback, leading to high frame rates and low latency even under heavy computational load.
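To make the frame sequence concrete, the following is a minimal C++ sketch of one kernel frame. All class and method names (Kernel, Manager, Application) are illustrative assumptions rather than Vrui's actual API; the sketch only encodes the order of operations described above.

```cpp
// Hypothetical sketch of one kernel frame, following the order described in
// Section 3.1. All names (Kernel, Manager, Application) are assumptions for
// illustration, not Vrui's real API.
#include <vector>

struct Application {
    virtual void frame() = 0;          // per-frame callback (synchronization point)
    virtual void display() const = 0;  // draw callback (per window/pipe in reality)
    virtual ~Application() {}
};

struct Manager {
    virtual void update() = 0;         // poll devices, process tools, handle GUI events
    virtual ~Manager() {}
};

class Kernel {
    std::vector<Manager*> managers;
    Application* app;
public:
    Kernel(std::vector<Manager*> m, Application* a) : managers(m), app(a) {}
    void runFrame() {
        for (Manager* m : managers)    // 1. update the state of all managers
            m->update();
        app->frame();                  // 2. invoke the application's per-frame callback
        app->display();                // 3. execute the draw function(s)
        // 4. synchronize buffer swaps across windows / cluster nodes (omitted)
    }
    void mainLoop() { for (;;) runFrame(); }  // innermost loop of the kernel
};
```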
3.2 Display Manager

The Vrui display system is defined by sets of screens, viewers, and windows. A screen describes a projection surface of the environment, e. g., a monitor, a projection screen, or an HMD monitor, by its position/orientation and size in physical space, and by whether it is attached to a 6-DOF tracker. A viewer describes a user or virtual camera by defining the position of its eye(s) and the position/direction of an optional headlight, relative to either a 6-DOF tracker or a fixed transformation. A window is a representation of a display window and its associated graphics pipe. Each window has an associated screen and an associated viewer; their relative position/orientation define the window's projection and view frustum. Windows are also responsible for creating stereoscopic images using one of several techniques, from anaglyphic stereo through several methods of creating passive stereo to quad-buffered active stereo.

The screen/viewer/window model can describe a wide range of VR environments. For example, a desktop system would use a single screen representing the monitor, a single (fixed) viewer, and a single (mono or stereo) window displaying the 3D scene. A CAVE would be represented by one screen for each wall, one head-tracked viewer, and one window for each wall. It is also possible to define hybrid/combined environments: for example, our lab has a tiled passive-stereo display wall with a head-tracked viewer and an additional active-stereo HMD as a secondary viewer. The Vrui display system ensures that the view through the HMD is always consistent with the view on the tiled wall.

3.3 Input Device Manager

The Input Device Manager is responsible for translating "flat" device data provided by device drivers into virtual input devices. "Flat" device data is an unorganized collection of 6-DOF tracker data, digital (button) data, and analog (valuator) data, gathered from one or more device drivers. A virtual input device is a collection of an (optional) 6-DOF location and any number of buttons and valuators associated with that location. For example, a classic CAVE wand would be represented as one 6-DOF location plus three buttons plus two valuators (joystick X and Y); a regular mouse would have a 6-DOF location representing its position in screen space, three buttons, and one valuator (mouse wheel speed). Vrui currently contains device drivers for mice and keyboards as reported by the windowing system, and a client for its own stand-alone (potentially remote) low-level device driver. This low-level driver works with a wide variety of tracking systems and input devices, including desktop devices such as joysticks and spaceballs, but creating a Vrui device driver for third-party software such as VRPN [13] or OpenTracker [14] would be straightforward.

3.4 Tool Manager

Besides virtual input devices, the other component of Vrui's input and interaction abstraction layer is tools. Tools describe environment-specific mappings from virtual input devices to application behavior, and enable the portability of Vrui applications between VR environments with different input device setups. As opposed to the VR
toolkits discussed in Section 2, Vrui applications do not directly reference virtual input devices, but instead work with tools that have a defined meaning. For example, an application that wants to allow a user to drag 3D objects does not install event callbacks with some pre-defined input device, but works instead with a dragging tool. Tools are created/destroyed at run-time by the user from inside the VR environment, and can be associated with arbitrary input devices. From the application's point of view, tools offer callbacks related to the tool's meaning, e. g., a dragging tool offers a dragStart callback to notify the application that a user has initiated a dragging operation at some location, a drag callback with an incremental transformation, and a dragEnd callback to notify that the user has released the dragged object. Vrui defines a set of operations common to many VR applications, such as locator tools (one-shot events happening at some position/orientation in space), the aforementioned dragging tools, navigation tools affecting the transformation from application model coordinates to physical coordinates, and some tools used internally by Vrui, such as popup menu selectors and GUI widget interactors.

The tool approach has several distinct advantages over directly using input devices. For one, it allows users to dynamically change the association of application functions and input devices at run-time. At different stages of using an application, users might want to map different functions to the (limited) set of buttons/valuators. Alternatively, an application might provide several different interaction modes, e. g., creating different kinds of 3D objects using similar interactions in a CAD program, and instead of a user having to switch back and forth between these modes, several separate tools, each connected to a different mode, can be mapped to different buttons/valuators at the same time. In the CAD program example, a user might want to create shapes with one button, and edit paths with another button. Our experience shows that this flexibility makes complex VR applications more efficient to use. As a side effect, the dynamic nature of tools also enables dynamic changes in the input device setup, once this feature is implemented. The VR Juggler toolkit [5], for example, uses "proxy devices" to allow changing the tracking hardware underlying some device at run-time, but it does not really allow adding or removing devices, since applications do not expect that devices suddenly disappear or that new devices are added. Even if applications were to detect new devices, they would not know for which functions they should use them. The tool method solves both problems: Vrui applications expect tools to appear/disappear dynamically, so changing tracking hardware would be represented as all tools associated with the old device suddenly disappearing, and then reappearing as the user creates tools for the new device. As a corollary, new devices could be added during run-time as well, since applications do not notice new devices, but only the new tools the user associates with them.

The second benefit is that tools can be implemented specifically for each VR environment, or according to a user's tastes. For example, there are many different styles of navigating in VR environments, some of which work better in certain environments than in others, and users might prefer any of a set of alternatives.
The Vrui toolkit provides several different implementations of navigation tools and lets environment administrators and users choose which ones can be created at run-time. For example, the “mouse navigation tool” is optimized for desktop environments and uses a virtual trackball metaphor
and the mouse wheel to zoom in/out, whereas the “wand navigation tool” uses buttons on a CAVE wand for navigation and zooming, and the “glove navigation tool” uses two hands in a “space grabbing” metaphor. By selecting the best-suited tool for the intended task and the available input devices, Vrui applications can be used effectively and efficiently in any VR environment, including the desktop. The difference between the tool model and the simulators provided by other VR toolkits is that simulators can only work at the input device level. In other words, if a CAVE application uses the wand to navigate, the simulator requires the user to navigate by moving a representation of the wand using mouse or keyboard. Although the application’s wand navigation metaphor might be intuitive in its native environment, it is highly unlikely that the indirection of using the mouse to control a 6-DOF device will be intuitive at all. Since Vrui tools know their purpose in the context of an application, it is possible to create intuitive mappings even in very restricted setups. The third benefit is that tools are neither part of Vrui itself (although Vrui provides a set of standard tools) nor part of Vrui applications (although applications can create custom tools). Instead, they are provided as external plug-ins that are loaded into the Vrui run-time environment on demand. This means that any third party can add new tools, and entire new tool functionality classes, without having to change the Vrui toolkit or any applications, and that tools create a consistent “look and feel” across VR applications. For example, one of our developers created an independent “VNC tool,” which allows a user to connect to, and control, a remote desktop from inside a VR environment using the VNC protocol. This tool is not part of Vrui itself, but can be used by any Vrui applications. The fourth benefit of tools is that they can be used to process input device data, akin to the “filters” provided by the Syzygy toolkit [2], but under control of the user from within the VR environment. Since virtual input devices in Vrui can be created/destroyed dynamically, a tool can create a private device upon initialization, and forward the location/button/valuator data from its source device to the private device after arbitrary processing. Users can then associate any tools they need with the private device in the same manner as for “real” virtual input devices. For example, a “fishing rod tool” could translate the origin of its private device along one axis of its source device under control of a valuator on the source device, and any application function would happen at the extended position without any additional support from either Vrui or an application. A more common example is a tool that intersects the pointing direction of its source device with any of the screens contained in a VR environment, and places a private input device at the intersection point. This tool is very useful in desktop environments, where it maps the position of the mouse onto the screen plane for arbitrary 3D interactions. Another example are tools mapping desktop devices such as joysticks or spaceballs into 3D space. Such devices are represented as virtual input devices without 6-DOF location, and the mapping tools convert the values of the joystick/spaceball axes into 3D translations and rotations that are then applied to a private input device representing the current 6-DOF location of the joystick/spaceball. 
Integrated into these tools are “native” navigation metaphors that allow 3D navigation without going through an intermediate private input device, such as a helicopter or airplane metaphor for joysticks, and a translation/rotation metaphor for spaceballs. Our experience shows that Vrui’s direct support
for desktop devices plays a major role in the effectiveness of Vrui applications on the desktop.

The last benefit of tools is that they offer an elegant mechanism for providing 3D interaction widgets. Many 3D graphics toolkits aimed at the desktop provide such widgets to allow users to change the position/orientation of 3D objects by using only mouse and keyboard to interact with the widgets' graphical representations. There are usually several types of widgets, representing more or less constrained modes of motion. For example, the Open Inventor toolkit [15] provides a "box dragger" to translate/rotate by dragging axes or faces of a 3D box with the mouse, and a "translation dragger" to translate along an axis by dragging a 3D arrow. In these toolkits, interaction widgets have to be created by the application at the appropriate times, leading to some extra effort for the programmer. In Vrui, interaction widgets are a special case of tools and are invisible to applications. Vrui allows creating "unbound" virtual input devices that are not associated with a device driver or a tool. These unbound devices have a graphical representation, and Vrui provides input device tools implementing different metaphors for interacting with those devices. Unbound devices are not only useful in desktop environments, but also in immersive environments. Users can treat them like "third hands" or "clamps" by associating arbitrary tools with them, and picking them up, moving them, and dropping them using an input device tool associated with a "real" input device. For many interactions, unbound devices offer a very flexible shortcut mechanism that our users have recently begun to explore.

The Vrui Tool Manager is responsible for maintaining the set of available tool classes and the set of currently instantiated tools, and for notifying Vrui applications of tool creation/destruction. In the current version of Vrui, tools are created by pressing a button (or changing the value of a valuator) that does not currently have a tool associated with it. This pops up a menu listing the available tool classes, ordered in a hierarchy by function. Users can select a tool class from the menu, and press any additional buttons/valuators they want to assign the tool to in the case of multi-button/valuator tools. Upon releasing the initially pressed button/valuator, the tool manager creates a new tool of the selected class and associates it with all pressed buttons/valuators. Tools can be deleted by moving their input device into a designated "tool trash area" and pressing the button/valuator they are associated with. The position/size of the tool trash area is configured along with the Vrui environment. In desktop environments, it is typically a small box in the lower right-hand corner of the display window; in head-tracked immersive environments it is typically a box located at a fixed position relative to the user's head. In effect, changing tools is a quick and intuitive process, and makes it easy to interact with complex programs using input devices with only a small number of buttons. In the extreme, any Vrui application could be used effectively with only a single input device with a single button.

3.5 Input Graph Manager

The ability of Vrui tools to create private virtual input devices, and the existence of unbound devices that can be picked up and dropped by tools, requires special care to update input device and tool state in the proper order during the kernel frame.
Conceptually, input devices and tools form a data flow network, or more precisely a bipartite
directed graph, shown in Figure 1. At the bottom level are the "real" virtual input devices directly connected to a device driver, any currently unbound virtual input devices, and the tools associated with either. On the next level are all virtual input devices currently owned by any tools on the bottom level, and all tools associated with those devices, and so forth. In general, each tool is on the same level as the highest-level input device it is associated with, and each input device is one level higher than the tool owning it. The Input Graph Manager is responsible for maintaining the data flow graph as input devices or tools are created/destroyed, and as unbound virtual input devices are grabbed/released by tools.
Fig. 1. Diagram of Vrui’s input graph. The Input Device Manager on the left permanently owns all “real” input devices on the first level of the graph. Tools on the first level are connected to buttons on first-level input devices. Input devices on the second level are permanently or temporarily owned by tools on the first level. The middle tool in the first level is an input device tool that has currently grabbed an unbound input device.
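The level-assignment rule described in Section 3.5 can be written down compactly. The following sketch is illustrative only (the types are assumptions, not Vrui classes); it encodes just the two rules stated above: a tool sits on the level of its highest-level source device, and a device owned by a tool sits one level above that tool.

```cpp
// Illustrative sketch of the level-assignment rule for the bipartite input
// graph (virtual input devices vs. tools). Types and fields are assumptions.
#include <algorithm>
#include <vector>

struct Tool;

struct Device {
    Tool* owner = nullptr;        // tool that created this private device, or null
    int level = 1;                // "real" and unbound devices start on level 1
};

struct Tool {
    std::vector<Device*> sources; // devices whose buttons/valuators drive this tool
    int level = 1;
};

// Tools are assumed to be processed in level order, so each tool's source
// devices already carry correct levels when the tool is visited.
inline void assignToolLevel(Tool& t) {
    int maxDeviceLevel = 1;
    for (const Device* d : t.sources)
        maxDeviceLevel = std::max(maxDeviceLevel, d->level);
    t.level = maxDeviceLevel;     // a tool lives on the level of its highest source
}

inline void assignDeviceLevel(Device& d) {
    d.level = d.owner ? d.owner->level + 1 : 1; // owned devices sit one level higher
}
```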
4 Evaluation

The Vrui toolkit has been under development at our lab for about eight years. During this time we developed a large number of complex VR applications, covering such distinct areas as static 3D model visualization and walk-throughs, high-resolution terrain mapping, interactive scientific visualization [12], 3D modelling, and games. The development of Vrui has mostly been driven by the needs of application development and the arrival of new VR hardware. The number of developed applications and their portability across a wide range of VR environments show that Vrui's abstractions are effective. We have also learned that Vrui's display and distribution abstractions make it easy to create ad-hoc combined VR environments from heterogeneous components. For example, our lab recently acquired a binocular-style HMD, and it only took attaching a spare tracker to the HMD, adding an existing desktop PC as another node to our tiled display wall's cluster, and adding a few lines to Vrui's configuration file to integrate the HMD as a secondary viewer, allowing two head-tracked users to view a single 3D scene on the display wall from their respective points of view. One collaborating institution concerned with the development of input device hardware found Vrui to be their only choice for evaluating a newly developed device by testing it with several real applications. Several non-computer scientists with programming experience but no background in 3D graphics recently picked up Vrui and have been able to develop fairly complex VR
applications, showing that Vrui's API is relatively easy to learn. Two computer science graduate students were able to integrate the OpenSG scene graph architecture into Vrui as a class project, and two physics graduate students are currently developing text-to-speech and speech recognition components that will be integrated with Vrui's 3D GUI components and its tool layer.

The main advantage of Vrui compared to other VR programming toolkits is the effectiveness of its desktop environment. Instead of only being useful for debugging, VR applications are fully usable on a non-stereoscopic desktop PC without extra input devices. VR applications are usually developed and tested entirely on the desktop, and only rarely run in a VR environment before deployment. Our users often use VR applications on their desktops for previewing, and only use the (shared) VR environments for important tasks requiring high accuracy or complex interactions. Vrui has even replaced glut and Open Inventor as our desktop 3D graphics toolkit of choice for the development of rapid prototypes and algorithm testbeds. Vrui is more convenient even for programs that are never meant to be run in VR, due to its very low code overhead – a simple application to view and navigate through a 3D scene created via OpenGL requires 16 lines of code besides the rendering code – and its integrated navigation, interaction, and 3D GUI components.
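As a rough illustration of this low code overhead, a minimal viewer application could look roughly like the sketch below. The header path, class names (Vrui::Application, GLContextData), and the run() call follow the application-object pattern described in Section 3, but their exact spelling is an assumption about the API rather than a verbatim excerpt from the toolkit.

```cpp
// Hedged sketch of a minimal viewer application; the header location, class
// names, and main() pattern are assumptions modeled on Section 3, not a
// verbatim copy of Vrui's API.
#include <GL/gl.h>
#include <Vrui/Application.h>   // assumed header location

class SceneViewer : public Vrui::Application {
public:
    SceneViewer(int& argc, char**& argv) : Vrui::Application(argc, argv) {}

    virtual void display(GLContextData& contextData) const {
        // Application-supplied OpenGL code; navigation, stereo, and
        // multi-pipe handling are provided by the toolkit.
        glBegin(GL_TRIANGLES);
        glVertex3f(0.0f, 0.0f, 0.0f);
        glVertex3f(1.0f, 0.0f, 0.0f);
        glVertex3f(0.0f, 1.0f, 0.0f);
        glEnd();
    }
};

int main(int argc, char* argv[]) {
    SceneViewer app(argc, argv);  // toolkit reads the environment configuration
    app.run();                    // hands the main loop over to the kernel
    return 0;
}
```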
5 Conclusions and Future Work

We presented Vrui (Virtual Reality User Interface), a C++ development toolkit for VR applications. Vrui's main difference compared to other VR toolkits positioned at a similar level, such as CAVELib [7], VR Juggler [5], and Syzygy [2], is the higher semantic level at which it separates a VR application from the input devices available in a particular VR environment. This separation, and its implementation using a layer of tools, leads to applications that run effectively and without change on a wide range of VR environments, from desktop systems with only keyboard/mouse to immersive multi-screen environments with multiple 6-DOF input devices. Vrui contains no "desktop simulator": applications run natively on the desktop, and a recent user study [16] shows that they can be as effective as applications directly developed for the desktop.

Our immediate next goal is to release the current version of Vrui under the GNU General Public License. We have already distributed evaluation versions of Vrui to several collaborating institutions, and the current package builds easily under many versions of Linux/Unix and Mac OS X. The OS independence provided by the included lower-level libraries would make a port to Windows possible, although we are not currently planning one. The main avenues of short-term future development are adding an abstraction layer to support spatially distributed collaborative VR environments, and improvements to Vrui's user interface. The management of tools and input devices, although functional, could be improved; and the 3D GUI components, initially meant as a temporary fix until true 3D user interactions were implemented, have proven so useful that we want to provide a richer set of widget classes.
References

1. Bierbaum, A., Just, C.: Software tools for virtual reality application development. In: ACM SIGGRAPH 1998 Course 14: Applied Virtual Reality, pp. 3.1–3.45 (1998)
2. Schaeffer, B., Goudeseune, C.: Syzygy: Native PC cluster VR. In: Proc. of IEEE VR 2003, Washington, DC, p. 15. IEEE Computer Society Press, Los Alamitos (2003)
3. He, T., Kaufman, A.: Virtual input devices for 3D systems. In: Proc. of IEEE Visualization 1993, pp. 142–148. IEEE, Los Alamitos (1993)
4. Eliason, J.: The CAVEGui project (2001), http://www.evl.uic.edu/cavern/cavegui
5. Bierbaum, A., Just, C., Hartling, P., Meinert, K., Baker, A., Cruz-Neira, C.: VR Juggler: a virtual platform for virtual reality application development. In: Proc. of Virtual Reality 2001, pp. 89–96. IEEE Computer Society Press, Los Alamitos (2001)
6. Kreylos, O.: Wii controller for virtual reality (2007), http://www.youtube.com/watch?v=KyvIlKSA0BA
7. Cruz-Neira, C., Sandin, D., DeFanti, T.: Surround-screen projection-based virtual reality: the design and implementation of the CAVE. In: Proc. of SIGGRAPH 1993, Anaheim, CA, pp. 135–142. ACM Press, New York (1993)
8. Olson, E.C.: Cluster Juggler – PC cluster virtual reality. Master's thesis, Iowa State University, Ames, Iowa (2002)
9. Koutek, M., Post, F.H.: The responsive workbench simulator: a tool for application development. In: Skala, V. (ed.) Journal of WSCG 10 (2002)
10. Figueroa, P., Green, M., Watson, B.: An object oriented description of interaction techniques in virtual reality environments (2000), http://www.cs.northwestern.edu/~watsonb/docs/ski00.itlib.pdf
11. Figueroa, P., Green, M., Watson, B.: A framework for 3D interaction techniques. In: Proc. of CAD/Graphics 2001. International Academic Publishers (2001)
12. Billen, M.I., Kreylos, O., Hamann, B., Jadamec, M.A., Kellogg, L.H., Staadt, O., Sumner, D.Y.: A geoscience perspective on immersive 3D gridded data visualization. Computers and Geosciences 34, 1056–1072 (2008)
13. Taylor, R.M., Hudson, T.C., Seeger, A., Weber, H., Juliano, J., Helser, A.T.: VRPN: A device-independent, network-transparent VR peripheral system. In: Proceedings of the ACM Symposium on Virtual Reality Software & Technology (VRST) (2001)
14. Reitmayr, G., Schmalstieg, D.: OpenTracker – an open software architecture for reconfigurable tracking based on XML. In: Proc. of IEEE Virtual Reality Conference 2001, p. 285 (2001)
15. Wernecke, J.: The Inventor Mentor: Programming Object-Oriented 3D Graphics with Open Inventor, 2nd edn. Addison-Wesley Longman Publishing Co., Boston (1993)
16. Billen, M.I., Kreylos, O., Kellogg, L.H., Hamann, B., Staadt, O., Sumner, D.Y., Jadamec, M.: Study of 3D visualization software for geo-science applications. Technical Report TR06-01, W. M. Keck Center for Active Visualization in the Earth Sciences (KeckCAVES), Davis, CA 95616 (2006)
Combined Registration Methods for Pose Estimation

Dong Han¹, Bodo Rosenhahn², Joachim Weickert³, and Hans-Peter Seidel⁴

¹ University of Bonn, Germany, [email protected]
² University of Hannover, Germany, [email protected]
³ Saarland University, Germany, [email protected]
⁴ Max-Planck Institute for Informatics, Germany, [email protected]
Abstract. In this work, we analyze three different registration algorithms: Chamfer distance matching, the well-known iterated closest points (ICP) algorithm, and an optic flow based registration. Their pairwise combination is investigated in the context of silhouette based pose estimation. It turns out that Chamfer matching and ICP used in combination not only perform fairly well for small offsets, but also deal with large offsets significantly better than the other combinations. We show that by applying different optimized search strategies, the computational cost can be reduced by a factor of eight. We further demonstrate the robustness of our method against simultaneous translation and rotation.
1 Introduction
Shape registration is an important technique in computer vision. It is present in many applications, such as image segmentation, object recognition and classification, motion tracking or image retrieval. The task of shape registration is to establish point-to-point correspondences between two images [1]. In the context of this paper, by registration or matching we mean to estimate the geometric transformation between the reference image and the 3D target object. As a classic method for solving correspondence problems in computer vision, shape matching has been intensively studied in recent years, see e.g. [2,3]. A survey is available in [4]. A very popular shape matching algorithm is the iterated closest points (ICP) algorithm [5], which uses explicit representations like points and curves. Some variants of the ICP algorithm were evaluated in [6] by comparing each stage of the algorithms and their speed of convergence. Another popular approach is to use optic flow which has been an intensive research field for decades, because of its capability for image sequence analysis. Horn
This work was done while the author was writing his thesis at Saarland University and the Max-Planck Institute for Informatics.
and Schunck presented in [7] a global method for building dense flow fields within a variational framework. A performance evaluation of many well-known optic flow methods is available in [8]. The Chamfer distance matching algorithm was proposed by Barrow et al. [9]. It has the nice properties of being able to deal with large offsets and of being efficient and easy to implement. Borgefors [10,11] improved Chamfer matching by using a more reasonable confidence measure and by embedding the basic algorithm into a resolution pyramid, which reduced the computational cost significantly. A recent application was presented in [12], where Gavrila used a multi-feature hierarchical algorithm to match N templates simultaneously and demonstrated the application in traffic sign detection. We explain these registration algorithms in more detail in Section 2.

In most work, a performance analysis with respect to complexity or stability, especially for registration methods as diverse as those compared here (ICP, optic flow, Chamfer), is still missing. Therefore, the main contribution of this work is to individually analyze these approaches and to evaluate their performance in all possible combinations for different rigid motions. We further investigate different variants of Chamfer matching to improve the speed without degradation in performance. As test scenario we concentrate on 2D-3D pose estimation. By pose we refer to the definition in [13] as "the transformation needed to map an object model from its own inherent coordinate system into agreement with the sensory data". In general, the task of pose estimation is to find this transformation. In the scope of this paper, we restrict the transformation to a rigid body motion. Much work on pose estimation has been done at different abstraction levels of geometric descriptors (see [14] for detailed overviews). In this work, we extend the joint pose estimation algorithm of [15] by embedding Chamfer distance matching, to be able to deal with large movements.

Section 2 summarizes the three used registration methods. The experimental setup is discussed in Section 3. The performance of the presented approaches is evaluated in Section 4. The paper is concluded with a brief summary in Section 5.
2 Registration Algorithms

2.1 Registration with ICP
The iterated closest points algorithm was introduced by Besl and McKay [5]. It is a method for aligning 3D models based on geometry, widely used for registering the outputs of 3D scanners. ICP starts with two point clouds and an initial guess for their relative rigid body transformation. The basic idea is to refine this transformation and to minimize the error by iterating the following steps:

1. Give a good assumption of the initial relative pose of the object model to the reference image.
2. Find the closest points, which results in pairs of corresponding points.
3. Calculate the transformation such that an error metric is minimized [6].
4. Go to Step 2.
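To make the iteration structure explicit, the following is a minimal sketch of such a loop for 2D point sets with a brute-force closest-point search and a translation-only alignment step. It is an illustration only, not the ICP variant of [16] used later (no k-d tree, no rotation estimate).

```cpp
// Minimal ICP sketch (2D, translation-only update) illustrating the iteration
// of Section 2.1. A real implementation would estimate a full rigid
// transformation and use a spatial data structure (e.g. a k-d tree), as in [16].
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Pt { double x, y; };

static std::size_t closestPoint(const Pt& p, const std::vector<Pt>& ref) {
    std::size_t best = 0;
    double bestD = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < ref.size(); ++i) {
        double dx = ref[i].x - p.x, dy = ref[i].y - p.y;
        double d = dx*dx + dy*dy;
        if (d < bestD) { bestD = d; best = i; }
    }
    return best;
}

// Iteratively translate 'model' towards 'ref'.
void icpTranslation(std::vector<Pt>& model, const std::vector<Pt>& ref,
                    int maxIter = 50, double tol = 1e-6) {
    if (model.empty() || ref.empty()) return;
    for (int it = 0; it < maxIter; ++it) {
        double tx = 0.0, ty = 0.0;
        for (const Pt& p : model) {                 // step 2: find correspondences
            const Pt& q = ref[closestPoint(p, ref)];
            tx += q.x - p.x; ty += q.y - p.y;
        }
        tx /= model.size(); ty /= model.size();     // step 3: error-minimizing translation
        for (Pt& p : model) { p.x += tx; p.y += ty; }
        if (std::sqrt(tx*tx + ty*ty) < tol) break;  // converged
    }
}
```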
In [16], Zhang proposed an improved ICP algorithm to determine the correspondence set between the image contour and the 3D object pose, where he used a k-dimensional tree to partition the point sets, which reduced the computational cost of registering large image data significantly. In [15], this ICP variant was applied in a pose tracking system in order to find the point correspondences between the image silhouette and the 3D contour.

2.2 Registration with Level Set Functions and Optic Flow
The level set method was originally introduced by Dervieux and Thomasset [17] in 1979 and was popularized by a paper of Osher and Sethian [18] in 1988. Since the level set method can easily integrate further constraints, such as 2D or 3D shape priors, it has become very popular in image segmentation in recent years. In [19], Brox et al. introduced a segmentation algorithm integrating color, texture, and motion information, where nonlinear diffusion was used for feature extraction and the level set approach was applied for the total energy minimization.

The original idea of the level set formulation is to define a smooth function Φ : Ω → R, which represents a contour Γ in R^n as the set where Φ(x) = 0, bounding an open region Ω. The level set function Φ defining the image region then has the following properties: Φ(x) > 0 if x ∈ Ω1, Φ(x) < 0 if x ∈ Ω2, and Φ(x) = 0 if x ∈ Γ, where Ω1 represents the object and Ω2 represents the background. The zero-level line is the searched boundary between the two regions. This formulation has several advantages. It is invariant under topological changes of the regions, which makes it very convenient for handling occlusion. It can also be easily extended, e.g. by adding further constraints like a shape prior [20]. For the task of matching, the object contour C is formulated with the level set function introduced in [21]:

Φ(x) =  D(x, C)   for x inside C
Φ(x) = −D(x, C)   for x outside C
Φ(x) =  0         for x ∈ C

where D(x, C) denotes the Euclidean distance of x ∈ Ω to the closest point x̃ on the contour C. Obviously, this level set formulation is invariant under rigid body motion. For matching the distance transformed images, instead of the straightforward distance measure

d(Φ1, Φ2) = ∫_Ω (Φ1(x) − Φ2(x))² dx,

it is suggested in [21] to find the optimized transformation which minimizes the energy functional

E(τ) = ∫_Ω [ (Φ1(x) − Φ2(τx))² + α (|∇u|² + |∇v|²) ] dx,   (1)

where τx = x + w(x) and w(x) := (u(x), v(x)) is the displacement vector. α is a regularization parameter weighting the smoothness term. Minimization of the functional yields an estimate of the shape deformation field w, which comes down to an optic flow estimation problem. In the context of pose estimation, optic flow was computed between the distance transform of the image contour
and the distance transform of the projected model contour, so as to obtain additional correspondences of 2D points in successive images. This improves the 3D registration.
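As an illustration of the level set representation Φ used above, the following sketch computes a brute-force signed Euclidean distance function from a binary object mask (positive inside, negative outside, zero on the contour). It is for illustration only; a practical implementation would use a fast distance transform, and all names are assumptions.

```cpp
// Brute-force signed distance function Phi for a non-empty binary mask
// (inside > 0, outside < 0, zero on the contour), as in Section 2.2.
#include <algorithm>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

typedef std::vector<std::vector<double> > Grid;
typedef std::vector<std::vector<bool> > Mask;

static bool onContour(const Mask& in, int x, int y) {
    if (!in[y][x]) return false;                    // contour pixels lie inside...
    const int W = (int)in[0].size(), H = (int)in.size();
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {          // ...but touch the background
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= W || ny >= H || !in[ny][nx]) return true;
        }
    return false;
}

Grid signedDistance(const Mask& in) {
    const int H = (int)in.size(), W = (int)in[0].size();
    std::vector<std::pair<int, int> > contour;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            if (onContour(in, x, y)) contour.push_back(std::make_pair(x, y));
    Grid phi(H, std::vector<double>(W, 0.0));
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            double d = std::numeric_limits<double>::max();
            for (std::size_t i = 0; i < contour.size(); ++i) {
                double dx = contour[i].first - x, dy = contour[i].second - y;
                d = std::min(d, std::sqrt(dx*dx + dy*dy));
            }
            phi[y][x] = in[y][x] ? d : -d;          // sign encodes inside/outside
        }
    return phi;
}
```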
2.3 Registration with Chamfer Distance Matching
Barrow et al. [9] proposed the Chamfer matching algorithm in 1977. The algorithm compares two images: the input image and the so-called template image. The goal is to find the best fit of edge points between them. It has several advantages, such as the ability to handle noisy or distorted images. With a complexity linear in the number of corresponding points, the algorithm is very fast.
Fig. 1. The transformed template is superimposed on the distance transformed image. The distance measure is evaluated by the r.m.s. average of the pixel values that are hit. In this case, it is (1/3) · (1/7) (3² + 3² + 3² + 4² + 4² + 3² + 3²) = 3.67.
Assume that an input image and a template image are given. Both of them are binary and pre-segmented. The images contain feature points and non-feature points. Here one does not care what exactly the specific feature is; one just separates the features from non-features. A distance transformation (DT) [10] is applied to these images, usually the Euclidean distance transform or an approximation of it [11]. The transformed template is placed over the transformed input image, as illustrated in Figure 1. This superimposition makes it possible to measure the correspondence between the contours. The values of the pixels of the transformed input image that the transformed template (also regarded as a polygon in [11]) hits measure exactly how the image differs from the template. To evaluate the matching, one just needs to consider this array of values, whose elements represent distances to the nearest feature points.

The distance measurement has been improved over the last several decades. In the original Chamfer matching, the arithmetic average was chosen as the matching measure (the so-called Chamfer distance): D = (1/n) Σ_{i=1..n} v_i, where n is the number of points in the polygon. In [11], Borgefors chose the root mean square average (r.m.s. for short): D = (1/d) √((1/n) Σ_{i=1..n} v_i²), where d is the unit distance in the distance transform. (Note: in Figure 1, d = 3.)
In [10], a comparison of four different average measurements was presented: median, arithmetic average, r.m.s., and maximum. After careful investigation, it was observed that r.m.s. generated the fewest local minima that disturbed the algorithm, therefore resulting in a more accurate convergence.
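A sketch of this r.m.s. measure is given below: the template's edge points are superimposed on the distance-transformed input image at a candidate offset, and the distance values that are hit are averaged. The function and parameter names are assumptions, and unitDistance plays the role of d above.

```cpp
// Sketch of the r.m.s. Chamfer matching measure from Section 2.3:
// superimpose the template edge points on the distance-transformed input
// image and take the r.m.s. of the distance values that are hit.
// 'unitDistance' plays the role of d (3 for the 3-4 chamfer transform).
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Point { int x, y; };

double chamferRMS(const std::vector<std::vector<int> >& distImage,
                  const std::vector<Point>& templateEdges,
                  int offsetX, int offsetY, double unitDistance = 3.0) {
    const int H = (int)distImage.size();
    const int W = H > 0 ? (int)distImage[0].size() : 0;
    double sumSq = 0.0;
    int n = 0;
    for (std::size_t i = 0; i < templateEdges.size(); ++i) {
        int x = templateEdges[i].x + offsetX;      // place the template at the candidate location
        int y = templateEdges[i].y + offsetY;
        if (x < 0 || y < 0 || x >= W || y >= H) continue;  // ignore points outside the image
        double v = distImage[y][x];                // distance value "hit" by the template
        sumSq += v * v;
        ++n;
    }
    if (n == 0) return std::numeric_limits<double>::max();
    return std::sqrt(sumSq / n) / unitDistance;    // D = (1/d) * sqrt((1/n) * sum v_i^2)
}
```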
3 Application: Silhouette Based 2D-3D Pose Estimation
The three used registration algorithms are tested in the context of silhouette based pose estimation. In the joint system, 3D objects (a non-convex teapot model is used in the experiments) are projected onto the image plane by a multi-view setup. The projection matrices are predefined. The images, which are captured by a calibrated camera system, are processed by an image segmentation algorithm. This procedure gives the silhouette of the considered object. The correspondence estimation between the initial pose and the segmented object silhouette is achieved by using the registration algorithms CM [9,11], ICP [16], OF [22] and their combinations. The resulting point correspondences are used to generate the new pose, and the updated pose is once again projected. The newly generated contour then serves as a shape prior for the segmentation procedure. The iteration between pose estimation and contour segmentation is repeated several times. Finally, the estimated pose is used for the next frame. The whole process is illustrated in Figure 2.

Fig. 2. Flow chart of the silhouette based pose estimation algorithm
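The loop of Figure 2 can be summarized in a short control-flow sketch. All types and functions below are placeholders (stubs) standing in for the actual components (segmentation, the combined matcher, and 2D-3D pose estimation); only the iteration structure is taken from the description above.

```cpp
// Control-flow sketch of the joint loop from Figure 2. Every type and
// function here is a placeholder standing in for the real components;
// only the iteration structure follows the paper.
struct Pose { /* rigid body motion parameters */ };
struct Image {};
struct Contour {};
struct Correspondences {};

// Placeholder stubs for the actual components:
Contour projectModel(const Pose&) { return Contour(); }                           // 3D model -> 2D model contour
Contour segmentWithShapePrior(const Image&, const Contour&) { return Contour(); } // shape-prior segmentation
Correspondences matchCMICPOF(const Contour&, const Contour&) { return Correspondences(); }
Pose estimatePose(const Pose& p, const Correspondences&) { return p; }            // 2D-3D pose estimation

// Iterate segmentation, registration, and pose update a few times per frame;
// the resulting pose initializes the next frame.
Pose trackFrame(const Image& frame, Pose pose, int iterations = 5) {
    for (int i = 0; i < iterations; ++i) {
        Contour modelContour = projectModel(pose);
        Contour silhouette = segmentWithShapePrior(frame, modelContour);
        Correspondences c = matchCMICPOF(silhouette, modelContour);
        pose = estimatePose(pose, c);
    }
    return pose;
}
```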
Fig. 3. Some frames from a stereo image sequence (400 frames). The first column is the initially translated image. The estimated pose is shown in the second column, which has not converged. The third column is the initialization of another translated image, for which the corresponding converged result is shown in the last column. Top row: left view. Bottom row: right view.
In the following section, the performance of ICP, optic flow, and Chamfer matching is evaluated. By simultaneously using correspondences from different registration algorithms, their combinations are also analyzed.
4 Experiments and Evaluation

4.1 Convergence Analysis of Plain 3D Translations
In this section, three different registration algorithms (ICP, optic flow, and Chamfer matching) are compared in terms of silhouette based pose estimation. A teapot model is used for the experiments. We use a similar experimental setup as described in [21]. The model silhouette is moved in 3D to be far from the object contour in the images. To this end, we compute the contour and the pose in the first frame as usual, and then we disturb the initial pose by translation and rotation. In the first experiment, we analyze the accuracy and stability of the three registration algorithms with respect to the same initial movements in translation. Figure 3 shows certain frames taken from a stereo image sequence of the teapot. The teapot is captured from the left and right cameras. Initially the pose of the teapot is disturbed in the range of [-15, 15] cm along the x, y, and z axes. Random samples of translational disturbances are generated in this interval. We test the performance of the registration methods for correspondence estimation. When the difference between the estimated pose and the correct pose is within some predefined threshold, the result is regarded as converged. In the left column of Figure 4, we show the correspondence clouds of 400 random samples of translational disturbances in [-15, 15] cm obtained by invoking ICP, optic flow, and Chamfer matching respectively. Blue stars denote the cases where the registration method succeeds in converging back to the correct pose, while
Table 1. Comparison of convergence rates (of the correspondence clouds from Fig. 4)

Matcher          Convergence rate (translation)
ICP              30.6%
OF               46.2%
CM               51.3%
ICP + OF         47.3%
CM + ICP         95.3%
CM + OF          81.8%
ICP + OF + CM    81.5%
Table 2. Convergence rates of applying CM and ICP simultaneously with different initial movements

Matcher    ±15 cm   ±17.5 cm   ±20 cm   ±22.5 cm   ±25 cm
CM + ICP   95.3%    84%        78%      69.6%      63.2%
Table 3. Comparison of convergence rates of the original Chamfer matching algorithm; its simplified version, which only convolves the central part; an improved version with a more accurate stopping criterion; and the version with another search strategy. All percentages are convergence rates.

Variant             ±7.5cm   ±15cm   ±10cm (& ICP)   ±15cm (& ICP)   Speed
CM (original)       88.8%    47.9%   > 95%           94.8%           11 f/h
CM (central part)   56.8%    27.8%   56.7%           39.9%           40 f/h
CM (new stopping)   87.8%    51.3%   98.8%           95.3%           70 f/h
CM (new search)     79.3%    48.5%   94.5%           77.8%           80 f/h
red crosses denote the unconverged cases. Considering that ICP becomes unstable when the initial pose is disturbed too much (as shown in the top left image), Chamfer matching is much better at dealing with large movements in translation. The bottom left image shows that, with an initial movement as large as [-15, 15] cm, a very wide area has converged. For this reason, we zoom in on the correspondences near the spout of the teapot in a nearly converged frame, in order to illustrate the different properties of these matchers. They are visualized in Figure 6. The small line segments between the contour points and the estimated corresponding points are sketched. They are generated by optic flow, ICP, and Chamfer matching respectively. One can see that in the middle image the line segment is perpendicular to the object silhouette, i.e. the object is still moving towards its exact location and has not yet converged. Therefore we conclude that ICP is good at dealing with tiny offsets and at improving the pose result subtly. The object in the right image, in contrast, has already converged, so that Chamfer matching at this stage would not give any further improvement. This motivates us to augment the capabilities of Chamfer matching with ICP and to demonstrate that they are actually complementary to each other.
Fig. 4. Results with random disturbances in the range of [-15, 15] cm in all three dimensions. The left column shows the convergence performance of using ICP, optic flow, and CM alone, respectively. The right column shows the convergence performance of the combination of ICP and OF, the combination of CM and ICP, and the combination of CM and OF, respectively. See text for more details. This figure is best viewed in color.
Fig. 5. Illustration of the local search during Co-convolution
Fig. 6. The spout of the teapot from the right view of a nearly converged frame. From left to right: OF, ICP and Chamfer Matching respectively.
In the second experiment, we evaluate the performance of different combinations of registration algorithms in dealing with large movements in translation. These algorithms can easily be combined in the sense of adding up their respective estimated point correspondences. The right column of Figure 4 shows the correspondence clouds obtained by applying certain combinations of ICP, optic flow, and Chamfer matching. Table 1 shows the convergence rates of all three different matchers and their combinations. The combination of ICP and CM gives the best result: 95.3% of the random instances converged with an initial movement of [-15, 15] cm. After further experiments, we observed that even with a movement of [-25, 25] cm, a convergence rate of about 60% could be achieved by applying CM and ICP simultaneously. Table 2 shows the decreasing convergence rate with increasing movement from [-15, 15] cm to [-25, 25] cm.

4.2 Variants for Chamfer Matching
In the Chamfer distance matching algorithm, we define an operation called Co-convolve, which takes two matrices and goes through the larger matrix by superimposing the smaller matrix at all possible locations. The boundary area is filled with zeros if necessary. The whole image is involved in the Co-convolution process, and large images slow down the algorithm. In our setup (2 × 2.33 GHz AMD Opteron processor, C++ code, image size 376 × 284) we achieved 11 frames per hour. The slow convergence is partly due to the inessential regions within the image, which also deteriorate the convergence rate. One way to speed up the method is to omit the construction of the boundary pixels of the template image, i.e. to convolve only its central part. The observed speed is then four times as fast as before, but the convergence rate decreases enormously. Therefore a new stopping criterion is implemented: a local search (as illustrated in Figure 5) is performed during Co-convolution and is forced to stop when the best matching location has not improved during five consecutive iterations. This has been shown to increase the speed dramatically. The algorithm reaches a speed of 70 frames per hour when invoking only Chamfer matching in the image registration part. The combination of Chamfer matching and ICP again gives the best result.

As an alternative, another search strategy for Chamfer matching has been implemented. The method starts from the initial location as usual, which is considered as the seed. Then the eight neighbors of the seed are evaluated. The best matching location among the seed and its eight neighbors is regarded as the new seed. If this best matching location is the seed itself, the search is forced to stop. By this means, the search always progresses towards the best matching direction, which makes it even faster. It reaches a speed of about 80 frames per hour. Table 3 compares the speed and the convergence rates of applying CM alone and of applying CM and ICP together for these four different variants.

Fig. 7. Convergence rates of CM; ICP; CM+ICP; ICP+OF; and CM+ICP+OF (from left to right, from top to bottom). The x-, y-, and z-coordinates are the initial movement in translation, the initial movement in rotation, and the convergence rate in percent, respectively. This figure is best viewed in color.
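The seed-based search strategy of Section 4.2 can be sketched as follows; the cost function passed in is assumed to be a Chamfer-style matching cost (for instance the hypothetical chamferRMS sketch from Section 2.3), and all other names are likewise illustrative.

```cpp
// Sketch of the greedy seed-based search from Section 4.2: starting from the
// initial location, repeatedly move to the best of the eight neighbors and
// stop when the current seed is already the best match. 'matchCost' stands
// for a Chamfer-style cost, e.g. the chamferRMS sketch above.
#include <functional>
#include <utility>

typedef std::function<double(int, int)> CostFn;   // cost of placing the template at (x, y)

std::pair<int, int> greedySearch(int startX, int startY, const CostFn& matchCost) {
    int sx = startX, sy = startY;
    double best = matchCost(sx, sy);
    for (;;) {
        int bx = sx, by = sy;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {    // evaluate the eight neighbors
                if (dx == 0 && dy == 0) continue;
                double c = matchCost(sx + dx, sy + dy);
                if (c < best) { best = c; bx = sx + dx; by = sy + dy; }
            }
        if (bx == sx && by == sy) break;          // the seed is the best match: stop
        sx = bx; sy = by;                         // otherwise the best neighbor becomes the new seed
    }
    return std::make_pair(sx, sy);
}
```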
4.3 Combined Rotations and Translations
Whereas in the above experiments only translational movements are analyzed, we now switch to simultaneous rotations and translations. In addition to the initial movements in translation, movements in rotation around the x, y, and z axes are performed. For each image registration algorithm and its combinations, we evaluate the performance in dealing with simultaneous translation and rotation. The convergence rate of each instance of the matching algorithms is shown in a separate diagram in Figure 7. For each instance, 20 distinct initializations (all possible transformations obtained by combining movements in translation {±7.5, ±10, ±12.5, ±15, ±17.5 (cm)} and movements in rotation {±5, ±10, ±15, ±20 (degrees)}) are investigated in the experiments, where the individual results are connected with lines. The x-coordinates are scaled by 2, since we actually have movements in the whole interval, e.g. ±15 cm represents [-15, 15] cm. From the surfaces constructed by these segments, we can see which combination of the matchers performs best in a concrete situation. Generally, the single ICP algorithm performs quite well at dealing with both translations and rotations. However, its accuracy deteriorates with increasing movements. Combined with the optic flow algorithm, the precision of ICP decreases less rapidly. Obviously, the combination of ICP and Chamfer matching improves the stability and convergence behavior of the pose estimation algorithm significantly. A convergence rate of over 60% can be achieved even when the pose of the object model is disturbed in the range of [-17.5, 17.5] cm along the axes and [-20, 20] degrees around the axes.
5 Conclusions
In this work we individually analyze three registration methods (Chamfer, ICP and optic flow) and their combinations with respect to their performance for 2D-3D pose estimation. The experiments reveal that the methods have different advantages and shortcomings, but can efficiently be combined to obtain a registration algorithm which can handle large displacements in reasonable time and with reasonable accuracy. After optimization, our proposed joint algorithm runs at 70 to 80 frames per hour on a standard PC, which increases the efficiency by a factor of 7 compared to the original implementation. The evaluation results also reveal that it is not always advantageous to naively combine all available registration methods (here ICP, Chamfer and OF). Instead, a reasonable combination of registration methods which compensate for each other's drawbacks (here Chamfer and ICP) can be much better for both performance and convergence.
Local Non-planarity of Three Dimensional Surfaces for an Invertible Reconstruction: k-Cuspal Cells
Marc Rodríguez, Gaëlle Largeteau-Skapin, and Éric Andres
Laboratory XLIM, SIC Department, University of Poitiers, BP 30179, UMR CNRS 6712, 86962 Futuroscope Chasseneuil Cedex, France
Abstract. This paper addresses the problem of the maximal recognition of hyperplanes for an invertible reconstruction of 3D discrete objects. k-cuspal cells are introduced as a three dimensional extension of the discrete cusps defined by R. Breton. With k-cuspal cells, local non-planarity on discrete surfaces can be identified in a very straightforward way.
1 Introduction
For some years now, the discrete geometry community has been trying to propose an invertible reconstruction of a discrete object. A reconstruction is a transformation from the discrete to the Euclidean world that transforms a discrete object into a Euclidean one. The aim is to propose a reconstruction that is invertible (the discretisation of the reconstruction is equal to the original object) and that generates Euclidean objects with as few polygons as possible. Ideally, for instance, the discretisation of a cube should be reconstructed as a Euclidean cube with only 6 faces. This is not possible with a Marching Cubes reconstruction, which yields a number of polygons proportional to the number of discrete points, and it is something we would like to avoid when handling very big discrete objects or multi-scale discrete objects. We consider the discrete analytical framework where the reconstruction is divided into two steps: in the first step, the boundary of a 3D discrete object is decomposed into discrete plane pieces [Rev91] [And03] (discrete straight line segments in 2D), and in a second step those plane pieces are replaced by Euclidean polygons. This approach has provided very good results in 2D, especially with approaches based on parametric spaces such as the J. Vittone approach [VC00] [VC99]. The algorithm was adapted by R. Breton [BSDA03] to the standard model [And03] and generalized by M. Dexet to higher dimensions [DA08] using the topological framework of abstract cell complexes of V. Kovalevsky [Kov93]. Reconstruction in 3D is however not very convincing so far. As we can see in Fig. 1(b), an invertible reconstruction does not guarantee a natural reconstruction. When performing the recognition step, the biggest recognised piece of discrete plane may actually be too big. It does not take into account the local differential
Fig. 1. Reconstruction examples based on maximal recognition. a) The discrete object; the colored pixels are the first set recognized as a hyperplane (a line). b) The reconstructed object. c) The expected result.
behaviour of the discrete surface. To avoid this problem, we introduce, in this paper, k-cuspal cells as the three dimensional extension of the 2D discrete cusps [BSDA03] (see Fig.4). A k-cuspal cell is a cell of dimension k on the boundary of a discrete object seen as a discrete abstract cell complex. The characterisation of a cell as a cuspal cell is very easy and can be performed on the fly during the recognition process. The aim is to guide the recognition step in order to obtain a more natural reconstruction as illustrated in Fig.1(c). We start, in section 2 with some basics on discrete geometry and a recollection of 2D discrete cusps. In section 3, we introduce k-cuspal cells which are the 3D extension of the 2D cusp points. We end section 3, with several illustrations of k-cuspal cells on a set of various discrete objects. We conclude in section 4 and discuss several perspectives.
2 Preliminaries
2.1 Basic Notations
Let Z^n be the subset of the nD Euclidean space R^n that consists of all the integer coordinate points. A discrete (resp. Euclidean) point is an element of Z^n (resp. R^n). A discrete (resp. Euclidean) object is a set of discrete (resp. Euclidean) points. We denote by p_i the i-th coordinate of a point or vector p. The voxel V(p) ⊂ R^n of a discrete nD point p is defined by V(p) = [p_1 − 1/2, p_1 + 1/2] × ... × [p_n − 1/2, p_n + 1/2]. Two discrete points p and q in dimension n are k-neighbours, with 0 ≤ k ≤ n, if |p_i − q_i| ≤ 1 for 1 ≤ i ≤ n, and k ≤ n − Σ_{i=1}^{n} |p_i − q_i|. An abstract cell complex [Kov93], C = (E, B, dim), is a set E of abstract elements provided with an antisymmetric, irreflexive, and transitive binary relation B ⊂ E × E called the bounding relation, and with a dimension function dim : E → I from E into the set I of non-negative integers such that dim(e′) < dim(e″) for all pairs (e′, e″) ∈ B. A k-dimensional cell is called a k-cell.
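As a small illustration of the k-neighbour relation just defined, the sketch below tests its two conditions directly (the function name is ours, not from the paper).

#include <cstdio>
#include <cstdlib>
#include <vector>

// p and q are k-neighbours (0 <= k <= n) when every coordinate differs by at
// most 1 and k does not exceed n minus the sum of the coordinate differences.
bool areKNeighbours(const std::vector<int>& p, const std::vector<int>& q, int k) {
    const int n = static_cast<int>(p.size());
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        int d = std::abs(p[i] - q[i]);
        if (d > 1) return false;
        sum += d;
    }
    return k <= n - sum;
}

int main() {
    std::vector<int> p = {0, 0, 0}, q = {1, 0, 0};
    // q differs from p in one coordinate: they are 2-neighbours but not 3-neighbours.
    std::printf("2-neighbours: %d, 3-neighbours: %d\n",
                areKNeighbours(p, q, 2), areKNeighbours(p, q, 3));
    return 0;
}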
Fig. 2. An example of three dimensional abstract cells complexes
In a three dimensional discrete space, we describe an object by its abstract cell complex boundary. The boundary is a set of 0-cells, 1-cells and 2-cells (see Fig. 2). In classical topology, the Jordan theorem states that every non-self-intersecting boundary in the Euclidean space divides the space into an "inside" and an "outside". This theorem is easily verified for abstract cells complexes, but only with great difficulty in classical discrete spaces. We know, furthermore, that if a discrete object is a 2-connected set of discrete points then its boundary can be described as a 2-dimensional manifold [Fra95]. In practice, this is very important in three dimensional segmentation when handling several objects in an image. For instance, in medical imaging, when dealing with several organs, working within the framework of abstract cell complexes allows us to have a common 2-dimensional boundary between different organs. When reconstructing each organ we can ensure that the common boundary is identical for both reconstructions (see Fig. 3).
Fig. 3. a) Two dimensional example with several objects. b) Example of a classical boundary representation (pixels with a grid are boundary pixels for their respective region). c) Boundary representation with abstract cells complexes.
A standard discrete hyperplane [And03] of dimension n and normal vector C = (c0, ..., cn) ∈ R^{n+1} is defined as the set of lattice points p = (p1, ..., pn) ∈ Z^n such that −(Σ_{i=1}^{n} |ci|)/2 ≤ c0 + Σ_{i=1}^{n} ci pi < (Σ_{i=1}^{n} |ci|)/2, where c1 ≥ 0, or c1 = 0 and c2 ≥ 0, or ..., or c1 = c2 = ... = c_{n−1} = 0 and cn ≥ 0. Standard hyperplanes are a particular case of the analytical discrete hyperplanes defined by J.-P. Reveillès [Rev91]. Standard discrete hyperplanes are interesting because they are (n − 1)-connected objects. The abstract cell complex boundary of a discrete 2-connected object can be seen as a set of standard hyperplane pieces, as long as the grid is shifted and 0-cells are taken as integer coordinate points.
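The membership condition of a standard discrete hyperplane can be checked directly from the double inequality above. The following sketch is our own helper, not code from [And03]; the coefficient vector is passed as C = (c0, c1, ..., cn).

#include <cmath>
#include <cstdio>
#include <vector>

// A lattice point p belongs to the standard hyperplane of coefficients
// C = (c0, c1, ..., cn) when -w <= c0 + sum_i ci*pi < w with w = (sum_i |ci|)/2.
bool inStandardHyperplane(const std::vector<double>& c, const std::vector<int>& p) {
    double w = 0.0, v = c[0];
    for (std::size_t i = 0; i < p.size(); ++i) {
        w += std::fabs(c[i + 1]);
        v += c[i + 1] * p[i];
    }
    w *= 0.5;
    return -w <= v && v < w;
}

int main() {
    std::vector<double> c = {0.0, 1.0, 1.0, 1.0};   // standard plane of equation x + y + z = 0
    std::vector<int> onIt = {1, 0, -1}, offIt = {2, 0, 0};
    std::printf("%d %d\n", inStandardHyperplane(c, onIt), inStandardHyperplane(c, offIt));
    return 0;
}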
2.2 A Two Dimensional Solution to the Maximal Recognition Problem: Discrete Cusp [BSDA03]
In a two dimensional Euclidean space, a cusp is a point of a simple connected curve that does not have a tangent but that has a tangent on either side of the point (a "left" and a "right" tangent that are different). In the discrete world, everything is linear in the sense that if you have a curve then you can always decompose your curve into line segments. In [V96], A. Vialard defined a discrete tangent at a point p of a 1-connected simple curve as the longest discrete line segment centered on p and included in the curve. It is easy to see that every point (except for the end points of an open curve) is always the center of a discrete 1-connected line segment of length at least 3. This is not the case anymore if we consider 5 consecutive curve points. Accordingly, in [BSDA03], R. Breton defined as a discrete cusp a discrete point that does not have a Vialard tangent of length at least 5. Definition 1 (Discrete Cusp [BSDA03]). A discrete point p in a discrete (locally) simple curve C = p1, p2, . . . , pi−2, pi−1, p, pi+1, pi+2, . . . , pk−1, pk is a discrete cusp if the set pi−2, pi−1, p, pi+1, pi+2 is not a discrete straight line segment. Discrete cusps are particularly interesting when doing discrete analytical reconstruction. Discrete analytical 2D boundary reconstruction is performed in two steps: decompose the boundary into discrete line segments and replace each discrete line segment by a Euclidean line segment. By definition, if a discrete line segment L = p1, p2, . . . , pk−1, pk contains a discrete cusp p, then p is one of the following points: p1, p2, pk−1 or pk. In Fig. 1(a), we see an example with the 2D square. Each of the corners of the square is a discrete cusp. If we start a line recognition on the upper left corner, going clockwise, the longest possible recognized 1-connected line ends one point after the next corner (cusp). If we want the result of Fig. 1(c), the recognition needs to stop on the cusp and not one point after the cusp. Tracking down discrete cusps is very useful since it allows us to guide the discrete analytical recognition step. In the reconstructed Euclidean shape, a vertex often corresponds to a discrete cusp [BSDA03]. A discrete cusp is a local discontinuity in the discrete tangents and can easily be detected with a Freeman coding [Free70] (see Fig. 4).
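Definition 1 can also be tested by brute force rather than through Freeman codes: five points lie on a common standard discrete line exactly when some integer normal vector (a, b) keeps the values a·x + b·y within a half-open window of width |a| + |b|. The sketch below is ours, not the procedure of [BSDA03], and it assumes that trying |a|, |b| ≤ 4 is sufficient for windows of five points.

#include <array>
#include <cstdio>
#include <cstdlib>

struct Pt { int x, y; };

// The five points fit a standard line of normal (a, b) when the values a*x + b*y
// all lie in a half-open window of width |a| + |b|.
bool fitsStandardLine(const std::array<Pt, 5>& w, int a, int b) {
    int lo = a * w[0].x + b * w[0].y, hi = lo;
    for (const Pt& p : w) {
        int v = a * p.x + b * p.y;
        if (v < lo) lo = v;
        if (v > hi) hi = v;
    }
    return hi - lo < std::abs(a) + std::abs(b);
}

// The middle point of the window is a discrete cusp when no small normal vector
// makes the five points a standard line segment.
bool isDiscreteCusp(const std::array<Pt, 5>& window) {
    for (int a = -4; a <= 4; ++a)
        for (int b = -4; b <= 4; ++b) {
            if (a == 0 && b == 0) continue;
            if (fitsStandardLine(window, a, b)) return false;
        }
    return true;
}

int main() {
    std::array<Pt, 5> corner = {{{0, 0}, {1, 0}, {2, 0}, {2, 1}, {2, 2}}};
    std::array<Pt, 5> flat   = {{{0, 0}, {1, 0}, {2, 0}, {3, 0}, {4, 0}}};
    std::printf("corner is cusp: %d, straight run is cusp: %d\n",
                isDiscreteCusp(corner), isDiscreteCusp(flat));
    return 0;
}

With this predicate, the corner configuration of a square is reported as a cusp, while five consecutive points of a straight run are not.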
3 A Solution to the Three Dimensional Maximal Recognition Problem: k-Cuspal Cells
3.1 Definitions
We are now going to discuss the extension of the notion of 2D discrete cusps into 3D. The main idea behind the definition of discrete cusps on three dimensional surfaces is that orthogonal projections of a three dimensional discrete standard line are two dimensional discrete standard lines [And03]. The three dimensional
Fig. 4. Possible configurations of at most five discrete points with their Freeman code. On the left, points that are not discrete cusps; on the right, configurations corresponding to discrete cusps.
extension of discrete cusps is found in all the orthogonal sections of the discrete object boundary. We are interested in the orthogonal sections crossing the voxels by their center. The k-cells, k ∈ {0, 1}, of the two dimensional space obtained by this kind of orthogonal section represent (k + 1)-cells in the three dimensional space. Let us now define three-dimensional k-cuspal cells. Contrary to dimension 2, there are different types (dimensions) of cusps in dimension 3. In dimension two, only points (0-cells in the discrete cell complex representation) can be discrete cusps. In dimension three, we have 0-cuspal cells that are discrete points (0-cells) and 1-cuspal cells formed by a pair of discrete points (1-cells). Let us first define 1-cuspal cells. When performing an orthogonal section through the center of voxels, a three dimensional 1-cuspal cell corresponds to a discrete 2D cusp found in the orthogonal section (see Fig. 5). Definition 2 (1-cuspal cells). A 1-cell (in an abstract cell complex) is a 1-cuspal cell if and only if it belongs to the boundary of a discrete object and if its orthogonal section is a two dimensional discrete cusp. There are three kinds of 1-cuspal cells since there are three ways (directions) of performing an orthogonal section. For example, if c0 is a discrete cusp in the section X = α (α ∈ Z) then its associated 1-cell c1 is said to be a 1-cuspal cell according to X. Theorem 1. Let O ⊂ Z³ be a discrete object and F its boundary in an abstract cell complex representation. If P ⊂ F is a standard hyperplane as an abstract cell complex, then P does not contain any 1-cuspal cell. Proof. Let P be a discrete standard hyperplane of inequation −w ≤ a0 + a1x + a2y + a3z < w (with w = (a1 + a2 + a3)/2). The intersection of such a hyperplane with
Fig. 5. A 1-cuspal cell c1 corresponds to a discrete cusp c0 in an orthogonal section
an orthogonal section is a discrete two dimensional line D of inequation −w ≤ a0 + a1x + a2y < w, where w ≥ (a1 + a2)/2. That is the inequation of a thick line in Z². Therefore, no discrete cusp is found in D and so no 1-cuspal cell is found in P. A 1-cuspal cell can be seen as a 1-cell that belongs to two different hyperplanes. Each 1-cell is bounded by two 0-cells. We can see in Fig. 5 that the two 0-cells bounding c1 have the same cuspal property as c0. Therefore the cuspal property of a 1-cell is inherited by its border. Now that we have defined 1-cuspal cells, we can define 0-cuspal cells: Definition 3 (0-cuspal cells). A 0-cell of a discrete object boundary (represented by an abstract cell complex) that bounds three 1-cuspal cells according to the three different directions is a 0-cuspal cell (see Fig. 6).
Fig. 6. Example of 0-cuspal cell
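Definition 3 translates directly into a check on three directional flags attached to a 0-cell. The tiny sketch below uses our own data layout (one boolean per section direction), which is an assumption and not the paper's data structure.

#include <cstdio>

// Per-direction flags: does this 0-cell bound a 1-cuspal cell according to X, Y, Z?
struct CuspFlags { bool x, y, z; };

// A 0-cell is 0-cuspal when it bounds 1-cuspal cells according to all three directions.
bool isZeroCuspal(const CuspFlags& f) { return f.x && f.y && f.z; }

int main() {
    CuspFlags cornerOfCube = {true, true, true};
    CuspFlags pointOnEdge  = {true, true, false};
    std::printf("%d %d\n", isZeroCuspal(cornerOfCube), isZeroCuspal(pointOnEdge));
    return 0;
}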
Equivalently to what happens in dimension two, k-cuspal cells indicate 0-cells and 1-cells that cannot be at the center of a somewhat large disk (an orthogonal section of length at least 5 in the direction of the cusp) that is included in a standard analytical discrete plane. In Fig. 1, we see an example of a 3D cube. Each corner of the cube is a 0-cuspal cell and each edge of the cube is made of 1-cuspal cells. When performing three-dimensional reconstruction, the first step is a discrete plane recognition step. The boundary of the discrete object
is decomposed into discrete analytical standard 3D planes. Just as in dimension 2, 1-cuspal cells form good markers that allow us to stop the recognition process. Ideally, 1-cuspal cells and 0-cuspal cells should be on the border of the discrete analytical standard plane segments that are recognized.
3.2 Implementation and Results
A test platform was developed to represent discrete objects and their k-cuspal cells [ABL01]. SpaMod (for Spatial Modeler) is a multilevel topology based modeling software that allows at the same time discrete and Euclidean embeddings. The 1-cuspal cell detection is processed in linear time according to the number of 1-cells of the object boundary. k-cuspal Cells on Discrete Cubes. Firstly, we checked the localisation of k-cuspal cells on discrete cubes with various orientations in space. By construction, on the un-rotated cube, all edges are identified by 1-cuspal cells and vertices by 0-cuspal cells. We have then tried to find k-cuspal cells on several rotated cubes (see Fig. 7). The rotation we used is the E. Andres discrete rotation [JA95].
Fig. 7. 1-cuspal cells detection on rotated cubes
Fig. 8. Examples of 1-cuspal cell detection on discrete spheres
All the 1-cuspal cells detected are on the rotated edges of the cube. As expected, according to Theorem 1, no 1-cuspal cells have been detected on the faces of the cube. Each face of the cube is indeed a standard discrete plane. Depending on the orientation of the cube, the location of 0-cuspal cells can vary and does not always correspond to the location of the expected vertices of the cube. However, we notice that they are regularly distributed along the cube edges. This can be explained by the fact that we use discrete rotations and that our detection window (of sorts) is rather small.
Fig. 9. k-cuspal cell detection on a random polyhedron. a) and c): detection with a window of size five; b) and d): detection with a window of size seven.
Fig. 10. Examples of acute angles which are not detected with a window size smaller than seven
k-cuspal Cells on Discrete Spheres. One could expect no cuspal cells on discrete spheres. This is of course not true in general and depends highly on the size of the discrete sphere. Here, 1-cuspal cells represent local points where the curve cannot be a standard hyperplane and may be an object edge. Looking at the localisation of 1-cuspal cells on discrete spheres is interesting because spheres are locally tangent to a plane and have no edges (see Fig. 8). We apply our test on the E. Andres discrete analytical sphere model [And94]. On small spheres, many k-cuspal cells are detected, which was to be expected. When the radius grows, only a few k-cuspal cells are still detected. Usually this happens at specific discrete changes of quadrant (there are 48 quadrants in 3D that are the equivalent of the 8 octants in 2D). Random Polyhedra. When testing with the discretisation of random polyhedra, we can see in Fig. 9 that not all discrete edges are composed of our 1-cuspal cells. This was of course to be expected since the line segments we are testing are rather short (length 5). The faces of the discrete polyhedra are discrete planes. It is rather easy to extend our definition of 1-cuspal cells by increasing the window with which we do our test (the length of the line segment we consider to define a 1-cuspal cell). This of course allows us to detect smaller variations in the orientations of the faces (in the tangents on the boundary). Stanford Bunny. The last series of tests was done on the Stanford Bunny. The results are quite interesting considering the simplicity of the technique. We
Fig. 11. k-cuspal cell detection on the Stanford Bunny
can see in Fig. 11 that, even on a complex object like this, 1-cuspal cells detect many of the orientation changes.
4 Conclusion
In this paper we introduced the notion of k-cuspal cells, which are the extension to 3D of the 2D discrete cusps introduced by R. Breton [BSDA03]. k-cuspal cells locate local non-planarity on three dimensional discrete surfaces. k-cuspal cells improve the maximal recognition problem by allowing a more natural decomposition of a discrete surface into discrete plane pieces. We have added the detection of k-cuspal cells to our multi-representation modeling software SpaMod [ABL01]. We hope to be able to propose a more natural analytical reconstruction with the help of these k-cuspal cells; this is one of the goals of this work. The notion of k-cuspal cells has been developed so that it can be computed on the fly during the recognition process and so that it provides a very simple way of addressing planarity changes on discrete surfaces. This notion is enough when dealing simply with the recognition process, but as soon as we are interested in a more detailed analysis of the non-planarity of discrete surfaces, a more extended notion of such cells seems necessary. Increasing the size of the neighborhood we consider in order to characterize changes in planarity is part of our ongoing work. We would also like to find a notion of cuspal edge that identifies a discrete 3D line segment as the edge of two differently oriented discrete polygons. Right now the discrete cuspal cells we identify are not necessarily connected even though they are located on the edge. Our analytical reconstruction is defined for any dimension; we can thus imagine an n-dimensional extension of k-cuspal cells: k-cuspal cells could be recursively detected in the orthogonal sections of the space.
References
[ABL01] Andres, E., Breton, R., Lienhardt, P.: spaMod: design of a spatial modeler. In: Bertrand, G., Imiya, A., Klette, R. (eds.) Digital and Image Geometry. LNCS, vol. 2243, pp. 90–107. Springer, Heidelberg (2002)
[And94] Andres, E.: Discrete circles, rings and spheres. Computer & Graphics 18(5), 695–706 (1994)
[And03] Andres, E.: Discrete linear objects in dimension n: the standard model. Graphical Models 65, 92–111 (2003)
[BSDA03] Breton, R., Sivignon, I., Dupont, F., Andres, E.: Towards an invertible Euclidean reconstruction of a discrete object. In: Nyström, I., Sanniti di Baja, G., Svensson, S. (eds.) DGCI 2003. LNCS, vol. 2886, pp. 246–256. Springer, Heidelberg (2003)
[DA08] Dexet, M., Andres, E.: A generalized preimage for the digital analytical hyperplane recognition. Discrete Applied Mathematics (accepted, 2008)
[Fra95] Françon, J.: Topologie de Khalimsky-Kovaleski et Algorithmique Graphique. In: 1st International Workshop on Discrete Geometry for Computer Imagery, Clermont-Ferrand, France, pp. 209–217 (1995)
[Free70] Freeman, H.: Boundary encoding and processing. In: Lipkin, B.S., Rosenfeld, A. (eds.) Pictures Processing and Psychopictories, pp. 241–266. Academic, New York (1970)
[Kov93] Kovalevsky, V.: Digital geometry based on the topology of abstract cells complexes. In: Discrete Geometry for Computer Imagery 1993, pp. 259–284. University Louis-Pasteur, Strasbourg, France (1993)
[JA95] Jacob, M.-A., Andres, E.: On Discrete Rotations. In: 5th Int. Workshop on Discrete Geometry for Computer Imagery, Clermont-Ferrand, France, pp. 161–174 (September 1995)
[Rev91] Reveillès, J.-P.: Géométrie Discrète, calcul en nombres entiers et algorithmique. Thèse d'état, University Louis Pasteur, France (1991)
[SDC06] Sivignon, I., Dupont, F., Chassery, J.-M.: Reversible vectorisation of 3D digital planar curves and applications. Image Vision and Computing, 1644–1656 (2006)
[V96] Vialard, A.: Geometrical parameters extraction from discrete path. In: Discrete Geometry for Computer Imagery 1996, pp. 24–35 (1996)
[VC99] Vittone, J., Chassery, J.-M.: (n, m)-cubes and Farey nets for naive planes understanding. In: Bertrand, G., Couprie, M., Perroton, L. (eds.) DGCI 1999. LNCS, vol. 1568, pp. 76–90. Springer, Heidelberg (1999)
[VC00] Vittone, J., Chassery, J.-M.: Recognition of digital Naive planes and polyhedrization. In: Nyström, I., Sanniti di Baja, G., Borgefors, G. (eds.) DGCI 2000. LNCS, vol. 1953, pp. 296–307. Springer, Heidelberg (2000)
A New Variant of the Optimum-Path Forest Classifier
João P. Papa and Alexandre X. Falcão
Institute of Computing, State University of Campinas, Av. Albert Einstein 1216, Campinas, São Paulo, Brazil
{papa.joaopaulo,alexandre.falcao}@gmail.com
Abstract. We have shown a supervised approach for pattern classification which interprets the training samples as nodes of a complete arc-weighted graph and computes an optimum-path forest rooted at some of the closest samples between distinct classes. A new sample is classified by the label of the root which offers to it the optimum path. We propose a variant in which the training samples are the nodes of a graph whose arcs connect k-nearest neighbors in the feature space. The graph is weighted on the nodes by their probability density values (pdf) and the optimum-path forest is rooted at the maxima of the pdf. The best value of k is computed by the maximum accuracy of classification in the training set. A test sample is assigned to the class of the maximum which offers to it the optimum path. Preliminary results have shown that the proposed approach can outperform the previous one and the SVM classifier in some datasets.
1 Introduction
Pattern recognition techniques aim to find decision rules which can separate samples from distinct classes. The methods can be divided into three categories, unsupervised, supervised, and semi-supervised, according to the knowledge about the labels (classes) of the samples in a given training set. Unsupervised approaches have no prior knowledge about the labels, while supervised techniques have full information about them. Semi-supervised methods use both labeled and unlabeled samples for training. A dataset is usually divided in two parts, a training set and a test set, the first being used to project the classifier and the second for validation, by measuring its classification errors (accuracy). This process must also be repeated several times with randomly selected training and test samples to reach a conclusion about the statistics of its accuracy (robustness). Several approaches for supervised classification have been proposed under certain assumptions about the distribution of the samples in the feature space. Simple techniques can deal with linearly separable classes (Figure 1a), such as the well-known perceptrons. Piecewise linearly separable classes (Figure 1b) require more robust techniques, such as Artificial Neural Networks using Multilayer Perceptrons [1]. If the classes have some known shape, one can use Gaussian
Fig. 1. Examples of some feature spaces: (a) Linearly separable. (b) Piecewise linearly separable. (c) Non separable.
Mixture Models [2], for instance. However, if we have non separable classes (Figure 1c), Support Vector Machines [3] (SVMs) can handle them by nonlinearly mapping the samples into a higher-dimension feature space, in which the classes are assumed to be linearly separable. Unfortunately, practical applications usually involve non separable classes and the assumption of the SVMs about the linear separability in a higher-dimension space does not hold very often. Other techniques try to handle the overlapping problem between classes by taking local decisions, based on the distances between nearby samples (e.g., the k-nearest neighbors [4]). However, as far as we know, the strength of connectivity between samples in the feature space seems to not have caught much attention in supervised classification, except by our previous work [5,6]. This method interprets a training set as a complete graph (i.e., the arcs connect all pairs of nodes), in which the arcs are weighted by the distance between the feature vectors of their corresponding nodes. A path in the graph is a sequence of nodes connecting two terminal samples, each path has a value given by a path-value function (e.g., the maximum arc weight along the path), and a path is optimum when its value is minimum. The “strength of connectedness” between two samples is then inversely proportional to the value of an optimum path between them. Prototypes from all classes are identified in the training set, among the closest samples between distinct classes. The prototypes compete with each other, such that each sample is assigned to its most strongly connected prototype, forming an optimum-path forest rooted at the prototypes (an optimal partition of the training set). The classification of a new sample
identifies which tree would contain it, if it were part of the forest, and assigns to it the label of the corresponding root. Unsupervised versions of the OPF classifier have also been presented [7,8], in which a k-nn adjacency relation between samples was exploited to create the arcs of the graph whose nodes are unlabeled training samples. The nodes are weighted by their probability density values (pdf) and the value of the best k is obtained by minimizing a graph-cut measure, due to the absence of label information in the training set. The path-value function assigns the minimum pdf value along the path and its maximization for each sample outputs an optimum-path forest rooted at the maxima of the pdf. Each maximum then defines an optimum-path tree (cluster). In this work, we present a supervised approach which exploits the graph model of the unsupervised OPF classifiers [7,8] and the prior knowledge of the class labels in the training set. The best value of k is found by maximizing the accuracy of classification in the training set. The path-value function is the same of the unsupervised approaches, which elects the maxima of the pdf as prototypes. A new sample is assigned to the class of the maximum which offers to it the optimum path. The proposed method presents some theoretical advantages over ANN-MLP and SVM: (i) it handles multiple classes without modifications or extensions, (ii) it does not make assumptions about the shape and/or separability of the classes, and (iii) it runs training much faster. Preliminary results indicate that the proposed approach can outperform the traditional supervised OPF and SVM classifiers. This paper discusses related works in pattern recognition (Section 2), describes the proposed OPF classifier (Section 3), shows results that compare the proposed OPF classifier with support vector machines [3] and the traditional OPF classifier [6] using several datasets (Section 4), and states conclusions in Section 5.
2 Related Works
Artificial neural networks (ANN) [1] and support vector machines (SVM) [3] are among the most actively pursued supervised approaches in recent years. An ANN multi-layer perceptron (ANN-MLP) trained by back propagation, for example, is an unstable classifier. Its accuracy may be improved at the computational cost of using multiple classifiers and algorithms (e.g., bagging and boosting) for training classifier collections [9]. However, it seems that there is an unknown limit in the number of classifiers to avoid an undesirable degradation in accuracy [10]. ANN-MLP assumes that the classes can be separated by hyperplanes in the feature space. Such an assumption is unfortunately not valid in practice. SVM was proposed to overcome the problem by assuming it is possible to separate the classes in a higher dimensional space by optimum hyperplanes. Although SVM usually provides reasonable accuracies, its computational cost rapidly increases with the training set size and the number of support vectors. As a binary classifier, multiple SVMs are required to solve a multi-class problem [11]. Two main approaches are one-versus-all (OVA) and one-versus-one (OVO). OVA projects c SVMs to separate each class from the others. The decision is taken for the class with highest confidence value. OVO requires c(c−1)/2 SVMs by taking
into account all binary combinations between classes. The decision is usually taken by majority vote. Tang and Mazzoni [12] proposed a method to reduce the number of support vectors in the multi-class problem. Their approach suffers from slow convergence and high computational cost, because they first minimize the number of support vectors in several binary SVMs, and then share these vectors among the machines. Panda et al. [13] presented a method to reduce the training set size before computing the SVM algorithm. Their approach aims to identify and remove samples likely related to non-support vectors. However, in all SVM approaches, the assumption of separability may also not be valid in any space of finite dimension [14]. Approaches based on optimum-path forest (OPF) have been presented for supervised [6] and unsupervised [7,8] pattern classification, as described in Section 1. The proposed supervised OPF approach differs from the traditional supervised OPF [6] and the unsupervised OPF [7,8] in several aspects: (i) the traditional OPF uses a complete graph as the adjacency relation between the samples in the feature space, applies a maximum arc-weighted path-value function to evaluate the optimum-paths and estimates prototypes in the boundaries of the classes. The proposed OPF, on the other hand, exploits the k-nearest neighbor (k-nn) graph, applies a minimum node-weighted path-value function based on the pdf to evaluate the optimum-paths, and estimates prototypes at the maxima of the pdf (interior of the classes). (ii) The unsupervised OPF uses a graph-cut measure to estimate the best value of k and the proposed OPF finds the best value of k by maximizing the accuracy of classification in the training set. We are essentially extending the unsupervised OPF for supervised classification purposes. For example, in the unsupervised approach, each root defines one optimum-path tree (cluster) with a unique label. In the proposed OPF, the class of the root propagates to the remaining nodes of its optimum-path tree.
3 Proposed Optimum-Path Forest Classifier
Let Z = Z1 ∪ Z2, where Z1 and Z2 are, respectively, the training and test sets. Every sample s ∈ Z has a feature vector v(s), and d(s, t) is the distance between s and t in the feature space (e.g., d(s, t) = ‖v(t) − v(s)‖). A function λ(s) assigns the correct label i, i = 1, 2, . . . , c, of class i to any sample s ∈ Z. We aim to project a classifier from Z1 which can predict the correct label of the samples in Z2. This classifier creates a discrete optimal partition of the feature space such that any unknown sample can be classified according to this partition. Let k ≥ 1 be a fixed number for the time being. A k-nn relation Ak is defined as follows. A sample t ∈ Z1 is said to be adjacent to a sample s ∈ Z1 if t is a k-nearest neighbor of s according to d(s, t). The pair (Z1, Ak) then defines a k-nn graph for training. The arcs (s, t) are weighted by d(s, t) and the nodes s ∈ Z1 are weighted by a density value ρ(s), given by

ρ(s) = (1 / (√(2πσ²) k)) Σ_{t ∈ A_k(s)} exp(−d²(s, t) / (2σ²)),    (1)
where σ = d_f/3 and d_f is the maximum arc weight in (Z1, Ak). This parameter choice considers all nodes for density computation, since a Gaussian function covers most samples within d(s, t) ∈ [0, 3σ]. Although the density value ρ(s) is calculated with a Gaussian kernel, the use of the k-nn graph allows the proposed OPF to be robust to possible variations in the shape of the classes. A sequence of adjacent samples defines a path πt, starting at a root R(t) ∈ Z1 and ending at a sample t. A path πt = ⟨t⟩ is said to be trivial when it consists of a single node. The concatenation of a path πs and an arc (s, t) defines an extended path πs · ⟨s, t⟩. We define f(πt) such that its maximization for all nodes t ∈ Z1 results in an optimum-path forest with roots at the maxima of the pdf, forming a root set R. We expect that each class be represented by one or more roots (maxima) of the pdf. Each optimum-path tree in this forest represents the influence zone of one root r ∈ R, which is composed of samples more strongly connected to r than to any other root. We expect that the training samples of a same class be assigned (classified) to an optimum-path tree rooted at a maximum of that class. The path-value function is defined as follows:

f1(⟨t⟩) = ρ(t) if t ∈ R, and f1(⟨t⟩) = ρ(t) − δ otherwise;   f1(πs · ⟨s, t⟩) = min{f1(πs), ρ(t)}    (2)
where δ = min_{(s,t) ∈ Ak | ρ(t) ≠ ρ(s)} |ρ(t) − ρ(s)|. The root set R is obtained on-the-fly. The method uses the image foresting transform (IFT) algorithm [15] to maximize f1(πt) and obtain an optimum-path forest P — a predecessor map with no cycles that assigns to each sample t ∉ R its predecessor P(t) in the optimum path P*(t) from R, or a marker nil when t ∈ R. The IFT algorithm for (Z1, Ak) is presented below.

Algorithm 1: IFT Algorithm
Input: A k-nn graph (Z1, Ak), λ(s) for all s ∈ Z1, and path-value function f1.
Output: Label map L, path-value map V, optimum-path forest P.
Auxiliary: Priority queue Q and variable tmp.
1. For each s ∈ Z1, do
2.    P(s) ← nil, L(s) ← λ(s), V(s) ← ρ(s) − δ
3.    and insert s in Q.
4. While Q is not empty, do
5.    Remove from Q a sample s such that V(s) is
6.    maximum.
7.    If P(s) = nil, then V(s) ← ρ(s).
8.    For each t ∈ Ak(s) and V(t) < V(s), do
9.       tmp ← min{V(s), ρ(t)}.
10.      If tmp > V(t) then
11.         L(t) ← L(s), P(t) ← s, V(t) ← tmp.
12.         Update position of t in Q.
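A compact C++ sketch of the density of Eq. (1) and of the computation performed by Algorithm 1 is given below. It is our transcription, not the authors' implementation: adjacency lists stand for Ak, pred = -1 encodes nil, and a simple O(n²) scan replaces the priority queue Q.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Eq. (1): Gaussian-weighted density of a sample over its k nearest neighbours;
// dist holds the distances d(s, t) to those neighbours and df the maximum arc weight.
double density(const std::vector<double>& dist, double df) {
    const double pi = 3.14159265358979323846, sigma = df / 3.0;
    double sum = 0.0;
    for (double d : dist) sum += std::exp(-d * d / (2.0 * sigma * sigma));
    return sum / (std::sqrt(2.0 * pi * sigma * sigma) * dist.size());
}

struct Forest {
    std::vector<int> label, pred;   // L and P (pred = -1 encodes nil)
    std::vector<double> value;      // V
};

// Algorithm 1 on the k-nn graph: adj[s] lists A_k(s), rho the densities,
// lambda the true labels, delta the smallest non-zero density difference.
Forest ift(const std::vector<std::vector<int> >& adj, const std::vector<double>& rho,
           const std::vector<int>& lambda, double delta) {
    const int n = static_cast<int>(rho.size());
    Forest f;
    f.label = lambda;
    f.pred.assign(n, -1);
    f.value.resize(n);
    std::vector<bool> done(n, false);
    for (int s = 0; s < n; ++s) f.value[s] = rho[s] - delta;      // trivial paths (Line 2)
    for (int iter = 0; iter < n; ++iter) {
        int s = -1;                                               // extract the sample of maximum V (Lines 4-6)
        for (int i = 0; i < n; ++i)
            if (!done[i] && (s < 0 || f.value[i] > f.value[s])) s = i;
        done[s] = true;
        if (f.pred[s] == -1) f.value[s] = rho[s];                 // s is a root, a maximum of the pdf (Line 7)
        for (int t : adj[s]) {                                    // offer extended paths to neighbours (Lines 8-12)
            if (done[t] || f.value[t] >= f.value[s]) continue;
            double tmp = std::min(f.value[s], rho[t]);
            if (tmp > f.value[t]) { f.label[t] = f.label[s]; f.pred[t] = s; f.value[t] = tmp; }
        }
    }
    return f;
}

int main() {
    // Tiny 3-sample chain with classes 0, 0, 1 and hand-chosen densities.
    std::vector<std::vector<int> > adj = {{1}, {0, 2}, {1}};
    std::vector<double> rho = {0.9, 0.5, 0.8};
    std::vector<int> lambda = {0, 0, 1};
    Forest f = ift(adj, rho, lambda, 0.1);
    for (int i = 0; i < 3; ++i)
        std::printf("sample %d: label %d, value %.2f, pred %d\n", i, f.label[i], f.value[i], f.pred[i]);
    return 0;
}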
Initially, all paths are trivial, with values f1(⟨t⟩) = ρ(t) − δ (Line 2). The global maxima of the pdf are the first to be removed from Q. They are identified as roots of the forest by the test P(s) = nil in Line 7, where we set their correct path value f1(⟨s⟩) = V(s) = ρ(s). Each node s removed from Q offers a path πs · ⟨s, t⟩ to each adjacent node t in the loop from Line 8 to Line 12. If the path value f1(πs · ⟨s, t⟩) = min{V(s), ρ(t)} (Line 9) is better than the current path value f1(πt) = V(t) (Line 10), then πt is replaced by πs · ⟨s, t⟩ (i.e., P(t) ← s), and the path value and label of t are updated accordingly (Line 11). Local maxima of the pdf are also discovered as roots during the algorithm. The algorithm also outputs an optimum-path value map V and a label map L, wherein the true labels of the corresponding roots are propagated to every sample t. A classification error in the training set occurs when the final L(t) ≠ λ(t). We define the best value k* ∈ [1, kmax] as the one which maximizes the accuracy Acc of classification in the training set. The accuracy is defined as follows. Let NZ1(i), i = 1, 2, . . . , c, be the number of samples in Z1 from each class i. We define

e_{i,1} = FP(i) / (|Z1| − |NZ1(i)|)   and   e_{i,2} = FN(i) / |NZ1(i)|,    (3)

where FP(i) and FN(i) are the false positives and false negatives, respectively. That is, FP(i) is the number of samples from other classes that were classified as being from class i in Z1, and FN(i) is the number of samples from class i that were incorrectly classified as being from other classes in Z1. The errors e_{i,1} and e_{i,2} are used to define

E(i) = e_{i,1} + e_{i,2},    (4)

where E(i) is the partial sum error of class i. Finally, the accuracy Acc of the classification is written as

Acc = (2c − Σ_{i=1}^{c} E(i)) / (2c) = 1 − (Σ_{i=1}^{c} E(i)) / (2c).    (5)

The accuracy Acc is measured by taking into account that the classes may have different sizes in Z1 (a similar definition applies for Z2). If there are two classes, for example, with very different sizes and the classifier always assigns the label of the largest class, its accuracy will fall drastically due to the high error rate on the smallest class. It is expected that each class be represented by at least one maximum of the pdf and that L(t) = λ(t) for all t ∈ Z1 (zero classification errors in the training set). However, these properties cannot be guaranteed with path-value function f1 and the best value k*. In order to assure them, we first find the best value k* using function f1 and then execute Algorithm 1 one more time using path-value function f2 instead of f1:

f2(⟨t⟩) = ρ(t) if t ∈ R, and f2(⟨t⟩) = ρ(t) − δ otherwise;   f2(πs · ⟨s, t⟩) = −∞ if λ(t) ≠ λ(s), and min{f2(πs), ρ(t)} otherwise.    (6)
Equation 6 weights all arcs (s, t) ∈ Ak such that λ(t) ≠ λ(s) with d(s, t) = −∞, constraining optimum paths within the correct class of their nodes. The training process in our method can be summarized by Algorithm 2.

Algorithm 2: Training
Input: Training set Z1, λ(s) for all s ∈ Z1, kmax, and path-value functions f1 and f2.
Output: Label map L, path-value map V, optimum-path forest P.
Auxiliary: Variables i, k, k*, MaxAcc ← −∞, Acc, and arrays FP and FN of size c.
1. For k = 1 to kmax do
2.    Create graph (Z1, Ak) weighted on nodes by Eq. 1.
3.    Compute (L, V, P) using Algorithm 1 with f1.
4.    For each class i = 1, 2, . . . , c, do
5.       FP(i) ← 0 and FN(i) ← 0.
6.    For each sample t ∈ Z1, do
7.       If L(t) ≠ λ(t), then
8.          FP(L(t)) ← FP(L(t)) + 1.
9.          FN(λ(t)) ← FN(λ(t)) + 1.
10.   Compute Acc by Equation 5.
11.   If Acc > MaxAcc, then
12.      k* ← k and MaxAcc ← Acc.
13.   Destroy graph (Z1, Ak).
14. Create graph (Z1, Ak*) weighted on nodes by Eq. 1.
15. Compute (L, V, P) using Algorithm 1 with f2.
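The accuracy used in Line 10 of Algorithm 2 follows Eqs. (3)-(5) and can be computed as below. This is a sketch with our own naming; class labels are assumed to be 0, ..., c−1 and every class is assumed to occur in the training set.

#include <cstdio>
#include <vector>

// Balanced accuracy of Eqs. (3)-(5): per-class false positives and false negatives
// are turned into e_{i,1} and e_{i,2}, summed into E(i) and averaged over c classes.
double accuracy(const std::vector<int>& predicted, const std::vector<int>& truth, int c) {
    const int total = static_cast<int>(truth.size());
    std::vector<double> fp(c, 0.0), fn(c, 0.0), nz(c, 0.0);
    for (int i = 0; i < total; ++i) {
        nz[truth[i]] += 1.0;
        if (predicted[i] != truth[i]) {
            fp[predicted[i]] += 1.0;
            fn[truth[i]] += 1.0;
        }
    }
    double sumE = 0.0;
    for (int i = 0; i < c; ++i)
        sumE += fp[i] / (total - nz[i]) + fn[i] / nz[i];   // E(i) = e_{i,1} + e_{i,2}
    return 1.0 - sumE / (2.0 * c);
}

int main() {
    std::vector<int> truth     = {0, 0, 0, 1, 1, 1};
    std::vector<int> predicted = {0, 0, 1, 1, 1, 0};       // one error per class
    std::printf("Acc = %.3f\n", accuracy(predicted, truth, 2));
    return 0;
}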
For any sample t ∈ Z2 , we consider the k-nearest neighbors connecting t with samples s ∈ Z1 , as though t were part of the graph. Considering all possible paths from R to t, we find the optimum path P ∗ (t) with root R(t) and label t with the class λ(R(t)). This path can be identified incrementally, by evaluating the optimum cost V (t) as V (t) = max{min{V (s), ρ(t)}}, ∀s ∈ Z1 .
(7)
Let the node s∗ ∈ Z1 be the one that satisfies the above equation. Given that L(s∗ ) = λ(R(t)), the classification simply assigns L(s∗ ) to t.
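Classification of a test sample thus reduces to a single pass over its nearest training samples, as in Eq. (7). The sketch below uses our own naming, and ρ(t) is assumed to have been precomputed on those neighbours.

#include <algorithm>
#include <cstdio>
#include <vector>

// Eq. (7): among the k nearest training samples of a test sample t, pick the one
// maximising min{V(s), rho_t} and inherit its propagated label L(s).
int classify(const std::vector<double>& V, const std::vector<int>& L,
             const std::vector<int>& neighbours, double rhoT) {
    double best = -1.0;
    int bestLabel = -1;
    for (int s : neighbours) {
        double v = std::min(V[s], rhoT);
        if (v > best) { best = v; bestLabel = L[s]; }
    }
    return bestLabel;
}

int main() {
    std::vector<double> V = {0.9, 0.5, 0.8};   // optimum-path values from training
    std::vector<int> L = {0, 0, 1};            // labels propagated by the forest
    std::vector<int> knn = {0, 2};             // nearest training samples of the test sample
    std::printf("predicted class: %d\n", classify(V, L, knn, 0.6));
    return 0;
}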
4 Evaluation
We executed the traditional OPF (OPFt), proposed OPF (OPFp) and SVM 10 times to compute their accuracies, using different randomly selected training (Z1) and test (Z2) sets. We use the LibSVM package [16] for the SVM implementation, with the Radial Basis Function (RBF) kernel, parameter optimization and the one-versus-one strategy for the multi-class problem. The experiments used some combinations of public datasets — CONE TORUS (2D points) (Figure 2a), SATURN (2D points) (Figure 2b), MPEG-7 (shapes) (Figure 3) and BRODATZ (textures)
Fig. 2. 2D points dataset: (a) CONE TORUS and (b) SATURN
Fig. 3. Samples from the MPEG-7 shape dataset: (a)-(c) Fish and (d)-(f) Camel
Fig. 4. Texture images from the Brodatz dataset used in our experiments. From left to right, and from top to bottom, they include: Bark, Brick, Bubbles, Grass, Leather, Pigskin, Raffia, Sand, Straw, Water, Weave, Wood, and Wool.
(Figure 4) — and descriptors — Fourier Coefficients (FC), Texture Coefficients (TC), and Moment Invariants (MI). A detailed explanation of them can be found in [17,6]. The results in Table 1 are displayed in the following format: x(y), where x and y are, respectively, mean accuracy and its standard deviation. The percentages of samples in Z1 and Z2 were 50% and 50% for all datasets. The results show that OPFp can provide better accuracies than OPFt and SVM, being about 50 times faster than SVM for training.
Table 1. Mean accuracy and standard deviation Dataset-Descriptor
OPFt
OPFp
MPEG7-FC
0.7192(0.0066) 0.7237(0.0048) 0.7140(0.0049)
MPEG7-MI
0.7676(0.0060)
BRODATZ-TC
0.8207(0.0037) 0.8517(0.0062)
0.8781(0.0070) 0.8822(0.0096) 0.8791(0.0106)
CONE TORUS-XY 0.8824(0.0113) 0.8675(0.0129) SATURN-XY
5
SVM
0.8728(0.0337)
0.9040(0.0195) 0.9100(0.0161) 0.8940(0.0265)
Conclusions
A very efficient modification of the traditional OPF classifier [6] was proposed and evaluated in some public datasets. The classification results were good and similar to those reported by the traditional OPF and SVM approaches. However, the OPF classifiers are about 50 times faster than SVM for training. It is also important to note that the good accuracy of SVM was due to parameter optimization. We are currently doing more experiments and investigating a semi-supervised approach by optimum-path forest, the applications of the OPF classifiers in Medical Image Analysis, and the use of Genetic Programming (GP) for arc-weight estimation in OPF, to combine distances from shape, color and texture descriptors.
References
1. Haykin, S.: Neural networks: A comprehensive foundation. Prentice-Hall, Englewood Cliffs (1994)
2. Reynolds, D., Rose, R.: Robust text-independent speaker identification using gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 3, 72–83 (1995)
3. Boser, B., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: 5th Workshop on Computational Learning Theory, pp. 144–152. ACM Press, New York (1992)
4. Fukunaga, K., Narendra, P.M.: A branch and bound algorithms for computing k-nearest neighbors. IEEE Transactions on Computers 24, 750–753 (1975)
5. Papa, J., Falcão, A., Miranda, P., Suzuki, C., Mascarenhas, N.: Design of robust pattern classifiers based on optimum-path forests. In: Mathematical Morphology and its Applications to Signal and Image Processing (ISMM), MCT/INPE, pp. 337–348 (2007)
6. Papa, J., Falcão, A., Suzuki, C., Mascarenhas, N.: A discrete approach for supervised pattern recognition. In: Brimkov, V.E., Barneva, R.P., Hauptman, H.A. (eds.) IWCIA 2008. LNCS, vol. 4958, pp. 136–147. Springer, Heidelberg (2008)
7. Rocha, L.M., Falcão, A.X., Meloni, L.G.P.: A robust extension of the mean shift algorithm using optimum path forest. In: 8th Intl. Workshop on Combinatorial Image Analysis, Buffalo-NY, USA, RPS, pp. 29–38 (2008) ISBN 978-981-08-0228-8
8. Cappabianco, F., Falcão, A., Rocha, L.: Clustering by optimum path forest and its application to automatic GM/WM classification in MR-T1 images of the brain. In: The Fifth IEEE Intl. Symp. on Biomedical Imaging (ISBI), Paris, France (accepted, 2008)
9. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley Interscience, Hoboken (2004)
10. Reyzin, L., Schapire, R.E.: How boosting the margin can also boost classifier complexity. In: 23th Intl. Conf. on Machine learning, pp. 753–760. ACM Press, New York (2006)
11. Duan, K., Keerthi, S.S.: Which is the best multiclass svm method? an empirical study. Multiple Classifier Systems, 278–285 (2005)
12. Tang, B., Mazzoni, D.: Multiclass reduced-set support vector machines. In: 23th ICML, pp. 921–928. ACM Press, New York (2006)
13. Panda, N., Chang, E.Y., Wu, G.: Concept boundary detection for speeding up SVMS. In: 23th Intl. Conf. on Machine learning, pp. 681–688. ACM Press, New York (2006)
14. Collobert, R., Bengio, S.: Links between perceptrons, MLPS and SVMS. In: 21th Intl. Conf. on Machine learning, p. 23. ACM Press, New York (2004)
15. Falcão, A., Stolfi, J., Lotufo, R.: The image foresting transform: theory, algorithms, and applications. IEEE TPAMI 26, 19–29 (2004)
16. Chang, C.C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
17. Montoya-Zegarra, J., Papa, J., Leite, N., Torres, R., Falcão, A.: Learning how to extract rotation-invariant and scale-invariant features from texture images. EURASIP Journal on Advances in Signal Processing 2008, 1–15 (2008)
Results on Hexagonal Tile Rewriting Grammars D.G. Thomas1 , F. Sweety2 , and T. Kalyani2 1 Department of Mathematics Madras Christian College, Chennai - 600 059
[email protected] 2 Department of Mathematics St. Joseph’s College of Engineering, Chennai - 600 119
[email protected]
Abstract. Tile rewriting grammars are a new model for defining picture languages. In this paper we propose hexagonal tile rewriting grammars (HTRG) for generating hexagonal picture languages. Closure properties of HTRG are proved for some basic operations. We compare HTRG with hexagonal tiling systems. Keywords: Hexagonal tile rewriting grammars, picture languages, Hexagonal tiling systems and locally testable languages.
1 Introduction
Image generation in formal languages can be done in many ways. For example, weighted finite automata and cellular automata are helpful to compress images and also to generate them. Using grammar systems, images can be generated by rewriting rules. Parallel and sequential array grammars were also used to generate images. In studies of hexagonal picture languages, several grammars were introduced in the literature to generate various classes of pictures. One such grammar is the hexagonal kolam array grammar (HKAG) by the Siromoneys [9], generating a class of hexagonal pictures (arrays) (HKAL). Based on the study [9], Subramanian [10] considered a hierarchy of hexagonal array grammars (HAG) generating classes of hexagonal arrays on triangular grids (HAL). Recently Dersanambika et al. [5] have defined local and recognizable hexagonal picture languages (HLOC and HREC). Tile rewriting grammars (TRG) are a new model for defining picture languages [2,3,4], combining rewriting rules with the tiling system [7]. A TRG rule is a scheme having a nonterminal symbol to the left and a local language over terminals and nonterminals to the right; that is, the right part is specified by a set of fixed-size tiles [3], while the size of a TRG rule need not be fixed. To our knowledge, this approach is novel and is able to generate an interesting gamut of pictures: grids, spirals, and in particular a language of nested frames, which is in some way the analogue of a Dyck language [3]. For other interesting studies of rectangular, hexagonal and triangular arrays, we refer to [1,6,8,12]. In this paper, we introduce hexagonal tile rewriting grammars and generate interesting images. We compare the class HTRG with hexagonal tiling systems.
We prove that this model has greater generative capacity than the hexagonal tiling systems.
2 Preliminaries
Let Σ be a finite alphabet of symbols. A hexagonal picture p over Σ is a hexagonal array of symbols of Σ. The set of all hexagonal arrays over the alphabet Σ is denoted by Σ∗∗H. A hexagonal picture language L over Σ is a subset of Σ∗∗H. Let Σ(ℓ,m,n)H be the set of hexagonal pictures of size (ℓ, m, n). For the notions of size of a hexagonal picture, hexagonal subpictures, hexagonal tiles, local hexagonal picture languages (HLOC), recognizable hexagonal picture languages (HREC), hexagonal tiling systems (HTS) and tiling recognizable hexagonal picture languages (L(HTS)), we refer to [11], which has appeared in this volume.
3 Hexagonal Tile Rewriting Grammars and Hexagonal Picture Languages
In this section, we introduce some basic definitions and hexagonal tile rewriting grammars with an example, and obtain a basic property. Definition 1. Let p be a hexagonal picture of size (ℓ, m, n). A subpicture of p at position (i, j, k) is a picture q such that, if (ℓ′, m′, n′) is the size of q, then ℓ′ ≤ ℓ, m′ ≤ m, n′ ≤ n. We denote it by q ⊑(i,j,k) p, or q ⊑ p ≡ ∃ i, j, k (q ⊑(i,j,k) p). Moreover, if q ⊑(i,j,k) p, we define coor(i,j,k)(q, p) as the set of coordinates of p where q is located. Conventionally, coor(i,j,k)(q, p) = φ if q is not a subpicture of p. If q coincides with p, we write coor(p) instead of coor(1,1,1)(p, p). Example 1. Let p be a hexagonal picture of size (3, 3, 3) and let q be a subpicture of p at position (1, 2, 1) of size (2, 2, 2). Then coor(1,2,1)(q, p) = {(1, 2, 1), (1, 2, 2), . . . , (2, 3, 2)}.
Fig. 1. Subpicture of a hexagonal picture
Definition 2. Let γ be an equivalence relation on coor(p), written (x, y, z) ∼γ (x′, y′, z′). Two subpictures q ⊑(i,j,k) p, q′ ⊑(i′,j′,k′) p are γ-equivalent, written q ∼γ q′, iff for all pairs (x, y, z) ∈ coor(i,j,k)(q, p) and (x′, y′, z′) ∈ coor(i′,j′,k′)(q′, p) it holds that (x, y, z) ∼γ (x′, y′, z′). A homogeneous C-subpicture q ⊑ p is called maximal with respect to relation γ iff for every γ-equivalent C-subpicture q′ we have coor(q, p) ∩ coor(q′, p) = φ or coor(q′, p) ⊆ coor(q, p).
Definition 3. For a picture p ∈ Σ∗∗H, the set of subpictures (or tiles) of size (g, h, k) is Bg,h,k(p) = {q ∈ Σ(g,h,k)H | q ⊑ p}. Definition 4. Consider a set of tiles ω ⊆ Σ(i,j,k)H. The locally testable hexagonal language in the strict sense defined by ω (written HLOCu(ω), 'u' stands for unbordered picture) is the set of pictures p ∈ Σ∗∗H such that B(i,j,k)(p) ⊆ ω. The locally testable hexagonal language defined by a finite set of tiles, HLOCu,eq({ω1, ω2, . . . , ωn}) ('eq' stands for equality test), is the set of pictures p ∈ Σ∗∗H such that for some x, Bi,j,k(p) = ωx. The bordered locally testable hexagonal language defined by a finite set of tiles, HLOCeq({ω1, ω2, . . . , ωn}), is the set of pictures p ∈ Σ∗∗H such that for some x, Bi,j,k(p̂) = ω̂x. Definition 5 (Substitution). If p, q, q′ are hexagonal pictures, q ⊑(i,j,k) p, and q, q′ have the same size, then p[q′/q](i,j,k) denotes the hexagonal picture obtained by replacing the occurrence of q at position (i, j, k) in p with q′. Hexagonal Tile Rewriting Grammars. Definition 6. A hexagonal tile rewriting grammar (HTRG) is a tuple (Σ ∪ {b}, N, S, R), where Σ is the terminal alphabet, {b} is the blank symbol, N is a set of nonterminal symbols, S ∈ N is the starting symbol, and R is a set of rules. R may contain two kinds of rules: Type 1: A → t, where A ∈ N, t ∈ (Σ ∪ {b} ∪ N)(g,h,k)H with g, h, k > 0; Type 2: A → ω, where A ∈ N, ω ⊆ {t | t ∈ (Σ ∪ {b} ∪ N)(x,y,z)H} with x, y, z > 0. Notice that type 1 is not a special case of type 2. Intuitively, a rule of type 1 is intended to match a subpicture of small bounded size, identical to the right part t. A rule of type 2 matches any subpicture of any size which can be tiled using all the elements t of the set ω. The blank symbol 'b' is introduced in HTRG rules in order to generate arrowhead hexagons of six types. Definition 7. Consider a hexagonal tile rewriting grammar G = (Σ ∪ {b}, N, S, R). Let p, p′ ∈ (Σ ∪ {b} ∪ N)(ℓ,m,n)H be pictures of identical size, and let γ, γ′ be equivalence relations over coor(p). We say that (p′, γ′) is derived in one step from (p, γ), written (p, γ) ⇒G (p′, γ′), iff for some A ∈ N and for some rule ρ : A → · · · ∈ R there exists in p an A-subpicture r ⊑(i,j,k) p, maximal with respect to γ, such that
948
D.G. Thomas, F. Sweety, and T. Kalyani
(i) p′ is obtained by substituting r with a picture s, i.e. p′ = p[s/r](i,j,k), where s is defined as follows:
Type 1 (fixed size): if ρ = A → t, then s = t;
Type 2 (variable size): if ρ = A → ω, then s ∈ HLOCu,eq(ω).
(ii) Let z̄ be coor(i,j,k)(r, p), and let Γ be the γ-equivalence class containing z̄. Then γ′ is equal to γ on all the equivalence classes ≠ Γ, while Γ is divided in γ′ into two equivalence classes, z̄ and its complement with respect to Γ (which is φ if z̄ = Γ). More formally, γ′ = γ \ {((x1, y1, z1), (x2, y2, z2)) : exactly one of (x1, y1, z1) and (x2, y2, z2) belongs to z̄}.

The subpicture r is named the application area of rule ρ in the derivation step. We say that (q, γ′) is derivable from (p, γ) in n steps, written (p, γ) ⇒G^n (q, γ′), iff p = q and γ = γ′ when n = 0, or there are a picture r and an equivalence relation γ′′ such that (p, γ) ⇒G^(n−1) (r, γ′′) and (r, γ′′) ⇒G (q, γ′). We use the abbreviation (p, γ) ⇒G* (q, γ′) for a derivation with n ≥ 0 steps.

Definition 8. The picture language defined by a grammar G (written L(G)) is the set of p ∈ Σ∗∗H such that, if |p| = (ℓ, m, n), then

(S^(ℓ,m,n), coor(p) × coor(p)) ⇒G* (p, γ)    (1)

where the relation γ is arbitrary. In short we write S ⇒G* p. We note that the derivation starts with an S-picture isometric with the terminal picture to be generated, and with the universal equivalence relation over the coordinates. The equivalence relations computed by each step of (1) are called geminal relations. When writing examples by hand, it is convenient to visualize the equivalence classes of a geminal relation by appending the same numerical subscript to the pixels of the application area rewritten by a derivation step.

Example 2. Let G = (Σ ∪ {b}, N, S, R), where
Σ contains the terminal 0 together with arrowhead symbols (which cannot be rendered in plain text), N = {S}, and R consists of fixed-size and variable-size rules whose right parts are small hexagonal pictures over S, 0 and the blank; the pictorial rules are omitted here.
A picture in L(G) is a hexagonal picture filled with 0's (the picture itself cannot be reproduced in plain text)
and is obtained by repeated application of the variable-size and fixed-size rules.

Example 3 (Dyck analogue). The next language can be defined by a sort of blanking. But since the terminals cannot be deleted without shearing the picture, we replace them with a character 'b' (blank or background). To obtain the grammar, we add to the rules in Example 2 two further pictorial rules: one rewriting S into a hexagonal picture over {S, ∗, X} and one rewriting X into a hexagonal picture over {S, ∗}; these rule pictures cannot be rendered in plain text and are omitted here.
A picture in L(G) is a hexagonal picture over {0, ∗} (omitted here, since it cannot be rendered in plain text).
Now we give basic properties. For this, we require the notion of arrowhead catenations. There are six arrowhead catenations defined in [9]: the upper left, upper right, left, lower left, lower right and right arrowhead catenations (each denoted by its own arrowhead symbol, which we do not reproduce here).

Theorem 1 (Closure Property). The family L(HTRG) is closed under union, arrowhead catenations, rotation about 180°, and projection.

Proof. Consider two grammars G1 = (Σ ∪ {b}, N1, A, R1) and G2 = (Σ ∪ {b}, N2, B, R2). Suppose for simplicity that N1 ∩ N2 = φ and S ∉ N1 ∪ N2.

Union: It is easy to show that the grammar G = (Σ ∪ {b}, N1 ∪ N2 ∪ {S}, S, R1 ∪ R2 ∪ R), where R consists of the two rules rewriting S into a hexagonal picture filled with A's and into a hexagonal picture filled with B's, is such that L(G) = L(G1) ∪ L(G2).

Catenation: For each of the six arrowhead catenations, let R consist of the single rule rewriting S into a hexagonal picture whose two halves are filled with A's and B's, arranged according to the catenation direction; then L(G) is the corresponding arrowhead catenation of L(G1) and L(G2). The other catenation cases are analogous.

Rotation about 180°: Construct the grammar G = (Σ ∪ {b}, N1, A, R′) where R′ is such that, if B → t ∈ R1 is a type 1 rule, then B → t^R is in R′, and if B → ω ∈ R1 is a type 2 rule, then B → ω′ is in R′, with t ∈ ω ⇒ t^R ∈ ω′. It is easy to verify that L(G) = L(G1)^R.

Projection π: Consider a grammar G = (Σ1 ∪ {b}, N, S, R) and a projection π : Σ1 → Σ2. It is possible to build a grammar G′ = (Σ2 ∪ {b}, N′, S, R′) such that L(G′) = π(L(G)). Indeed, let Σ1′ be a set of new nonterminals corresponding to the elements of Σ1. Then N′ = N ∪ Σ1′ ∪ {b} and R′ = φ(R) ∪ R′′, where φ is the alphabetical mapping that sends each symbol a ∈ Σ1 ∪ {b} to its fresh nonterminal copy (k being a fixed unused updating index) and is naturally extended to HTRG rules.
4 Comparison Results
We prove that the class of local hexagonal picture languages is strictly included in L(HTRG) and that L(HTS) ⊆ L(HTRG).

Theorem 2. L(HLOC) ⊆ L(HTRG).

Proof. Consider a local hexagonal language over Σ ∪ {b} defined by the set of allowed hexagonal tiles Δ. Here 'b' stands for the blank symbol.
Let Δ0 be the set of tiles obtained from Δ by subscripting every entry with 0, i.e. the tile with entries p0, q0, x0, y0, z0, u0, v0 is in Δ0 whenever the corresponding tile with entries p, q, x, y, z, u, v is in Δ. Then an equivalent HTRG is G = (Σ ∪ {b}, {S}, S, R) where R is the set {S → θ : θ ⊆ Δ0}.

Lemma 1. L(HLOCu,eq) ⊆ L(HTRG).

Proof. Consider a local hexagonal picture language over Σ∗∗H defined, without boundaries, by the sets of allowed tiles {ω1, ω2, . . . , ωn}, ωi ⊆ Σ(2,2,2)H. An equivalent grammar is S → ω1 / ω2 / . . . / ωn.

We now consider the notions HTSeq and HTSu,eq.

Definition 9. The hexagonal tiling systems HTSeq and HTSu,eq are the same as a HTS, with the following respective changes:
• Replace the local language involved with HLOCeq({ω1, ω2, . . . , ωn}), where each ωi is a finite set of hexagonal tiles over Γ.
• Replace the local language involved with HLOCu,eq({ω1, ω2, . . . , ωn}), where each ωi is a finite set of hexagonal tiles over Γ. In HTSu,eq there is no boundary symbol #.

Lemma 2. L(HTSeq) ≡ L(HTS).

Proof. We first prove L(HTS) ⊆ L(HTSeq). This is easy, because if we consider the hexagonal tile set ω of a HTS, by taking {ω1, ω2, . . . , ωn} = P(ω) (the power set) we obtain an equivalent HTSeq. Next we have to prove that L(HTSeq) ⊆ L(HTS). In [7], the family of languages L(HLOCeq(Ω)), where Ω is the set of tiles, is proved to be a proper subset of L(HTS). But L(HTS) is closed with respect to projection, and L(HTSeq) is the closure with respect to projection of L(HLOCeq(Ω)). Therefore, L(HTSeq) ⊆ L(HTS).

Lemma 3. L(HTSu,eq) ≡ L(HTSeq).

Proof. First we prove L(HTSeq) ⊆ L(HTSu,eq). Let T = (Σ, Γ, (ω1, ω2, . . . , ωn), π) be a HTSeq. For every hexagonal tile set ωi, separate its tiles containing the boundary symbol # (call this subset ωi′) from the other tiles (ωi′′); thus ωi = ωi′ ∪ ωi′′. Introduce a new alphabet Γ′ and a bijective mapping br : Γ → Γ′. We use the symbols in Γ′ to encode the boundary, and a new tile set δi to contain them: for every tile t in ωi′′, if there is a tile in ωi′ which overlaps with t, then encode this boundary in a new tile t′ and put it in the set δi. Consider a HTSu,eq T′ = (Σ, Γ ∪ Γ′, Ω, π′), where π′ extends π to Γ′ as follows: π′(br(a)) = π′(a) = π(a) for a ∈ Γ; ubr : Γ ∪ Γ′ → Γ is defined as ubr(a) = br⁻¹(a) if a ∈ Γ′ and ubr(a) = a otherwise, and is naturally extended to tiles and tile sets. Ω is the set

{ω : ω ⊆ ωi′′ ∪ δi ∧ ubr(ω) = ωi′′ ∧ ω ∩ δi ≠ φ ∧ 1 ≤ i ≤ n}.

The proof of L(T′) = L(T) is straightforward and is omitted.
To prove L(HTSu,eq) ⊆ L(HTSeq), let T = (Σ, Γ, {ω1, ω2, . . . , ωn}, π) be a HTSu,eq. To construct an equivalent HTSeq, we introduce the boundary tile sets δi, defined as follows: for every tile in ωi, the tiles obtained by bordering it with the boundary symbol # along each of its sides are placed in δi (the pictorial list of these bordered tiles cannot be rendered in plain text and is omitted here).
Consider a HTSeq T′ = (Σ, Γ, Ω, π) where Ω is the set {ω ∪ ωi : ω ⊆ δi ∧ ω ≠ φ ∧ 1 ≤ i ≤ n}. It is easy to show that L(T′) = L(T).

Theorem 3. L(HTS) ⊆ L(HTRG).

Proof. It follows from Theorem 1, Lemmas 1, 2 and 3, and the fact that L(HTSu,eq) is the closure of L(HLOCu,eq) with respect to projection.
References

1. Brimkov, V.E., Barneva, R.: Analytical honeycomb geometry for raster and volume graphics. The Computer Journal 48(2), 180–199 (2005)
2. Cherubini, A., Crespi Reghizzi, S., Pradella, M., San Pietro, P.: Picture languages: Tiling systems versus tile rewriting grammars. Theoretical Comp. Sci. 356, 90–103 (2006)
3. Crespi Reghizzi, S., Pradella, M.: Tile rewriting grammars. In: Ésik, Z., Fülöp, Z. (eds.) DLT 2003. LNCS, vol. 2710, pp. 206–217. Springer, Heidelberg (2003)
4. Crespi Reghizzi, S., Pradella, M.: Tile rewriting grammars and picture languages. Theoretical Comp. Sci. 340, 257–272 (2005)
5. Dersanambika, K.S., Krithivasan, K., Martin-Vide, C., Subramanian, K.G.: Hexagonal Pattern Languages. In: Klette, R., Žunić, J. (eds.) IWCIA 2004. LNCS, vol. 3322, pp. 52–64. Springer, Heidelberg (2004)
6. Deutsch, E.S.: Thinning algorithms on rectangular, hexagonal and triangular arrays. Communications of the ACM 15(9), 827–837 (1972)
7. Giammarresi, D., Restivo, A.: Recognizable Picture Languages. Internat. J. Pattern Recogn. Artif. Intell. 6(2 & 3), 241–256 (1992)
8. Luczak, E., Rosenfeld, A.: Distance on a hexagonal grid. IEEE Transactions on Computers 25(5), 532–533 (1968)
9. Siromoney, G., Siromoney, R.: Hexagonal arrays and rectangular blocks. Computer Graphics and Image Processing 5, 353–381 (1976)
10. Subramanian, K.G.: Hexagonal Array Grammars. Computer Graphics and Image Processing 10, 388–394 (1979)
11. Sweety, F., Thomas, D.G., Kalyani, T.: Collage of Hexagonal Arrays. In: Bebis, G., et al. (eds.) ISVC 2008, Part I. LNCS, vol. 5358. Springer, Heidelberg (2008)
12. Wüthrich, C.A., Stucki, P.: An algorithm comparison between square- and hexagonal-based grids. Graphical Models and Image Processing 53(4), 324–339 (1991)
Lloyd's Algorithm on GPU

Cristina N. Vasconcelos1, Asla Sá2, Paulo Cezar Carvalho3, and Marcelo Gattass1,2

1 Depto. de Informática - Pontifícia Universidade Católica (PUC-Rio)
2 Tecgraf (PUC-Rio)
3 Instituto de Matemática Pura e Aplicada (IMPA)
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. The Centroidal Voronoi Diagram (CVD) is a very versatile structure, well studied in Computational Geometry. It is used as the basis for a number of applications. This paper presents a deterministic algorithm, entirely computed using graphics hardware resources, based on Lloyd’s Method for computing CVDs. While the computation of the ordinary Voronoi diagram on GPU is a well explored topic, its extension to CVDs presents some challenges that the present study intends to overcome.
1 Introduction
The Voronoi Diagram is a well-known partition of space determined by distances to a specified discrete set of points in space. Formally it is defined as follows: given an open set Ω of ℝ^d, a set of n different sites (or seeds) zi, i = 1 . . . n, and a distance function d, the Voronoi Diagram (or Tessellation) is defined as n distinct cells (or regions) Ci such that:

Ci = {w ∈ Ω | d(w, zi) < d(w, zj), for i, j = 1 . . . n, j ≠ i}    (1)
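As a plain illustration of this definition on a discrete domain, the following NumPy sketch labels every pixel of a grid with the index of its nearest site under the Euclidean metric. It is only a brute-force reference; the function name voronoi_labels and the array layout are our own choices, not part of the paper.

import numpy as np

def voronoi_labels(sites, width, height):
    # Brute-force discrete Voronoi tessellation: O(pixels * sites) work.
    # sites: (n, 2) float array of (x, y) positions inside the grid.
    ys, xs = np.mgrid[0:height, 0:width]
    px = np.stack([xs, ys], axis=-1).astype(np.float64)            # (H, W, 2)
    d2 = ((px[:, :, None, :] - sites[None, None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=-1)                                      # (H, W) site IDs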
Voronoi Diagram computation is a topic of great interest not only in Computational Geometry but also in several scientific fields. One of its important variants is the Centroidal Voronoi Diagram (CVD), a special kind of Voronoi Diagram for which the points comprising the set that generates the tessellation are also the centers of mass of the Voronoi cells. Generally speaking, CVD application is motivated by its capacity to cluster data, to select the optimal location for point placement, and its characterization as minimizer of an energy functional. Relevant theoretical and applied papers involving the computation of CVDs, whose properties have been well studied, are available in the literature [1, 2, 3]. A traditional sequential algorithm for CVD computation is Lloyd's algorithm [1], which iterates the computation of Voronoi tessellations and their regions' centroids until a convergence criterion is satisfied (similarly to optimal k-means cluster computation). Formally it is described as follows [1]: given a set Ω ⊂ ℝ^n, a positive integer k, and a probability density function ρ defined over the considered domain:
1. Initialization: select an initial set of k points {zi}, i = 1 . . . k;
2. Voronoi Tessellation: compute the Voronoi regions {Ci}, i = 1 . . . k, of Ω associated with {zi};
3. Centroid Computation: compute the mass centroids of the Voronoi regions {Ci} found in step 2. These centroids are the new set of points {zi};
4. Convergence Test: if this new set of points meets a convergence criterion, terminate; otherwise, return to step 2 (a CPU reference sketch of these four steps is given at the end of this section).

The CVD has been used in several different contexts, such as data and image compression, image segmentation and restoration, decorative mosaics, quantization, clustering analysis, optimal distribution of resources, cellular biology, statistics, studies on the territorial behavior of animals, optimal allocation of resources, grid generation, meshless computing and many others [1, 4, 5]. As a consequence of its large applicability, algorithms for an efficient and accurate construction of CVDs are of substantial interest.

Our goal is to redesign Lloyd's algorithm in order to propose an efficient parallel implementation on GPU, taking advantage of the decreasing cost of programmable graphics processing units (GPUs). The computation of Discrete Voronoi Tessellations using graphics hardware has been explored using different approaches [6, 7, 8, 9, 10], but its extension to CVD computation entirely on GPU presents some interesting challenges. Usually, in the literature, the Voronoi diagram is computed on GPU, while centroid computation and update, and the convergence test of Lloyd's algorithm, are computed on CPU, demanding a data read-back time related to passing the GPU-computed Voronoi diagram to the CPU as well as passing the new site positions computed on the CPU back to the GPU.

Modern GPU architectures are designed as multiple pipelines with massive floating-point computational power dedicated to data-parallel processing. The algorithm proposed here fulfills its architectural requirements by presenting a solution with independent data-parallel processing kernels, with no communication between data elements in each step of Lloyd's algorithm's computation.

This paper is structured as follows: the next section describes the existing methods for sequentially computing the CVD and the proposals found in the literature for computing the Voronoi diagram on GPU (Section 2). Then, an overview of the parallel algorithm suitable for current graphics hardware resources is presented (Section 3), and centroid computation is detailed (Section 4). In Section 5 we present efficiency results for different scenarios that illustrate the speed and quality of our solution compared to common CPU-GPU solutions.
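For reference, the four steps above can be written down directly on the CPU. The sketch below assumes a uniform density ρ and the Euclidean metric, reuses the voronoi_labels sketch given after Eq. (1), and uses a function name and tolerance parameter of our own choosing rather than anything prescribed by the paper.

import numpy as np

def lloyd(sites, width, height, tol=0.5, max_iter=100):
    # Plain Lloyd iteration on a discrete grid with uniform density.
    sites = np.asarray(sites, dtype=np.float64).copy()              # step 1: initial sites
    ys, xs = np.mgrid[0:height, 0:width]
    for _ in range(max_iter):
        labels = voronoi_labels(sites, width, height)               # step 2: tessellation
        new_sites = sites.copy()
        for i in range(len(sites)):                                 # step 3: centroids
            mask = labels == i
            if mask.any():
                new_sites[i] = (xs[mask].mean(), ys[mask].mean())
        if np.linalg.norm(new_sites - sites, axis=1).sum() < tol:   # step 4: convergence
            return new_sites
        sites = new_sites
    return sites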
2 Related Work
The computation of 2D and 3D Discrete Voronoi Diagrams using graphics hardware was initially proposed by Hoff et al. [6]. In their proposal, a mesh is created representing the distance function for each Voronoi site with bounded error. The distance mesh is orthogonally projected in a way that, for each sample in image space, the closest site and the distance to that site is solved by means of
hardware-implemented polygon scan-conversion and Z-buffer depth comparison. After projection, each pixel in the frame buffer stores a color-coded identification of the site to which it is closest, while the depth buffer stores the distance to that site. The evolution of programmable graphics hardware spurred the development of new methods for computing the Discrete Voronoi Diagram and its dual, the Delaunay triangulation, as can be seen in recent publications [7, 8, 9, 10].

Recently, Rong and Tan [8] proposed a novel algorithm called the Jump Flooding Algorithm (JFA), based on the idea of information propagation. This parallel algorithm solves the 2D Voronoi Diagram with almost constant time throughput regardless of the number of Voronoi sites used, but only in the final resolution adopted. The approach was later extended by the authors ([9] and [10]). We have adopted the solution proposed in [8] to implement the discrete Voronoi computation step of Lloyd's Method, as will be discussed in Section 3.

CVD computation based on Lloyd's method with a mixed CPU-GPU approach was initially proposed by Hausner [4], and formulated in the k-means context by Hall and Hart [11]. In both studies, the GPU is used to perform distance computations (computing the clusters, composed by Voronoi regions) while the CPU is responsible for computing and updating the centroids and for checking convergence at each iteration. During the cluster/Voronoi region construction step, the graphics hardware evaluates the covered space and writes the minimum metric value for each sampled point of the space in the depth buffer. It also registers the IDs of the cluster/Voronoi regions that generated those values in the color buffer, producing a texture that represents the processing space within the cluster/Voronoi regions. As the texture that stores the cluster IDs must be read back to the CPU for further processing, these methods face a huge efficiency bottleneck related to communication from the GPU to the CPU.

The centroid computation step has not been solved on GPU to date. In the literature, the closest proposal to our method consists in finding the centroids using a variant of the parallel programming pattern called the parallel reduction operator, adapted to generate multiple outputs as described in Subsection 4.1. The reduction operator pattern is widely used in GPGPU applications in cases that require generating a smaller stream (usually a single-element stream) from a larger input stream [12,13,14]. Common examples of reduction operators include +, ×, ∨, ∧, ⊗, ∪, ∩, max and min. Its design responds to GPGPU challenges, as each one of its nodes is responsible for performing partial computations, which can be seen as independent processing kernels with gather operations over previously computed values, i.e. that operate by reading the corresponding values from a texture where the previous results have been stored. Thus, while a reduction is computed over a set of n data elements in O((n/p) log n) time steps using parallel GPU hardware (with p elements processed in one time step), it would cost O(n) time steps for a sequential reduction on the CPU [12]. A variant of the reduction operator, called multiple parallel reduction, can run many reductions in parallel with O(log2 N) steps of O(MN) work [12, 13, 14, 15,
16]). It is useful for reducing an input dataset to multiple outputs, such as in the proposal presented by Fluck et al. for computing image histograms [16], and the uniquely colored object localization from natural images by Vasconcelos et al. [17]. As described in Subsection 4.2, the algorithm proposed in this paper for centroid computation can be seen as a multiple parallel regional reduction operator, and can be extended to other applications beyond CVD centroid computation.
3
Parallel Pipeline for Lloyd’s Algorithm on GPU
This section presents an overview of our proposal designed for data-parallel processing considering currently available GPU resources. The main questions to be solved are how to formulate the processing steps for parallel computing and how to define the data flow between processing steps, eliminating CPU-GPU transfers. The overview of the proposed data flow is illustrated in Figure 1. It represents the interaction between consecutive Lloyd’s Algorithm steps by producing intermediate results within GPU memory to be read at the next step of the pipeline. The following subsections describe how each step interacts with such flow.
Fig. 1. Algorithm Pipeline and Data Flow
3.1 Voronoi Tessellation
Motivated by the near-constant output rate for a varying number of sites, our study has adopted the solution proposed by Rong and Tan [8] to implement the discrete Voronoi Tessellation step of Lloyd's Method. Other GPU solutions could be used, as long as they generate a texture with the space partition. Traditionally, each Voronoi site is represented with a unique random color. In our method, we initially create the colors (IDs) of the sites using a sequential enumeration. By creating such sequential IDs, we are able to use them in the mapping algorithm created for centroid computation. This enumeration is done only once in a preprocessing step, during the creation of the sites, so that for n sites the IDs vary between 0 and n-1. Another adaptation implemented in our method is that the ID of each Voronoi site is saved using a single channel of the output texture. Observe that when Voronoi computation using graphics hardware [4] was proposed by Hausner,
representing the IDs using a single channel would limit the number of sites to 256, as in older graphics hardware each color channel was limited to 8 bits. Thus, the use of the three color channels in previous proposals was a requirement for the construction of Voronoi diagrams with more than 256 sites. However, modern GPUs offer the resource of using 32-bit float textures, providing enough precision to uniquely identify a huge set of sites using a single color channel.

In our pipeline, the Voronoi Tessellation processing step is responsible for reading site positions from a texture and computing the corresponding space tessellation. Site positions are read from a texture directly in GPU memory space rather than being passed from CPU to GPU. This processing step only reads from the texture, leaving the Centroid Computation step responsible for writing the position updates to such texture. By arranging the site data into a texture, our iteration cycle can pass its contents along the algorithm pipeline without requiring CPU intervention. In previous proposals, the CPU would calculate the new centroid positions and then create primitives that set the sites over the centroid positions found. In our algorithm, all primitives are created over the origin ((0,0) in 2D or (0,0,0) in 3D) but are translated to their positions on GPU by the Voronoi Tessellation procedure after the corresponding position of each site has been read from the texture. After a new Voronoi Tessellation is computed, the output generated is a single-channel texture with enough resolution to cover the represented space, where each sample of the space is represented with a texel carrying an identification of the Voronoi site that is closest to that sample.

3.2 Centroid Computation and Convergence Test
The second step of our pipeline receives the texture representing the Voronoi Tessellation and is responsible for generating a Centroid Matrix containing the new (x, y) or (x, y, z) coordinates of the centroids. In a textural representation, each channel of the texture can be used to save one of the centroid coordinates. Centroid computation will be detailed in Section 4. Each iteration of the proposed cycle generates a centroid matrix. Instead of overwriting the previously calculated centroid matrix, we fit them sequentially into a new texture, storing convergence history so that convergence analysis can be done by processing this texture over time. The criterion used for convergence is a threshold on the total sum of the distances between current and previous centroid positions. The total sum of distances is calculated by initially creating a 1D texture (the size of one texel per Voronoi site) which, for each texel coordinate n, stores the distance between the current and the previous positions of the Voronoi site identified with the nth ID. Both current and previous values are read from the Convergence History Texture using as texture coordinates a time counter (current iteration number) and the ID of the site. The total sum is found by repeatedly applying a reduction operator over this 1D texture that produces partial sums of the distance values, until producing a single value representing the total.
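On the CPU, the convergence criterion just described can be mimicked as follows; the pairwise summation loop imitates the log-step reduction passes a GPU would perform over the 1D distance texture (the function name and threshold parameter are ours).

import numpy as np

def converged(curr, prev, threshold):
    # curr, prev: (n, 2) arrays holding current and previous centroid positions.
    dist = np.linalg.norm(curr - prev, axis=1)     # one "texel" per Voronoi site
    while dist.size > 1:                           # repeated pairwise reduction
        if dist.size % 2:
            dist = np.append(dist, 0.0)            # pad odd lengths with a neutral value
        dist = dist[0::2] + dist[1::2]
    return dist[0] < threshold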
4 Centroid Computation
The Centroid Computation step is responsible for generating the Centroid Matrix by collecting data from a texture representing the space mapped into the Voronoi Tessellation.

4.1 Centroid Computation by Multi-dimensional Reduction
Centroid Computation can be implemented by applying a multi-dimensional reduction operator as proposed by Fluck et al. and Vasconcelos et al. [16,17]. Both methods consist of two steps: several local evaluations analyzing the Voronoi Tessellation texture against the site set, and a multiple parallel reduction to add up these partial results. Here, we follow [17] rather than [16], as it also considers texture cache patterns. Initially a base is constructed containing partial sums of the location information regarding the Voronoi regions, i.e. partial sums of their pixel coordinates. The base texture has an implicit subdivision into tiles that defines the local evaluation domains. Each tile is a grouping of size n, where n is the number of Voronoi sites. The size of the base texture may be larger than that of the Voronoi Tessellation texture as its resolution must be large enough to cover it with the tiles; thus, each of its dimensions must be an integer multiple of the corresponding tile dimension. The parallel algorithm to create the base is defined in such a way that each processing unit is associated with a single site and is responsible for producing an evaluation of the Voronoi Tessellation texture restricted to the pixels covered by a tile. More precisely, each processing unit counts how many pixels of the Voronoi Tessellation texture within the tile domain actually belong to the corresponding Voronoi region and stores the sum of their pixel positions. Thus, the ith texel of a given tile stores information regarding the count and location of the pixels that are identified with Voronoi site i. In order to do that, each processing unit sweeps the region in the Voronoi Tessellation texture associated with the current tile, keeping track of the number of pixels classified as belonging to the corresponding Voronoi region (see figure 2) and of the sums of their x and y coordinates in image space. Once the base is created, a multi-dimensional parallel reduction is used to assemble the local evaluators of each Voronoi site’s data from the different tiles into the base texture and generate a global result in a single storage space for each site, i.e. to produce a single tile output. Each site data is gathered by recursively adding the values read from the base positions corresponding to that site, and storing the number of pixels belonging to the site and the sums of their x and y coordinates from the input image. Thus, centroid position for each object is obtained by simply dividing those sums by the number of pixels. It is important to note that the method described by Vasconcelos et al. [17] creates a cell representing an object frequency in the base level by testing its represented color against each pixel covered by the corresponding tile. That means
Fig. 2. Multi-dimension Reduction Operator: (from left to right) Voronoi Tessellation; a Single Tile; Base Texture; and a Set of Reductions
that for base creation, such method performs enough operations to compare each pixel in the input image against each object color, applying a total of (nP × nO) texture reads, where nP is the number of pixels in the original image and nO is the number of objects. This number of texture reads is prohibitive in our context. While for many natural video processing applications dealing with the object localization problem the number of objects is usually limited to a few dozen, in Voronoi Tessellation applications there are usually hundreds of sites. Moreover, since Lloyd's Method is a cyclic procedure, the number of reads increases even more as it has to be multiplied by nI, the number of iterations computed by the algorithm before convergence, thus yielding a total of (nP × nO × nI) texture reads.

4.2 Centroid Computation by Multi-dimensional Regional Reduction
We have shown that the multi-dimensional reduction operator scales poorly as the number of Voronoi sites increases. To overcome this, we propose a new kind of parallel operator that we call the Multi-Dimensional Regional Reduction. The multi-dimensional reduction operator is designed as a data-gathering operator. It makes no assumption about where within the input data the relevant regions are. When used for object localization from natural videos [17], it works like a global search covering the whole frame once for each object, without any region-of-interest clue. For the CVD we are interested in processing rendered data (the previously generated Voronoi Tessellation); therefore our idea is to use the sites' primitive data as an initial guess about where the objects we are looking for are, and then create a distributed local search limited to an area around such primitives. Our method retrieves the local frequencies of the Voronoi sites by applying a total of (nR × nO × nI) texture reads, where nR is the number of pixels in a region around each primitive, used as an adjustable input parameter for the algorithm, which is expected to be much smaller than the total number of pixels nP. The proposed local optimization is based on the assumption that, for any fixed resolution of the Voronoi Tessellation texture (nP), as the number of sites (nO) grows, fewer pixels are covered by each site and such pixels are arranged around the site.
To compute such local evaluation, we have created a space subdivision hierarchy defined around each Voronoi site (see Figure 3) to be used by our algorithm. The higher level of such hierarchy is the Quadrant level. It is composed by a set of four quadrants Q0, Q1, Q2, Q3 surrounding a Voronoi site, which are placed in a left-right, bottom-up order. The area covered by the four quadrants of a site defines the region of interest within the Voronoi Tessellation texture to be analyzed by the Multi-Dimensional Regional Reduction operator when looking for the centroid of the Voronoi region corresponding to that site. By definition, the dimensions of the quadrants should be chosen to cover the maximum area expected for a single Voronoi region. The next level of the hierarchy subdivides each quadrant into regular units named patches. The set of patches inside a quadrant is placed in a left-right, bottom-up order. Each patch defines an area within a quadrant (thus, within the Voronoi Tessellation texture) to be evaluated by a single processing unit. This level provides a mechanism to distribute centroid computation into as many processing units and processors as desired. Each patch receives a unique number, Idpatch, that represents its position within the ordered set of patches related to the same Voronoi site. Such enumeration starts at the left-bottom patch of quadrant Q0 and is sequentially incremented one by one in a left-right, bottom-up order. After all the patches within a quadrant have been numbered, the enumeration continues in the next quadrant, also in a left-right, bottom-up order. The identification number of a patch (Idpatch) is determined using Equation 2:

Idpatch = Q ∗ α + y′ ∗ β + x′    (2)
where Q represents the number of the quadrant where the patch is located, varying between 0 and 3; x’ and y’ are the horizontal and vertical coordinates of the patch within its quadrant, measured in number of patches; and α and β are constants that represent respectively the number of patches within each quadrant and the number of patches per line of the quadrant. An illustrative example using α as 9 and β as 3 (thus, a 3x3 patches-per-quadrant subdivision) is shown in Figure 3. Now that the space subdivision is defined, we will describe the MultiDimensional Regional Reduction (MDRR) operator and how it is used to compute the CVD. The general similarity between MDRR and the algorithm presented in Subsection 4.1 is that both are composed by a two-step procedure where the first step is responsible for computing local evaluations and the second step is responsible for collecting such data into a well-defined storage space with global results. In both algorithms, the first processing step generates a 2D texture with each texel saving a local evaluation of the Voronoi Tessellation texture against a single Voronoi site. The significant differences between the algorithms are related to how the local domains are defined (tiles versus patches) and the overall area covered by the set of such local domains to process each site (the whole Voronoi Tessellation texture versus the Quadrants defined around each site).
Fig. 3. Quadrants (left); Patches (center); Local Evaluations Texture (right)
During the first step of the MDRR operator each processing unit is responsible for outputting a texel. The texels placed in the same column represent the results of the evaluations of the Voronoi Tessellation texture against a single Voronoi site. More precisely, the horizontal coordinate of the output texel defines the ID of the site currently being evaluated within the processing unit. Different processing units producing texels to be placed in the same column are responsible for testing the same Voronoi site but against different areas of the Voronoi Tessellation texture. Such areas are defined using the vertical coordinate of the texel, which therefore defines the space within the Voronoi Tessellation texture to be swept by the processing unit. The texture storing local evaluations is shown in the right side of Figure 3.

The area covered by each processing unit is determined by reversing the patch enumeration procedure. The output texel's vertical coordinate is used as a patch number and an image-space area within the Voronoi Tessellation texture is generated. Reversing Equation 2, as α and β are constants, can be accomplished with the following procedure: initially the patch quadrant is obtained through an integer division of the patch number by α (the number of patches within each quadrant). The remainder of this division represents the patch's number within its quadrant. This number is then divided (integer division) by β (the number of patches per line of the quadrant), so that the result represents the vertical position (y′) and the remainder is the horizontal position (x′) of the patch within the quadrant:

Q = Idpatch / α;   y′ = (Idpatch − Q ∗ α) / β;   x′ = Idpatch − Q ∗ α − y′ ∗ β;    (3)
Each patch location must be obtained in image space coordinates (pixels) in order to access the Voronoi Tessellation texture. It is possible to use the Voronoi site's pixel coordinates and the input parameter defining quadrant size to determine each quadrant's origin in pixel coordinates (Qx0, Qy0). For simplicity, we consider that the patches are square regions of pixels and that the number of pixels on each side of such square is δ. As (x′) and (y′) are measured in number of patches within a quadrant, the image space coordinate (x0, y0) of the origin (left-bottom pixel) of a patch is retrieved through the following component-wise sum:

(x0, y0) = (Qx0, Qy0) + (x′ ∗ δ, y′ ∗ δ)    (4)
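A small sketch of the enumeration and its inverse (Equations 2-4) is given below. The function names are ours, integer division stands in for the shader arithmetic, and the placement of the quadrant origins assumes that the four quadrants meet at the site pixel, which the paper does not state explicitly.

def patch_id(Q, x_p, y_p, alpha, beta):
    # Equation (2): Q in 0..3, (x_p, y_p) = patch coordinates inside the quadrant.
    return Q * alpha + y_p * beta + x_p

def patch_origin(pid, alpha, beta, site_x, site_y, quad_w, quad_h, delta):
    # Equations (3) and (4): recover the quadrant, the patch coordinates and the
    # image-space origin (left-bottom pixel) of patch `pid` around a given site.
    Q = pid // alpha                        # quadrant index
    rem = pid - Q * alpha                   # patch number inside its quadrant
    y_p = rem // beta
    x_p = rem - y_p * beta
    # assumption: Q0..Q3 laid out left-right, bottom-up around the site pixel
    qx0 = site_x - quad_w if Q in (0, 2) else site_x
    qy0 = site_y - quad_h if Q in (0, 1) else site_y
    return (qx0 + x_p * delta, qy0 + y_p * delta)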
By reversing the patch enumeration procedure, each processing unit will know which area from the input image it should cover, and then it can sweep the pixels within a patch comparing the texels read from the Voronoi Tessellation texture against the represented site’s ID. During the evaluation, it counts the frequency of pixels identified with the represented Voronoi region and sums their coordinates, saving such values in the generated texel. Finally, the Multi-dimensional Regional Reduction operator performs a reduction procedure in which the local results are added into a single line, where each position represents a single site’s data. The centroids can be retrieved from this line by dividing each coordinate sum by the total number of pixels of the represented site.
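The two MDRR passes can be imitated on the CPU as sketched below: every (patch, site) pair contributes a partial (count, sum of x, sum of y) triple, and a column-wise sum plays the role of the final reduction. The callback patch_pixels, which yields the pixel coordinates covered by a given patch of a given site (Equations 2-4), and all other names are our own framing, not the paper's API.

import numpy as np

def mdrr_centroids(labels, n_sites, patches_per_site, patch_pixels):
    # labels: (H, W) array with the Voronoi site ID of every pixel.
    base = np.zeros((patches_per_site, n_sites, 3))   # local evaluations "texture"
    for i in range(n_sites):
        for p in range(patches_per_site):
            for (x, y) in patch_pixels(i, p):         # sweep the patch area
                if labels[y, x] == i:
                    base[p, i] += (1.0, x, y)         # count and coordinate sums
    total = base.sum(axis=0)                          # reduction to a single line
    counts = np.maximum(total[:, 0], 1.0)             # guard against empty regions
    return total[:, 1:] / counts[:, None]             # (n_sites, 2) centroids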
5 Results
The quadrant sizes were chosen in order to cover a large area around each site, thus safely including the related Voronoi region. The quadrant area used was four times larger than the area obtained by dividing the number of pixels of the Voronoi Tessellation by the number of sites. From a parallel programming point of view, it is important to stress that the total number of patches times the number of Voronoi sites defines the number of individual processing units to be distributed among the multiprocessors. Besides, patch dimensions define how many pixels are read by each one of the processing units (the texture area). Thus, the patch level was designed to provide a balance mechanism among the several processors, as well as texture cache patterns that can be adjusted in order to improve performance according to the architecture of the graphics card used. To test the algorithm presented, an implementation using CUDA running over a GeForce 8600 GT was created. The timings were obtained for sets of different numbers of sites (from 128 to 128k) and two different resolutions of the Voronoi Tessellation texture (512×512 and 1024×1024). They are shown in Figure 4.

Timings, in msec, per number of sites:

Sites    Voronoi 512x512  Centroids 512x512  Voronoi 1024x1024  Centroids 1024x1024
128      58.12            13.76              266.24             39.52
256      51.08            8.48               217.12             31.86
512      51.12            12.07              188.99             28.64
768      40.29            13.78              200.74             27.46
1024     42.80            14.43              191.78             25.67
2048     42.83            13.90              161.60             31.01
4096     30.77            14.25              173.00             61.81
6144     34.00            14.17              167.51             44.64
8192     33.87            11.54              112.45             30.30
10240    19.95            7.77               116.17             38.04
12288    20.35            9.29               119.07             45.42
14336    20.76            10.81              127.35             35.27
16384    21.32            5.57               124.50             24.64
32768    22.84            10.89              77.89              26.58
65536    9.63             6.20               86.46              23.67
98304    9.65             8.37               90.91              35.19
131072   9.64             10.17              93.39              46.18
Fig. 4. Timing for Computing Voronoi Tessellations and Centroids over 512*512 and 1024*1024 Images
For testing the sets composed of 128 and 256 sites, quadrants subdivided into 9 and 4 patches, respectively, were used. For the other cases, 1:1 quadrants were used in the patch subdivision. The results have shown that by properly using the spatial subdivision hierarchy, centroid computation time is kept close to constant even if the size of the site sets varies significantly. There is still room for optimization of the implementation tested, especially exploring CUDA memory hierarchy, but the objective of the tests presented was to compare the CPU-GPU model and the proposed algorithm. The tested implementation was constructed using only GPU programming resources that could be translated into shader languages. We do not present the number of iterations before convergence because such number is intrinsic to Lloyd’s Method’s formulation. Therefore, it is expected to be the same for our GPU parallel formulation as for other CPU or CPU-GPU formulations, as long as the same initial conditions (set of sites and distance metric) are used.
6 Conclusions
This paper presented a computation of the Centroidal Voronoi Diagram through Lloyd’s Method fully adapted to GPU resources. We showed how a data flow can be constructed so that it passes data through Lloyd’s Method’s iteration steps, eliminating the CPU-GPU texture reading presented in previous solutions. In particular, we described an efficient parallel computation algorithm to compute region centroids and to test convergence. By computing these steps on GPU we eliminate the read-back time related to passing the Voronoi diagram to the CPU, as is the case of previous proposals. As future work, we plan to extend the proposed method to be used with varying distance metrics and with varying density functions. This can be obtained directly by changing the Voronoi Tessellation and by including a weight in centroid computation, respectively. As a general contribution, the proposed Multi-dimensional Regional Reduction operator combined with the space subdivision hierarchy presented ensure an almost constant time processing throughput for a varied number of sites, thus motivating its use instead of the traditional reduction operator in cases where an initial localization clue, or a region of interest, is available.
References 1. Du, Q., Faber, V., Gunzburger, M.: Centroidal voronoi tessellations: Applications and algorithms. SIAM Rev. 41, 637–676 (1999) 2. Har-Peled, S., Sadri, B.: How fast is the k-means method? In: SODA 2005: Proceedings of the 16th ACM-SIAM Symp. on Discrete algorithms, pp. 877–885 (2005) 3. Du, Q., Emelianenko, M.: Acceleration schemes for computing centroidal voronoi tessellations. Numerical Linear Algebra with Applications 13, 173–192 (2006) 4. Hausner, A.: Simulating decorative mosaics. In: SIGGRAPH 2001: Papers, pp. 573–580. ACM, New York (2001)
5. Du, Q., Gunzburger, M., Ju, L., Wang, X.: Centroidal voronoi tessellation algorithms for image compression, segmentation, and multichannel restoration. J. Math. Imaging Vis. 24, 177–194 (2006)
6. Hoff III, K.E., Culver, T., Keyser, J., Lin, M., Manocha, D.: Fast computation of generalized voronoi diagrams using graphics hardware. In: SCG 2000: Proceedings of the 16th Annual Symp. on Computational Geometry, pp. 375–376. ACM, New York (2000)
7. Denny, M.: Solving geometric optimization problems using graphics hardware. In: EUROGRAPHICS 2003. Computer Graphics Forum, vol. 22, pp. 441–451 (2003)
8. Rong, G., Tan, T.S.: Jump flooding in gpu with applications to voronoi diagram and distance transform. In: I3D 2006: Proceedings of the Symp. on Interactive 3D Graphics and Games, pp. 109–116. ACM, New York (2006)
9. Rong, G., Tan, T.S.: Variants of jump flooding algorithm for computing discrete voronoi diagrams. In: ISVD 2007: Proceedings of the 4th Int. Symp. on Voronoi Diagrams in Science and Engineering, pp. 176–181. IEEE Computer Society, Los Alamitos (2007)
10. Rong, G., Tan, T.S., Cao, T.T., Stephanus: Computing two-dimensional delaunay triangulation using graphics hardware. In: SI3D 2008: Proceedings of the 2008 Symp. on Interactive 3D Graphics and Games, pp. 89–97. ACM, New York (2008)
11. Hall, J.D., Hart, J.C.: GPU acceleration of iterative clustering. In: Manuscript accompanying poster at GP2: The ACM Workshop on General Purpose Computing on Graphics Processors, and SIGGRAPH 2004 poster. ACM, New York (2004)
12. Owens, J., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A.E., Purcell, T.: A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 26, 80–113 (2007)
13. Owens, J.: Data-parallel algorithms and data structures. In: SIGGRAPH 2007: Courses, p. 3. ACM, New York (2007)
14. Roger, D., Assarsson, U., Holzschuch, N.: Efficient stream reduction on the gpu. In: Workshop on General Purpose Processing on Graphics Processing Units (2007)
15. Krüger, J., Westermann, R.: Linear algebra operators for gpu implementation of numerical algorithms. In: SIGGRAPH 2003: Papers, pp. 908–916. ACM, New York (2003)
16. Fluck, O., Aharon, S., Cremers, D., Rousson, M.: Gpu histogram computation. In: SIGGRAPH 2006: Research Posters, p. 53. ACM, New York (2006)
17. Vasconcelos, C., Sá, A., Teixeira, L., Carvalho, P.C., Gattass, M.: Real-time video processing for multi-object chromatic tracking. In: BMVC 2008, pp. 113–123 (2008)
Computing Fundamental Group of General 3-Manifold Junho Kim1 , Miao Jin2 , Qian-Yi Zhou3, Feng Luo4 , and Xianfeng Gu5 1
Dong-Eui University, South Korea
[email protected] 2 University of Louisiana at Lafayette, USA
[email protected] 3 University of Southern California, USA
[email protected] 4 Rutgers University, USA
[email protected] 5 Stony Brook University, USA
[email protected]
Abstract. The fundamental group is one of the most important topological invariants for general manifolds, and it can be used directly for manifold classification. In this work, we provide a series of practical and efficient algorithms to compute fundamental groups for general 3-manifolds based on CW cell decomposition. The input is a tetrahedral mesh, while the output is a symbolic representation of its first fundamental group. We further simplify the fundamental group representation using computational algebraic methods. We present the theoretical arguments of our algorithms, elaborate the algorithms with a number of examples, and give the analysis of their computational complexity. Keywords: Computational topology, 3-manifold, fundamental group, CW-cell decomposition.
1 Introduction

Topology studies the properties of geometric objects which are preserved under continuous deformation. In biomedical fields, topology has been applied for classification and identification of DNA molecules, and topological changes are considered as indications of some important chemical changes [1, 2]. In the CAGD field, a rigorous, robust, and practical method to compute the topologies of general solids, in order to improve the robustness and reliability of CAGD systems required for automation of engineering analysis tasks, is also preferred [3, 4, 5, 6, 7, 8]. The computational algorithms for surface topology are mature [9, 10, 11, 12, 13, 14, 15, 16], while computing the topologies of 3-manifolds still remains widely open. Both the homology group and the fundamental group are important topological invariants for 3-manifolds. Although the homology group is much easier to compute than the fundamental group, the price is that it conveys much less information than the fundamental group [17]. In theory, all 3-manifolds can be canonically decomposed to a unique collection of prime manifolds, whose topologies are solely determined by their fundamental
groups. Since most connected solids in the real world are prime 3-manifolds, computing the fundamental groups for 3-manifolds is a practical solution to understand their topologies. To the best of the authors' knowledge, there is no general practical system in engineering fields which can verify whether two 3-manifolds are topologically equivalent. To tackle this problem, we present the first efficient algorithms to compute fundamental groups of general 3-manifolds represented by tetrahedral meshes. The computational complexity of our algorithms is greatly reduced by converting an input tetrahedral mesh (i.e., a simplicial complex) to a CW complex [18].
2 Related Work

Computational topology has emerged as a very active field [19, 1, 2]. It is beyond the scope of this paper to give a thorough review; here we only briefly review the most related works. A closed 2-manifold surface can be sliced into a simple polygon, which is called a polygonal schema. Since the boundary of the polygonal schema provides the loops for the fundamental group and the homology group of the surface, it has been intensively studied in the field of computational topology. Vegter and Yap [9] present an efficient algorithm for computing a canonical form of a polygonal schema from a closed 2-manifold mesh. From the canonical form they introduce an algorithm for computing the fundamental group of surface meshes. By using the Seifert-van Kampen theorem, Dey and Schipper [10] present a linear time algorithm for computing a polygonal schema of a 2-manifold mesh. Lazarus et al. [13] provide optimal algorithms for computing a canonical polygonal schema of a surface. Erickson and Har-Peled [14] show that it is NP-hard to get an optimal polygonal schema, which has the minimal boundary edge lengths. Colin de Verdière and Lazarus [15] provide an algorithm to find an optimal system of loops among all simple loops obtained from a canonical polygonal schema. Erickson and Whittlesey [20] introduce greedy algorithms to construct the shortest loops in the fundamental group or the first homology group. Yin et al. [16] compute the shortest loop in a given homotopy class by using the universal covering space. We utilize the idea of reducing the problem dimension [10,11,21] in the computation of fundamental groups for 3-manifolds with CW cell decompositions. In contrast to the algorithms for computing homology groups [11, 21], this work focuses on computing fundamental groups, which convey much more topological information in the case of 3-manifolds. Furthermore, our method is general for 3-manifolds which cannot even be embedded in ℝ³, and is efficient for large tetrahedral meshes reconstructed from medical images.
3 Background

In our work, we convert input tetrahedral meshes, namely simplicial complexes, to CW complexes, and then compute their fundamental groups, which are given as symbolic representations. Here we only briefly introduce the concepts directly related to our algorithms. We refer readers to [22, 18, 23] for more details.
Fig. 1. CW Complex of a solid torus. The original solid torus in (a) is decomposed to one 0-cell, shown in (b) with green color, two 1-cells shown in (b) as red arcs, two 2-cells in (c) and one 3-cell in (d).
3.1 Simplicial Complex

The topology of tetrahedral meshes is typically represented as a simplicial complex: a set of 1, 2, 3 and 4-element subsets of a set of labels, corresponding respectively to the vertices, edges, triangles, and tetrahedra of the mesh. A simplicial complex can be considered as just the connectivity part of a traditional tetrahedral mesh.

3.2 CW Complex

The CW complex is a generalization of the simplicial complex, introduced by J.H.C. Whitehead in 1949. Given a tetrahedral mesh, a CW complex forms a topological skeleton of the mesh, which is far more flexible than the simplicial complex representation. A topological space is called an n-cell if it is homeomorphic to ℝⁿ. For example, a 0-cell is a point, a 1-cell is a space curve segment, a 2-cell is a surface patch and a 3-cell is a solid. This homeomorphism maps the boundary of an n-cell to the (n−1)-sphere. Given a 3-manifold M with one component, a Hausdorff topological space X is its CW complex if it can be constructed, starting from discrete points, by first attaching 1-cells, then 2-cells, then 3-cells, represented as M⁰ ⊆ M¹ ⊆ M² ⊆ M³. Each Mᵏ is called the k-skeleton, obtained by attaching k-cells to Mᵏ⁻¹, identifying the boundary of each k-cell with the union of some collection of (k−1)-cells in the complex.

For example (shown in Figure 1), to construct a 3-dimensional CW complex we begin with the empty set. Then we attach 0-cells by unioning disjoint points into the set (Figure 1(b)). We attach 1-cells by unioning space curve segments whose endpoints lie on these points (Figure 1(b)). We attach 2-cells by unioning surface patches whose boundaries lie on the space curve segments (Figure 1(c)). We attach 3-cells by filling in closed regions bounded by surfaces (Figure 1(d)). A particular choice of a collection of skeletons and attaching maps for the cells is called a CW structure on the space, which is not unique in general.

3.3 Fundamental Group

In a topological space X, a path is a continuous map f : I → X, where I is the unit interval [0, 1]. Two paths f0 and f1 which share the two end points (i.e., f0(0) = f1(0) and f0(1) = f1(1)) are homotopic to each other if one can be continuously deformed to
another in X while the two end points are kept fixed during the deformation. The set of all paths homotopic to a path f is called the homotopy class of f, denoted by [f]. Given two paths f, g : I → X such that f(1) = g(0), there is a composition f · g that traverses first f and then g, defined by the formula

(f · g)(s) = f(2s) for 0 ≤ s ≤ 1/2, and (f · g)(s) = g(2s − 1) for 1/2 ≤ s ≤ 1.

In particular, suppose a path f : I → X has the same starting and ending point f(0) = f(1) = x0 ∈ X; then f is called a loop, and the common starting and ending point x0 is referred to as the base point.

Definition 1 (Fundamental Group). The set of all homotopy classes [f] of loops f : I → X at the base point x0 is a group with respect to the product [f][g] = [f · g], which is called the fundamental group of X at the base point x0 and denoted by π₁(X, x0).

If X is path connected, then for any base points x0, y0 ∈ X the fundamental groups π₁(X, x0) and π₁(X, y0) are isomorphic; therefore, we can omit the base point and denote the fundamental group as π₁(X). We represent the fundamental group as ⟨S; R⟩: the group generated by the set S of generators, represented as a set of non-commutative symbols, subject to the set R of relations, represented as words formed using these symbols.
4 Algorithm

Given a 3-manifold represented by a simplicial complex (a tetrahedral mesh) M, our goal is to compute its fundamental group π₁(M), represented as generators and relations ⟨S; R⟩. Considering that the number of simplexes in M is in general so high that direct computation is prohibitively expensive, our algorithms are built on the CW complex representation of M instead of the simplicial representation. So the first step of our algorithms is to compute the CW cell decomposition of the input tetrahedral mesh M. The following lemmas then give the keys of the next steps of our algorithms for computing the generators and relations of M: they tell that these depend only on the 2-skeleton M² and the 1-skeleton M¹ of the CW complex representation of M. We refer readers to the Appendix for the proofs.

Lemma 1. The fundamental group π₁(M) is isomorphic to the fundamental group π₁(M²).

Lemma 2. The fundamental group π₁(M¹) is a free group (only generators, no relations), π₁(M¹) = ⟨γ₁, γ₂, · · · , γₙ⟩.
Suppose M² = M¹ ∪ {σ₁², σ₂², · · · , σₙ₂²}, where each σᵢ² is a 2-cell; then the boundary ∂σᵢ² of each 2-cell is a loop in M¹. The fundamental group of M² has the form

π₁(M²) = ⟨γ₁, γ₂, · · · , γₙ ; [∂σ₁²], [∂σ₂²], · · · , [∂σₙ₂²]⟩

where [∂σᵢ²] is the homotopy class of ∂σᵢ² in π₁(M¹), represented by a word formed by the γₖ's.
Fig. 2. CW cell decomposition: (a) an input non trivial tetrahedral mesh, obtained by removing a solid two hole torus from a solid sphere; (b) the 2-skeleton of the tetrahedral mesh; (c) the different 2-cells in the 2-skeleton illustrated with different colors; (d) the 1-skeleton of the tetrahedral mesh; vertices in the 1-skeleton whose valence is greater than two, belong to the 0-skeleton
4.1 Computing CW Complex

Suppose M is a 3-manifold represented by a tetrahedral mesh. Our goal is to compute a CW complex M⁰ ⊆ M¹ ⊆ M² ⊆ M³, where Mᵏ is the k-skeleton, obtained by attaching k-cells to Mᵏ⁻¹. In the following discussion, the terms vertex, edge, triangle, and tetrahedron refer to the simplicial complex. The algorithm starts with an input tetrahedral mesh M, which is equivalent to the 3-skeleton M³. Initially, we set M⁰, M¹, and M² as empty sets. Since M³ = M² ∪ {σ₁³, σ₂³, · · · , σₙ₃³}, where each σᵢ³ is a 3-cell, suppose Δ³ is a tetrahedron in M; then Δ³ must belong to a 3-cell σᵢ³. We merge all the tetrahedra sharing a face with Δ³ into Δ³ to form a bigger 3-cell. We keep growing this 3-cell until all the tetrahedra are exhausted or the 3-cell cannot be extended further; then the 3-cell is σᵢ³. We then select another tetrahedron in M³ \ σᵢ³ and get another 3-cell. We repeat this process until all tetrahedra are removed. What is left is the 2-skeleton M². The computation of the 2-cells and the 1-skeleton M¹ is very similar: we select a triangle Δ² ∈ M², and by growing Δ² we can find a 2-cell. By repeatedly removing 2-cells from M², we obtain M¹. All the vertices in M¹ whose valence is not equal to 2 form the 0-skeleton M⁰. The connected components of M¹ \ M⁰ are the 1-cells. Algorithm 1 gives the general procedure to get a k-skeleton from a (k+1)-skeleton. Figure 2 shows an example of the CW cell decomposition of an input nontrivial tetrahedral mesh.

4.2 Computing Generators

According to Lemmas 1 and 2, the generators of π₁(M) are the generators of π₁(M¹). To compute the generators of π₁(M¹), we can treat the 1-skeleton M¹ as a graph G by considering the 0-cells as nodes and the 1-cells as edges. Then the generators of π₁(M¹) are simply those loops whose compositions can generate all possible loops in G.
Algorithm 1. Computing CW Complex
  Set a randomly picked (k+1)-simplex Δ^{k+1} as seed;
  Initialize the (k+1)-cell σ^{k+1} with seed and mark it;
  queue += all the (k+1)-simplexes sharing a face with seed;
  repeat
    repeat
      Pop a (k+1)-simplex Δ^{k+1} out of queue;
      Let τ be the common face of Δ^{k+1} and σ^{k+1}, where ∂Δ^{k+1} ∩ ∂σ^{k+1} = τ;
      if Δ^{k+1} has not been marked then
        Grow the (k+1)-cell σ^{k+1} by including Δ^{k+1} and τ;
        Mark Δ^{k+1};
        queue += all the unmarked (k+1)-simplexes sharing a face with Δ^{k+1};
      end if
    until queue is empty
    Shrink the current (k+1)-skeleton by removing this (k+1)-cell σ^{k+1}, i.e., M^{k+1} ← M^{k+1} − σ^{k+1};
    if some (k+1)-simplexes Δ^{k+1} in M^{k+1} are not marked then
      Set one of them as the new seed and mark it;
      queue += all the unmarked (k+1)-simplexes sharing a face with seed;
    end if
  until queue is empty
  The k-skeleton M^k is obtained by removing all (k+1)-cells from M^{k+1};
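A compact C++ sketch of this growth loop is shown below; the adjacency list is a hypothetical stand-in for however the mesh stores which (k+1)-simplexes share a k-face, so the sketch only illustrates the grouping step, not the authors' data structures.

```cpp
#include <queue>
#include <vector>

// Group (k+1)-simplexes into (k+1)-cells by flood fill across shared k-faces,
// mirroring the growth loop of Algorithm 1.  adjacency[s] is assumed to list
// the (k+1)-simplexes sharing a k-face with simplex s.  Returns a cell id per
// simplex.
std::vector<int> growCells(const std::vector<std::vector<int>>& adjacency) {
    std::vector<int> cellId(adjacency.size(), -1);
    int nextCell = 0;
    for (int seed = 0; seed < (int)adjacency.size(); ++seed) {
        if (cellId[seed] != -1) continue;        // already absorbed into some cell
        cellId[seed] = nextCell;                 // start a new (k+1)-cell at the seed
        std::queue<int> q;
        q.push(seed);
        while (!q.empty()) {                     // grow the cell across shared faces
            const int s = q.front(); q.pop();
            for (int t : adjacency[s]) {
                if (cellId[t] == -1) {           // unmarked neighbor: merge it in
                    cellId[t] = nextCell;
                    q.push(t);
                }
            }
        }
        ++nextCell;                              // the cell cannot be extended further
    }
    return cellId;
}
```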
The algorithm is as follows. First, we compute a minimal spanning tree T of G. The set of edges in G is then partitioned into the set of tree edges in T and the set of non-tree edges in G \ T. Let e^T be an edge in T and e^{¬T} a non-tree edge in G \ T. When we union a non-tree edge e_i^{¬T} with T, a unique loop γ_i is generated in G. From the properties of a spanning tree, the set {γ_1, γ_2, · · ·, γ_n} is a set of generators for G, where n is the number of non-tree edges in G. Moreover, by the lemmas, the set {γ_1, γ_2, · · ·, γ_n} can be taken as the generators of π1(M).

4.3 Computing Relations

Since the boundary of each 2-cell is a loop in the 1-skeleton, it must be a concatenation of the generators of π1(M^1) we just computed and can therefore be represented as a word w. Because the boundary loop of any 2-cell can be shrunk to a point in M, the word w must equal the identity e; therefore, w is a relation. The algorithm below is based on the graph G and tree T computed by the algorithm of Sect. 4.2. We first give each edge e_i in G an arbitrary orientation; the symbol e_i^{-1} represents the opposite orientation of e_i. We then select an arbitrary orientation of the 2-cell boundary and write the boundary down as a sequence of symbols, using the correspondence between 1-cells and edges in G. Next, we eliminate every symbol e_i (or e_i^{-1}) in the sequence for which e_i is an edge of the spanning tree T. Each remaining symbol must then correspond to a non-tree edge e_j^{¬T}. Finally, we replace each such symbol with the generator γ_j that corresponds to the loop identified by e_j^{¬T}.
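The two steps (a spanning tree for the generators, boundary words for the relations) can be sketched in C++ as follows; the signed-edge encoding and the generator numbering are illustrative choices, not taken from the paper.

```cpp
#include <cstdlib>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int u, v; };   // an oriented 1-cell between 0-cell ids u and v

// Mark spanning-tree edges of the 1-skeleton graph G by breadth-first search;
// every edge left unmarked corresponds to one generator gamma_i of pi_1(M^1).
std::vector<bool> markTreeEdges(int numNodes, const std::vector<Edge>& edges) {
    std::vector<std::vector<std::pair<int, int>>> adj(numNodes); // (neighbor, edge id)
    for (int e = 0; e < (int)edges.size(); ++e) {
        adj[edges[e].u].push_back({edges[e].v, e});
        adj[edges[e].v].push_back({edges[e].u, e});
    }
    std::vector<bool> inTree(edges.size(), false);
    std::vector<bool> visited(numNodes, false);
    for (int root = 0; root < numNodes; ++root) {       // one tree per component
        if (visited[root]) continue;
        visited[root] = true;
        std::queue<int> q;
        q.push(root);
        while (!q.empty()) {
            int n = q.front(); q.pop();
            for (auto [nb, e] : adj[n])
                if (!visited[nb]) { visited[nb] = true; inTree[e] = true; q.push(nb); }
        }
    }
    return inTree;
}

// Turn the oriented boundary of a 2-cell into a relation word.  The boundary is
// given as signed edge ids (+(e+1) forward, -(e+1) reversed); tree edges are
// dropped and each remaining edge is replaced by its generator number, with the
// sign carrying the orientation.  generatorOf[e] numbers the non-tree edges
// consecutively from 1 and is ignored for tree edges.
std::vector<int> boundaryToWord(const std::vector<int>& boundary,
                                const std::vector<bool>& inTree,
                                const std::vector<int>& generatorOf) {
    std::vector<int> word;
    for (int signedEdge : boundary) {
        const int e = std::abs(signedEdge) - 1;
        if (inTree[e]) continue;                        // tree edges contribute nothing
        const int g = generatorOf[e];
        word.push_back(signedEdge > 0 ? g : -g);
    }
    return word;
}
```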
4.4 Group Representation Simplification

The group presentation obtained from the previous procedures has redundancies. To simplify the presentation, we first remove some redundancies with the following simple algorithm.
1. Sort the relations by their lengths.
2. For each relation of length one, w = γ_k: γ_k is homotopic to a point, so we remove w from the relations, remove γ_k from the generators, and remove γ_k from all relations.
3. For each relation of length two, w = γ_i γ_j: γ_j = γ_i^{-1}, so we remove w from the relations, remove γ_j from the generators, and replace γ_j by γ_i^{-1} in all relations.
4. Repeat steps 1 through 3 until the lengths of all relations are greater than two.

We then use the computational algebra package GAP [24] for further simplification, which is based on the Tietze transformation program [25] and uses four elementary Tietze transformations to convert a presentation into an isomorphic one.
1. Adding a relation: if a relation can be derived from the existing relations, it may be added to the presentation without changing the group.
2. Removing a relation: if a relation in a presentation can be derived from the other relations, it can be removed from the presentation without affecting the group.
3. Adding a generator: given a presentation, it is possible to add a new generator that is expressed as a word in the original generators.
4. Removing a generator: if a relation can be formed in which one of the generators is a word in the other generators, then that generator may be removed.
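A minimal C++ sketch of this pre-simplification is shown below, assuming relations are stored as words of signed generator indices (the same encoding as in the previous sketch); free reduction of the words after a substitution is omitted for brevity, so this is an illustration of the loop structure rather than a complete implementation.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// A relation is a word over the generators: +g means gamma_g, -g its inverse
// (generators are numbered from 1).
using Word = std::vector<int>;

// Pre-simplification of Sect. 4.4: remove generators that occur as length-one
// relations and substitute gamma_j = gamma_i^-1 for length-two relations.
// 'alive' has one flag per generator (index 0 unused).
void simplifyPresentation(std::vector<Word>& relations, std::vector<bool>& alive) {
    bool changed = true;
    while (changed) {
        changed = false;
        std::sort(relations.begin(), relations.end(),           // step 1: sort by length
                  [](const Word& a, const Word& b) { return a.size() < b.size(); });
        for (std::size_t idx = 0; idx < relations.size() && !changed; ++idx) {
            Word& w = relations[idx];
            if (w.size() == 1) {                                 // step 2: gamma_k bounds a disk
                const int k = std::abs(w[0]);
                alive[k] = false;
                for (Word& r : relations)                        // erase gamma_k everywhere
                    r.erase(std::remove_if(r.begin(), r.end(),
                            [k](int s) { return std::abs(s) == k; }), r.end());
                changed = true;
            } else if (w.size() == 2 && std::abs(w[0]) != std::abs(w[1])) {
                const int i = w[0], j = w[1];                    // step 3: gamma_j = gamma_i^-1
                alive[std::abs(j)] = false;
                w.clear();                                       // this relation is consumed
                for (Word& r : relations)                        // substitute in all relations
                    for (int& s : r) {
                        if (s == j)       s = -i;
                        else if (s == -j) s =  i;
                    }
                changed = true;
            }
        }
        relations.erase(std::remove_if(relations.begin(), relations.end(),  // drop empty words
                        [](const Word& r) { return r.empty(); }), relations.end());
    }
}
```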
5 Experimental Results

In this section, we analyze the complexity of our algorithms and then apply them to general 3-manifolds to compute their fundamental groups.

5.1 Complexity Analysis

Suppose the input 3-manifold has n_1 edges, n_2 triangles, and n_3 tetrahedra; then the complexity of the algorithm that converts it to a CW complex is linear, O(n_1 + n_2 + n_3). Let m_k be the number of k-cells in the CW complex, k = 0, 1, 2, 3, with m_k < n_k. In general m_k ≪ n_k, and our algorithm minimizes m_k; m_1 equals the number of group generators and m_2 equals the number of relations. Computing the group generators then takes m_1 steps, and computing the relations takes fewer than m_2 steps. Therefore, the total complexity is O(n_1 + n_2 + n_3) + O(m_1 + m_2). The complexity of the final step, which uses Tietze transformations to simplify the group presentation, is difficult to analyze because, in theory, finding the simplest presentation of a group by Tietze transformations is undecidable. We use a heuristic algorithm whose cost is linear in the number of generators and relations of the input group.
5.2 General 3-Manifolds

We test the algorithms introduced in Sec. 4 on general 3-manifolds. Due to the page limit, we list only one here. Figure 3 illustrates a complicated 3-manifold, constructed by removing a solid knot and a solid two-hole torus from a solid torus. We compute its fundamental group, which has four generators and three relations, as follows:
$$\langle a, b, c, d \;;\; b^{-1}c^{-1}bc,\; a^{-1}b^{-1}dbd^{-1}bab^{-1},\; b^{-1}d^{-1}bd^{-1}ab^{-1}a^{-1}d^{-1}aba^{-1}daba^{-1}b^{-1}ab^{-1}a^{-1}d^{-1}ab^{-1}a^{-1}daba^{-1}d \rangle$$
Fig. 3. A complicated 3-manifold is constructed by removing a solid knot and a solid two-hole torus from a solid torus. Four loops are marked with different colors, each of which corresponds to a generator that cannot be shrunk to a point in the 3-manifold.
6 Conclusions and Future Work

In this paper, we provided a practical tool for computing the topology of general 3-manifolds via their fundamental groups. For the input tetrahedral mesh, we perform a CW cell decomposition to reduce the computational complexity. We also proved that the generators of the fundamental group of a 3-manifold come from its 1-skeleton and the relations come from its 2-cells in the CW complex, and we presented a method to simplify the fundamental group presentation by algebraic symbolic computation. In the future, we will apply our algorithms to applications such as isotopy detection, detection of handle and tunnel loops [7], DNA molecular structure, path planning in robotics, isotopy surface classification, and collision detection in animation.
Acknowledgments

We thank Tamal K. Dey for references and advice and Alexander Hulpke for assistance with using GAP. The project is partially supported by NSF and NSFC. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (KRF-2008-331D00510).
References 1. Edelsbrunner, H.: Biological applications of computational topology. In: Goodman, J.E., O’Rourke, J. (eds.) Handbook of Discrete and Computational Geometry, pp. 1395–1412. CRC Press, Boca Raton (2004) 2. Moore, E.L.F., Peters, T.J.: Computational topology for geometric design and molecular design. In: Ferguson, D., Peters, T. (eds.) Mathematics for Industry: Challenges and Frontiers, pp. 125–137. SIAM, Philadelphia (2005) 3. Hart, J.C.: Using the CW-complex to represent the topological structure of implicit surfaces and solids. In: Proc. Implicit Surfaces 1999, pp. 107–112 (1999) 4. Amenta, N., Peters, T.J., Russell, A.: Computational topology: Ambient isotopic approximation of 2-manifolds. Theoretical Computer Science 305, 3–15 (2003) 5. Abe, K., Bisceglio, J., Peters, T., Russell, A., Ferguson, D., Sakkalis, T.: Computational topology for reconstruction of surfaces with boundary: integrating experiments and theory. In: Shape Modeling and Applications, 2005 International Conference, pp. 288–297 (2005) 6. Abe, K., Bisceglio, J., Ferguson, D., Peters, T., Russell, A., Sakkalis, T.: Computational topology for isotopic surface reconstruction. Theoretical Computer Science 365, 184–198 (2006) 7. Dey, T.K., Li, K., Sun, J.: On computing handle and tunnel loops. In: 2007 International Conference on Cyberworlds, pp. 357–366 (2007) 8. DiCarlo, A., Milicchio, F., Paoluzzi, A., Shapiro, V.: Chain-based representations for solid and physical modeling. IEEE Trans. Automation Science and Engineering 5 (to appear, 2008) 9. Vegter, G., Yap, C.K.: Computational complexity of combinatorial surfaces. ACM SoCG 1990, 102–111 (1990) 10. Dey, T.K., Schipper, H.: A new technique to compute polygonal schema for 2-manifold with application to null-homotopy detection. Discrete & Computational Geometry 14, 93–110 (1995) ´ 11. Kaczy´nski, T., Mrozek, M., Slusarek, M.: Homology computation by reduction of chain complexes. Computer & Mathematics with Applications 35, 59–70 (1998) 12. Dey, T.K., Guha, S.: Computing homology groups of simplicial complexes in R3 . Journal of ACM 45, 266–287 (1998) 13. Lazarus, F., Vegterz, G., Pocchiola, M., Verroust, A.: Computing a canonical polygonal schema of an orientable triangulated surface. ACM SoCG 2001, 80–89 (2001) 14. Erickson, J., Har-Peled, S.: Optimally cutting a surface into a disk. Discrete & Computational Geometry 31, 37–59 (2004) ´ 15. de Verdi`ere, E.C., Lazarus, F.: Optimal system of loops on an orientable surface. Discrete & Computational Geometry 33(3), 507–534 (2005) 16. Yin, X., Jin, M., Gu, X.: Computing shortest cycles using universal covering space. The Visual Computer 23, 999–1004 (2007) 17. Hempel, J.: 3-Manifolds. AMS Chelsea Publishing (1935) 18. Hatcher, A.: Algebraic Topology. Cambridge University Press, Cambridge (2001) 19. Dey, T.K., Edelsbrunner, H., Guha, S.: Computational topology. In: Chazelle, B., Goodman, J.E., Pollack, R. (eds.) Advances in Discrete and Computational Geometry. Contemporary Mathematics 223, pp. 109–143. AMS (1999) 20. Erickson, J., Whittlesey, K.: Greedy optimal homotopy and homology generators. In: 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1038–1046 (2005) 21. Dey, T.K., Guha, S.: Transforming curves on surfaces. Journal of Computer and System Sciences 58, 297–325 (1999) 22. Munkres, J.R.: Elements of Algebraic Topology. Benjamin Cummings (1984)
23. Lyndon, R.C., Schupp, P.E.: Combinatorial Group Theory. Springer, Heidelberg (2001) 24. GAP: Groups, Algorithms, Programming - a System for Computational Discrete Algebra, http://www-gap.dcs.st-and.ac.uk/ 25. Robertson, E.F.: Tietze transformations with weighted substring search. Journal of Symbolic Computation 6, 59–64 (1988)
Appendix

Proof (of Lemma 1). From the CW complex definition, $M^3 = M^2 \cup \{\sigma_1^3, \cdots, \sigma_{n_3}^3\}$. Let $\bar{M}^2$ be a tubular neighborhood of $M^2$ (i.e., a thickened $M^2$), and set
$$U_1 = \bar{M}^2 \cup \{\sigma_1^3, \cdots, \sigma_{n_3-1}^3\}, \qquad U_2 = \sigma_{n_3}^3;$$
then both $U_1$ and $U_2$ are path connected. $U_1 \cap U_2$ is a tubular neighborhood of the boundary of $U_2$, which retracts to a sphere; therefore $U_1 \cap U_2$ is path connected, and both $\pi_1(U_2)$ and $\pi_1(U_1 \cap U_2)$ are trivial. By applying the Seifert-van Kampen theorem [18], we get $\pi_1(M^3) = \pi_1(U_1)$. This shows that the fundamental group is preserved by removing a 3-cell. Repeating this process to remove all 3-cells gives $\pi_1(M^3) = \pi_1(\bar{M}^2)$. Because $M^2$ is a deformation retract of $\bar{M}^2$, we conclude $\pi_1(M^2) = \pi_1(\bar{M}^2) = \pi_1(M^3)$.

Proof (of Lemma 2). By induction on $n_2$. For $n_2 = 0$, the 1-skeleton $M^1$ is a graph, so its fundamental group is a free group [18], $\pi_1(M^1) = \langle \gamma_1, \gamma_2, \cdots, \gamma_n \rangle$, where the $\gamma_i$'s are independent loops of the graph. Suppose the statement holds for $n_2 < k$, and assume $n_2 = k$, $M^2 = M^1 \cup \{\sigma_1^2, \sigma_2^2, \cdots, \sigma_k^2\}$. Let $\bar{M}^1$ be a tubular neighborhood of $M^1$, and set
$$U_1 = \bar{M}^1 \cup \{\sigma_1^2, \sigma_2^2, \cdots, \sigma_{k-1}^2\}, \qquad U_2 = \sigma_k^2;$$
then both $U_1$ and $U_2$ are path connected. $U_1 \cap U_2$ is a topological annulus that retracts to $\partial U_2$. Since $\pi_1(U_2)$ is trivial, the loop $\partial U_2$ is homotopic to a point in $U_2$. $\partial U_2$ is also a loop in $U_1$; we use $[\partial U_2]$ to denote its homotopy class in $\pi_1(U_1)$, which can be represented as an element of $\pi_1(M^1)$, namely a word in the $\gamma_k$'s. By the induction hypothesis,
$$\pi_1(U_1) = \langle \gamma_1, \gamma_2, \cdots, \gamma_n \;;\; [\partial\sigma_1^2], [\partial\sigma_2^2], \cdots, [\partial\sigma_{k-1}^2] \rangle.$$
According to the Seifert-van Kampen theorem [18], $\pi_1(M^2)$ is obtained by adding $[\partial U_2]$ to the relations of $\pi_1(U_1)$; therefore
$$\pi_1(M^2) = \langle \gamma_1, \gamma_2, \cdots, \gamma_n \;;\; [\partial\sigma_1^2], [\partial\sigma_2^2], \cdots, [\partial\sigma_k^2] \rangle.$$
OmniMap: Projective Perspective Mapping API for Non-planar Immersive Display Surfaces Clement Shimizu, Jim Terhorst, and David McConville The Elumenati, LLC
Abstract. Typical video projection systems display rectangular images on flat screens. Optical and perspective correction techniques must be employed to produce undistorted output on non-planar display surfaces. A two-pass algorithm, called projective perspective mapping, is a solution well suited for use with commodity graphics hardware. This algorithm is implemented in the OmniMap API providing an extensible, reusable C++ interface for porting 3D engines to wide field-of-view, non-planar displays. This API is shown to be easily integrated into a wide variety of 3D applications.
1 Introduction
Artists, architects, and engineers have long attempted to understand and simulate spatial perspective as perceived by the human visual system. This quest is widely associated with the formal development of linear perspective from the 15th century onwards that became the hallmark of European Renaissance painting. These mathematically derived techniques forced an artificial projection of three-dimensional scenes onto a two-dimensional plane, requiring that the experience of perspective be based on the metaphor of peering through a planar window from a single point. Numerous attempts have been made to define a “natural” non-planar and panoramic perspective that more closely mimics the visual field of view. Leonardo da Vinci illustrated the divergence between artificial linear perspective and a more natural non-planar perspective[1], and numerous artists have since turned to curvilinear and hemispherical architectural forms as canvases upon which to portray visual depth. Development of both optical and mathematical techniques for representing perspective within cathedrals, panoramic exhibits, planetaria, flight simulators, and surround cinema laid the groundwork for contemporary visually immersive display technologies[2,3]. Though hemispherical and panoramic screens have long been used to surround audiences with imagery, their cost, complexity, lack of standards[4], and size requirements have limited their widespread adoption. The use of panoramic and multi-screened virtual reality environments requiring multiple front and rear projectors has largely been limited to academic and corporate research and military training applications. However, recent advancements in graphics processing power, surround projection technologies, and material construction techniques are enabling the adoption of non-planar immersive virtual environments for a
broad range of artistic, scientific, educational, and entertainment applications. Though many of the core technologies have been available for decades[5], a new generation of portable dome and panoramic systems as well as permanent digital “fulldome” theaters are providing relatively simple and cost-effective methodologies for visually immersing participants inside of computer-generated imagery using a single projector and image generator. The OmniMap application programming interface (API) has been designed to enable existing interactive 3D software applications to be adapted for use within these wide field-of-view non-planar displays. It uses object-oriented techniques to enable extensibility, configurability, and reusability so that the minimum amount of effort is required for this adaptation. It is freely available on the web at [6] and has been integrated into a variety of 3D toolkits, programming environments and applications based on Direct3D9, Direct3D10, and OpenGL. This paper describes the algorithm employed to produce the correct viewing perspective output on non-planar display surfaces, the software architecture that supports the extensibility, configurability, and reusability, and the integration of the OmniMap API into various software applications. The cover shot shows OmniMap integrated into SCISS AB’s Uniview for realtime visualization of the universe (left). An OmniFocus projector sits alongside a star ball to update a planetarium’s scientific visualization capabilities (right). The next section provides an overview of various techniques used to produce imagery for non-planar immersive projection systems, including a two pass technique well suited for implementation on modern graphics hardware. Although various forms of the two pass algorithm have been used in many immersive computer graphics systems, the details have never been adequately described in the literature. This paper formalizes the algorithm in Section 3. Section 4 provides the implementation details of the OmniMap API needed to reproduce, utilize, and extend the technique. Finally, the flexibility and extensibility of the OmniMap API is evidenced by the successful integration into many applications and SDKs in Section 5.
2 Background
There are currently three primary approaches for rendering projections on nonplanar surfaces: ray tracing, quadric surface transfer, and a two-pass rendering method. When projecting images onto non-planar display devices, ray tracing is
the most straightforward approach. First, rays are cast from the projector to the screen's surface to compute the optical warping of the screen. From the viewer's position, rays are cast through this intersection point into the scene to compute the final image. Because of the computational requirements of ray tracing, this only works for content that can be rendered offline. Quadric surface transfer (QST) is a popular rendering technique to create large seamless curved displays using overlapping projectors[7,8,9]. QST systems typically need to use many projectors to cover a curved surface, rendering at very high resolution. The rendering phase for QST runs in a single pass, but that pass is run on each projector, so the implementation typically uses a computer for each projector. The greatest drawback of QST systems is the requirement for highly tessellated objects, since major artifacts result if objects are not finely tessellated. Because the application's 3D content needs to be either re-tessellated offline or dynamically re-tessellated, QST is not yet practical as a general-purpose perspective correction library using commodity graphics processing units. The two-pass rendering method is well suited for use with commodity graphics hardware since the scene does not need to be re-tessellated and is compatible with all versions of OpenGL and DirectX that support vertex and pixel shading. The first pass renders the scene into a set of off-screen textures, a method similar to generating cubic environment maps. The second pass renders and warps the screen surface and maps the textures to it using projective texturing. This approach is commonly used to correct perspective for immersive displays and is the approach taken by the OmniMap API. It has also been used within Elumens' SPIClops API for fisheye projection[10,11], Paul Bourke's Spherical Mirror Projection[12], and the University of Minnesota's three-projector panoramic VR Window[13]. An early reference to a related two pass technique is found in [14]. As graphics processing power and projector resolution have increased, the two-pass rendering technique offers a method for taking advantage of simplified and lower-cost immersive projection systems. The OmniMap API has been developed to provide an extensible, reusable C++ interface that can be adapted to existing and future commodity graphics hardware and most 3D software engines regardless of the type and level of abstraction. Although various forms of the two pass algorithm have been used in many immersive computer graphics systems, the details have never been adequately described in the literature. This paper formalizes the algorithm and provides the implementation details needed to reproduce, utilize, and extend the technique. In order to thoroughly describe the algorithms and implementations associated with the two pass rendering method, this paper proposes that it be named projective perspective mapping.
3 Projective Perspective Mapping
Typical video projection systems display rectangular images on flat screens. Optical and perspective correction techniques must be employed to produce undistorted output on planetariums, domes, panoramas, and other non-planar display surfaces. In this section we describe the methodology derived by the authors for projective perspective mapping, which facilitates interactive perspective correct rendering into a wide frustum, non-planar surface.
Fig. 1. Projective perspective mapping generates the user's perspective by rendering the scene into a subset of cube map faces, then uses projective texturing to map the perspective onto an optically corrected screen surface mesh. MOi, MPi, and MTi represent the view offset, projection, and projective texturing matrices for each channel.
Although projective perspective mapping is capable of serving many projection system types, this discussion is focused on a specific class of projection systems with wide field of view optics similar to a fisheye lens. Section 3.1 discusses the unique properties of the projector. Section 3.2 describes the algorithm's first pass, where channels (cube faces) are rendered into off-screen frame buffers (frame buffer objects or render textures). Section 3.3 describes the second pass, where a mesh representing the projection surface is rendered and warped using a vertex shader to account for the spherical projection of the wide field of view optics, and painted in using projective texture mapping. Figure 1 illustrates the flow of the algorithm.
Fig. 2. A fisheye lens replaces a projector's stock lens (left). Lens offset configurations (right): centered, truncated, and fulldome.
3.1 Projection System
The projector shown in Figure 2 is an Omnifocus™ HAL-SX6 color projector with 6500 lumens brightness and 1400 x 1050 resolution. The stock lens from Christie has been removed and replaced with a custom wide field of view lens from The Elumenati, LLC. If the lens is centered on the projector's DMD panel, it projects 180◦ along the horizontal axis and 135◦ along the vertical axis, Figure 2 (right, top). The lens is offset vertically relative to the center of the DMD panel to optimize pixel usage in projection. The resultant projected FOV is ±90◦ horizontal and +90◦, −45◦ vertical (middle). “Full dome” lenses are available that project 180◦ in both the vertical and horizontal axes (bottom). Most fisheye lenses tend toward an “f θ” pixel distribution: the angle to which a specific pixel is projected is linearly proportional to the pixel's distance from the lens's optical axis. The result is an equiangular pixel distribution across the entire projected field. Rectilinear projectors typically have an f tan(θ) angular pixel distribution. In addition, if the f θ lens is centered in a dome, it will project with uniform brightness across the entire screen surface. If possible, avoiding brightness uniformity corrections is beneficial because they inevitably reduce the screen's overall brightness and contrast. While the extremely wide projected field of view allows a single projector to cover almost any screen shape, the nearly infinite depth of field allows the image to remain in focus. In optics, a common rule of thumb is that, for a lens of focal length f, an object that is ≥20f away is essentially infinitely far away. Reversing this rule for projection, if the screen is ≥20f away from the lens the image will always be in focus. The fisheye used in this system has a focal length of 6 mm; the extremely short focal length is a byproduct of the extremely wide FOV of the fisheye lens design. This allows the projector to be placed anywhere in relation to a screen of arbitrary shape (dome, cylinder, etc.) and still maintain focus on the entire screen as long as the nearest point is at least 12 cm away.
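Because the angular mapping matters later for the vertex warp, it may help to see the two pixel-distribution models side by side; the helpers below are a generic illustration (not code from the OmniMap API), with angles in radians measured from the optical axis.

```cpp
#include <cmath>

// Radial distance on the image plate at which a ray leaving the lens at angle
// theta lands, for the two pixel-distribution models discussed above.
double fThetaRadius(double f, double theta)      { return f * theta; }            // fisheye: r = f*theta
double rectilinearRadius(double f, double theta) { return f * std::tan(theta); }  // standard lens: r = f*tan(theta)
```

As theta approaches 90◦, f·tan(θ) diverges, which is another way of seeing why a conventional rectilinear lens cannot cover the hemispherical field that the f θ lens handles with equiangular pixel spacing.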
3.2 First Pass – Generating the User's Perspective
Through the two-pass algorithm described next, projective perspective mapping takes into account the prescription of the optics described above, the shape of the screen, and the position of the viewer in order to generate the correct image at the projector's image plane. The initial pass generates the user's perspective by rendering the scene into faces (called render channels) of a cube-map-like structure. A cube map with sufficient resolution can perfectly represent any view of the scene from a single vantage point. The scene is rendered through a subset of the faces of a cube; the n faces are chosen so as to fill the display surface with imagery from the perspective of the sweet spot of the audience. For each channel i, a view offset matrix MOi is computed, representing the offset from the default view to the view through the cube map face. This matrix also stores the translational offset of the audience sweet spot. The perspective projection matrix MPi is also needed for each channel. The use of these matrices is the only change to the application's rendering loop; the scene's geometry does not need to be re-tessellated. In the case of an upright dome, three channels are required to fill the display surface, and the frustum and view offset rotation for each of those channels must be computed. If channel 1 is the left view, its offset matrix MO1 is a 45◦ rotation to the left. Channel 2's matrix MO2, the right view, is a 45◦ rotation to the right. The top channel's matrix MO3 rotates and twists the view to capture the view through the top face of a cube. The perspective projection matrix MPi for each channel is set to have a symmetric 90◦ FOV and near and far clip planes suitable for the scene. When the optimal viewing position is not placed at the center of the dome, the ideal channel frustums may be asymmetric and wider than 90◦. Asymmetric frustums enable an optimal frustum size, saving rendering time if the application implements frustum culling.
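A possible first-pass setup for the three-channel upright-dome case might look like the following sketch, written here with the GLM math library; the rotation signs, axis conventions, and clip-plane values are assumptions made for illustration and are not taken from the OmniMap source.

```cpp
#include <vector>
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// Sketch of the first-pass setup for an upright dome: three render channels,
// each with a 90-degree symmetric frustum and a view-offset rotation away from
// the default view.  The exact signs and axes depend on the application's
// coordinate conventions.
struct RenderChannel { glm::mat4 viewOffset; glm::mat4 projection; };

std::vector<RenderChannel> makeDomeChannels(const glm::vec3& sweetSpot) {
    const glm::mat4 proj = glm::perspective(glm::radians(90.0f), 1.0f, 0.1f, 1000.0f);
    const glm::mat4 toSweetSpot = glm::translate(glm::mat4(1.0f), -sweetSpot);
    const glm::vec3 up(0.0f, 1.0f, 0.0f), right(1.0f, 0.0f, 0.0f);

    std::vector<RenderChannel> channels(3);
    channels[0].viewOffset =                       // channel 1: 45 degrees to the left
        glm::rotate(glm::mat4(1.0f), glm::radians( 45.0f), up) * toSweetSpot;
    channels[1].viewOffset =                       // channel 2: 45 degrees to the right
        glm::rotate(glm::mat4(1.0f), glm::radians(-45.0f), up) * toSweetSpot;
    channels[2].viewOffset =                       // channel 3: tilted up through the top face
        glm::rotate(glm::mat4(1.0f), glm::radians(-90.0f), right) * toSweetSpot;
    for (RenderChannel& c : channels) c.projection = proj;
    return channels;
}
```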
3.3 Second Pass – Mapping the Views to the Display Surface
In the second pass, a mesh representing the projection surface is drawn with vertex and fragment shaders. First, the screen surface mesh is warped using the vertex shader to account for the spherical projection of the wide field of view optics. Then, the channels rendered in the first pass are used as projective texture maps in the fragment shader to fill in the display surface.

Vertex Shader – Optically Correcting the Non-planar Display. Since the f θ lens causes the screen surface to be projected spherically into the environment rather than rectangularly, the vertex shader is used to warp the screen surface from world space to spherical projection space. The screen mesh is tessellated to accommodate the warping. This allows the polygonal mesh of the screen surface to be rendered in such a way that it lines up with the physical screen surface when projected through the spherical lens. This mapping is specific to the optics of the projection system. Although projective perspective mapping can serve all types of projection systems, we work through the math only for
the f θ optics. Calibration for rectangular projectors has been covered in [13], while the mirror-ball case has been covered in [12]. This optical technique was published in the context of a rear-projected, motion-tracked, flexible screen in [15]. In the vertex shader, the world location of each vertex (x, y, z) is converted into spherical space (ϕ, r). The z axis is defined as the axis parallel to the projector's direction, and d is the distance between the vertex and the projector. In Equation (2), the vector ϕ is a unit-length vector in screen space; Equation (1) computes its magnitude as r. Screen pixel coordinates (x', y') are computed by Equation (3). Finally, the z-depth is simply set to d.
$$r = \frac{2}{\pi}\cos^{-1}\!\left(\frac{z}{d}\right) \qquad (1)$$
$$\varphi = \left(\frac{x}{\sqrt{x^2 + y^2}},\; \frac{y}{\sqrt{x^2 + y^2}}\right) \qquad (2)$$
$$(x', y') = r \cdot \varphi \qquad (3)$$
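A CPU-side reference of Equations (1)-(3) is sketched below; the actual warp runs in the vertex shader, so this C++ version is only meant to make the mapping concrete.

```cpp
#include <cmath>

struct Vec4 { float x, y, z, w; };

// CPU-side reference for the vertex warp of Eqs. (1)-(3): a screen-mesh vertex
// given in projector space (z axis along the projector's direction) is mapped
// to normalized image coordinates for an f-theta lens.  The returned z carries
// the distance d so depth testing still works after the warp.
Vec4 warpFTheta(float x, float y, float z) {
    const float pi  = 3.14159265358979f;
    const float d   = std::sqrt(x * x + y * y + z * z);   // vertex-to-projector distance
    const float r   = (2.0f / pi) * std::acos(z / d);     // Eq. (1)
    const float len = std::sqrt(x * x + y * y);
    const float px  = (len > 0.0f) ? x / len : 0.0f;      // Eq. (2): unit vector phi
    const float py  = (len > 0.0f) ? y / len : 0.0f;
    return Vec4{r * px, r * py, d, 1.0f};                 // Eq. (3), plus depth = d
}
```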
A slight modification needs to be made for fisheye lenses that have a non-uniform pixel distribution. To do this, a low-order polynomial is fit to the mapping from the specific lens's distribution to an f θ distribution. The lens correction for standard, non-fisheye projector optics is covered in [13].

Fragment Shader – Mapping the User's Perspective to the Display. The fragment shader now fills in the display surface with the channels rendered in the first pass, using projective texturing. Projective texturing was invented as a technique for rendering shadows onto curved surfaces, but was extended to simulate the effect of using a slide projector to project images onto curved surfaces[16]. A tech report on the subtle details of implementing hardware-accelerated projective texturing can be found in [17]. Projecting the texture requires that the shaders calculate, for each pixel on the screen surface, texture coordinates into the rendered channels drawn in the first pass. These texture coordinates are calculated by applying a transform matrix to the vertex coordinate. For each channel i, the transform matrix MTi is computed from the projection and offset matrices used to generate the channel (Equation 4). The matrix MS is a scale-bias matrix for mapping coordinates from [−1, 1] to the texture coordinate domain [0, 1]. The texture coordinates (s/q, t/q) are computed for every channel by multiplying the vertex position (x, y, z) by the transform matrix (Equation 5).
$$M_{T_i} = M_{O_i} \ast M_{P_i} \ast M_S \qquad (4)$$
$$(s, t, r, q)^T = M_{T_i} \ast (x, y, z, 1)^T \qquad (5)$$
If the screen surface is completely covered by the render channels from the sweet spot of the audience with no overlap, only one of the n texture coordinates will be valid for any pixel. A texture coordinate is valid if and only if (s_i, t_i) lies within the range [0, 1]^2. The texture coordinate of the valid channel is used to index into that channel's texture to retrieve the final color value for the fragment.
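The corresponding texture-transform construction and validity test can be sketched as follows, again using GLM as a CPU-side stand-in for the shader logic; the composition order shown reflects GLM's column-vector convention.

```cpp
#include <glm/glm.hpp>

// Texture transform of Eq. (4) and the per-fragment lookup of Eq. (5), written
// with GLM's column-vector convention (the scale-bias M_S is therefore the
// leftmost factor).  Illustration only, not the shipped shader code.
glm::mat4 textureTransform(const glm::mat4& viewOffset, const glm::mat4& projection) {
    const glm::mat4 scaleBias(0.5f, 0.0f, 0.0f, 0.0f,   // column 0
                              0.0f, 0.5f, 0.0f, 0.0f,   // column 1
                              0.0f, 0.0f, 0.5f, 0.0f,   // column 2
                              0.5f, 0.5f, 0.5f, 1.0f);  // column 3: maps [-1,1] to [0,1]
    return scaleBias * projection * viewOffset;
}

// Returns true if point p on the warped screen surface falls inside the
// channel, and writes the texture coordinate used to sample that channel.
bool lookupChannel(const glm::mat4& mT, const glm::vec3& p, glm::vec2& uv) {
    const glm::vec4 st = mT * glm::vec4(p, 1.0f);        // (s, t, r, q) of Eq. (5)
    if (st.w <= 0.0f) return false;                      // point is behind the channel
    uv = glm::vec2(st.x, st.y) / st.w;                   // (s/q, t/q)
    return uv.x >= 0.0f && uv.x <= 1.0f && uv.y >= 0.0f && uv.y <= 1.0f;
}
```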
Fig. 3. Classes supporting specific graphics APIs such as OpenGL (shown in green), Direct3D9 (light blue), and Direct3D10 (deep blue) are derived from graphics-API-independent base classes (shown in gray). OmniMap_Base, OmniMapChannel_Base, OmniMapShader_Base, and ScreenRenderer_Base need to be derived to incorporate the API-specific function calls. For application-supplied texture support, only the main and channel classes need to be derived; the shader and screen renderer classes are reused for the application-supplied texture implementations. OmniMapScreenShapeFactory creates and manages OmniMapScreen_Base derivations, and OmniMapScreenRendererFactory creates and manages ScreenRenderer_Base derivations.
4 OmniMap API Architecture
This section describes the OmniMap API architecture with an overview of the API and then a description of the functionality of the base classes and their derivations. The OmniMap API provides a framework to enable 3D application, game engine, and toolkit developers to implement rendering to immersive, non-planar displays. The base classes contain the information and utility access methods necessary to implement the algorithm. Derivations of the base classes provide the implementation details for specific rendering APIs and frameworks. The authors have derived the base classes to implement the algorithm for the OpenGL, Direct3D9, and Direct3D10 APIs, and some application architectures can use these implementations “out-of-the-box”. However, the extensibility of OmniMap affords developers the opportunity to adapt OmniMap to the framework of existing applications, toolkits or game engines. Toolkit and game engine developers can derive reusable classes from the OmniMap base classes, thus providing OmniMap functionality to application developers using their toolkit or game engine. OmniMap does not intend to provide the details of implementation for every conceivable software architecture, but instead provides the basic building blocks to enable developers to implement the algorithm with the least effort.
4.1 OmniMap API Classes
In this section, the support for the projective perspective mapping algorithm in the OmniMap base classes is described, followed by an explanation of how these classes can be derived to implement the algorithm. The object-oriented design of OmniMap enables runtime configurability of many properties of the display and software environment. These classes are designed such that derivations can implement the algorithm for:
1. low- and high-level graphics APIs: OpenGL, DirectX, Ogre3D, etc.;
2. various shader languages and shader loading/compiling APIs;
3. various screen shapes.

OmniMap_Base: This is an abstract class for deriving classes that manage the two-pass algorithm. It owns the channel, shader, and screen objects. It directs the rendering of the second pass, in which the channel content is composited to the display surface. It can be derived to implement the algorithm for different rendering engines, and has been derived to implement Direct3D and OpenGL. The base class provides functionality that is common to all derived implementations, including:
1. channel management;
2. creation and storage of the matrix transforms that represent the position and orientation of the projector;
3. the screen shape factory[18], screen management, the screen renderer factory, screen renderer management, and invocation of screen rendering;
4. utility methods for calling the list of channels to bind and unbind their respective textures (used during the second pass to bind the channel textures to texture units for access by the shader);
5. utility methods for executing Lua scripts, which are used primarily for configuration.

OmniMap_Base has two object factories, one for the creation of screen shapes (see OmniMapScreen_Base below) and one for the creation of screen renderers. These factories allow new screen shapes and screen renderers to be created and added at run time. OmniMap_Base uses a Lua scripting facility to initialize itself with the preferred configuration[19]. The Lua script file is executed by the base class; the script then calls back into the OmniMap_Base class through a Lua-to-C++ interface mechanism[20]. These calls provide the OmniMap object with configuration information including:
– the number of channels to be rendered, their resolution, and the projection parameters that define how those channels are composited onto the display surface;
– the underlying graphics API to be used for rendering;
– the shader programs to be used for vertex warping and channel projection;
– the position and orientation of the projector, and the optimal viewing position.
OmniMap includes configuration scripts for the standard dome configurations shipped by The Elumenati. The Lua scripting facility is available to derived classes via a protected member of each base class. The OmniMap API base classes are free of any code that is graphics-API dependent. The design allows implementations of the perspective-mapped surface display algorithm for high-level (Ogre3D, OpenSceneGraph) and low-level graphics APIs (OpenGL, Direct3D9, and Direct3D10) to leverage the base functionality, minimizing the effort required for those implementations. Derived implementations of OmniMap_Base must implement the CreateChannel and PostRender methods. The CreateChannel method creates the appropriate derivation of OmniMapChannel_Base for the derived implementation. PostRender implements the second pass of the algorithm, specifically shader loading and parameter setting, and display surface rendering. The OmniMap API also provides support for applications that supply their own render textures. This support is useful for toolkits that already have a facility for rendering to textures; for instance, OmniMapAppTextureOGL and OmniMapChannelATOGL implement this functionality for OpenGL applications.

OmniMapChannel_Base: This is the abstract class for channel implementation. It is intended to be derived along with OmniMap_Base to support a specific rendering API or framework. Its purpose is to provide the mechanism for rendering to an off-screen buffer that acts as a texture, and to provide the matrices for setting up the channel's viewing offset. Hence, this class owns the position/rotation offset matrix and the projection matrix for the channel. The OmniMapChannel_Base class supports asymmetric frustums.

OmniMapScreen_Base: This is the abstract class for implementing display surface shapes. Derived classes simply define vertex buffers that represent the geometry of the display surface; the screen geometry is rendered with classes derived from the ScreenRenderer_Base class. This class is defined such that derived classes can be implemented independently of the underlying graphics API, so a screen shape can be defined once and used by any derivations of the OmniMap classes, as well as any game engine or toolkit implementations. Derivations that implement specific screen shapes are responsible for tessellating the shape into triangles and notifying the screen renderer of the contents of that shape.

ScreenRenderer_Base: This is the abstract class for defining screen renderers. Screen renderers are responsible for rendering the shapes defined by derivations of OmniMapScreen_Base. Derived classes are graphics-API dependent, so there are derived classes for OpenGL, Direct3D9, and Direct3D10 rendering. The derived classes simply render the vertex buffers defined by classes derived from the OmniMapScreen_Base class.

OmniMapShader_Base: This is the abstract class for implementing shader support. Derived classes are responsible for loading/unloading, compiling, and setting parameters in shaders.
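The per-frame integration pattern implied by these classes is roughly the following; apart from the PostRender idea and the class roles described above, every type, method name, and signature in this sketch is a hypothetical stand-in invented purely to show the call order, and it should not be read as the actual OmniMap API.

```cpp
#include <glm/glm.hpp>

// Hypothetical stand-ins for the real OmniMap interfaces, written only to show
// the call order an application follows when integrating the two-pass algorithm.
struct ChannelLike {                                // plays the role of an OmniMapChannel_Base derivative
    virtual void beginRenderToChannel() = 0;        // bind the off-screen render target
    virtual void endRenderToChannel() = 0;
    virtual glm::mat4 viewOffset() const = 0;       // M_Oi
    virtual glm::mat4 projection() const = 0;       // M_Pi
    virtual ~ChannelLike() = default;
};
struct OmniMapLike {                                // plays the role of an OmniMap_Base derivative
    virtual int numChannels() const = 0;
    virtual ChannelLike& channel(int i) = 0;
    virtual void postRender() = 0;                  // second pass: warp screen mesh, composite channels
    virtual ~OmniMapLike() = default;
};

// Per-frame integration pattern: render the scene once per channel (pass 1),
// then let the library composite onto the warped screen mesh (pass 2).
template <typename SceneT>
void renderFrame(OmniMapLike& omnimap, SceneT& scene) {
    for (int i = 0; i < omnimap.numChannels(); ++i) {
        ChannelLike& ch = omnimap.channel(i);
        ch.beginRenderToChannel();
        scene.render(ch.viewOffset(), ch.projection());   // application draw call
        ch.endRenderToChannel();
    }
    omnimap.postRender();
}
```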
Fig. 4. Michael Somoroff’s Illumination features a downward-facing projector filling a 120◦ wrap-around panoramic display. A custom video player was created using OmniMap to take the high resolution panoramic frames (top) and warp them for projection onto the inside of a 120◦ cylindrical display (bottom left). OmniMap produces a dramatically warped but optically correct horseshoe shaped image (bottom middle) that is projected into the final installation (bottom right).
5 Conclusion
As advances in graphics processing power, surround projection technologies, and material construction techniques enable the proliferation of non-planar immersive display devices, demand for content accelerates. The authors conclude that a library for integrating the wide variety of existing interactive 3D software applications into these venues is a necessity and should be freely available. The OmniMap API is free for non-commercial use and runs on OS X and Windows. The OmniMap API provides a simple way for software developers to implement rendering into non-planar immersive display surfaces. This is evidenced by the successful integration of OmniMap into the following applications and SDKs:
– Applications
  • GeoFusion, Inc. GeoPlayer : Geospatial Visualization
  • Google's SketchUp : 3D Modeling
  • SCISS Uniview : Astronomy Visualization
  • Apple's Quartz Composer : Interactive, Visual Programming Language
  • Linden Lab's Second Life Viewer : Online Virtual World
  • VidVox's VDMX : Realtime Video Mixing and Effects Software
  • Cycling '74's MaxMSP : Realtime Video Mixing and Effects Software
  • Elumenati Video Player : High-Def Codec for Mono/Stereo Videos (Fig. 4)
– SDKs
  • OpenSceneGraph 3D Toolkit
  • Unity 3D Game Engine
  • Ogre 3D Game Engine
Recently, the authors have successfully used OmniMap to implement active and passive stereo systems as well as multiple-projector configurations, all powered by a single PC, and plan to integrate general implementations of these functions into the library. The many developers bringing exciting applications to immersive displays using OmniMap demonstrate how elegant, simple, and cost-effective projective perspective mapping systems are in their computer hardware, projector optics, and software implementation.
References 1. Blake, E.H.: The natural flow of perspective: Reformulating perspective projection for computer animation. Leonardo 23, 401–409 (1990) 2. Benosman, R., Kang, S.: A brief historical perspective on panorama. Panoramic vision: sensors, theory, and applications, 5–20 (2001) 3. McConville, D.: Cosmological cinema: Pedagogy, propaganda, and perturbation in early dome theaters. Technoetic Arts 5, 69–85 (2007) 4. Lantz, E.: A survey of large-scale immersive displays. In: EDT 2007: Proc. of the 2007 Workshop on Emerging Displays Technologies, p. 1. ACM Press, New York (2007) 5. Shaw, J., Lantz, E.: Dome theaters: Spheres of influence. In: Trends in Leisure Entertainment, pp. 59–65 (1998) 6. The Elumenati, LLC: Omnimap api, real–time geometry correction library (2008), http://www.elumenati.com/products/omnimap.html 7. Raskar, R., van Baar, J., Willwacher, T., Rao, S.: Quadric transfer for immersive curved screen displays. Comput. Graph. Forum 23, 451–460 (2004) 8. Raskar, R., van Baar, J., Beardsley, P., Willwacher, T., Rao, S., Forlines, C.: ilamps: Geometrically aware and selfconfiguring projectors. In: SIGGRAPH (2003) 9. Majumder, A., Brown, M.S.: Practical Multi-projector Display Design. A. K. Peters, Ltd., Natick (2007) 10. Elumens Corporation: The SPIClops API (2001) 11. Chen, J., Harm, D.L., Loftin, R.B., Lin, C., Leiss, E.L.: A virtual environment system for the comparison of dome and hmd systems. In: Proc. of the International Conference on Computer Graphics and Spatial Information System, pp. 50–58 (2003) 12. Bourke, P.: Low cost projection environment for immersive gaming. JMM (Journal of Multimedia) 3, 41–46 (2008) 13. Ries, B., Colucci, D., Lindquist, J., Interrante, V., Anderson, L.: VRWindow: Tech Report. Digital Technology Center, University of Minnesota (2006) 14. Greene, N., Heckbert, P.: Creating raster omnimax images from multiple perspective views using the elliptical weighted average filter. IEEE Computer Graphics and Applications, 21–27 (1986) 15. Konieczny, J., Shimizu, C., Meyer, G.W., Colucci, D.: A handheld flexible display system. IEEE Visualization, 75 (2005) 16. Segal, M., Korobkin, C., van Widenfelt, R., Foran, J., Haeberli, P.: Fast shadows and lighting effects using texture mapping. SIGGRAPH 26, 249–252 (1992) 17. Everitt, C., Rege, A., Cebenoyan, C.: Hardware shadow mapping. Technical report (2002), http://developer.nvidia.com/object/hwshadowmap paper.html 18. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns, pp. 87–95. Addison-Wesley Publishing Company, Inc., Reading (1994) 19. Ierusalimschy, R.: Programming in Lua, 2nd edn. Lua.Org. (2006) 20. Schuytema, P., Manyen, M.: Game Dev. with LUA. Charles River Media (2005)
Two-Handed and One-Handed Techniques for Precise and Efficient Manipulation in Immersive Virtual Environments Noritaka Osawa National Institute of Multimedia Education
Abstract. Two-handed control techniques for precisely and efficiently manipulating a virtual 3D object by hand in an immersive virtual reality environment are proposed. In addition, one-handed and two-handed techniques are described and comparatively evaluated. The techniques are used to precisely control and efficiently adjust an object with the speed of one hand or the distance between both hands. The controlled adjustments are actually position and viewpoint adjustments. The results from experimental evaluations show that two-handed control methods that are used to make the position and viewpoint adjustments are the best, but the simultaneous use of both one-handed and two-handed control techniques does not necessarily improve the usability.
1 Introduction

An immersive virtual environment supporting direct manipulation by hand could offer a novice user a more familiar and simpler way to manipulate a virtual object. For example, changing the position of a virtual object through hand manipulation in an immersive space is similar to manipulating an actual object in the real world, so less specialized knowledge and fewer technical skills are required of a user. Suppose we want students in a class to experience and understand a theme through interaction using an immersive virtual environment for hands-on learning. Most such students would be novice users of immersive virtual environments and unfamiliar with 3D pointing devices such as a cubic mouse [3]. Generally speaking, it takes time to master the use of 3D pointing devices. This would make it difficult for many of the students to use the virtual environment, because the long training time would decrease the time that could be used for hands-on learning within a fixed course period. If the students were unable to manipulate a virtual object or a virtual space as they like, the hands-on learning would be ineffective. For broader acceptance and more effective use of immersive virtual environments, it is thus necessary for casual users to be able to precisely and efficiently manipulate virtual objects and virtual space without a long training period. While hand manipulation is easy to understand and use for approximate positioning, it has been considered unsuitable for making precise adjustments to virtual objects in an immersive environment, because it is not easy for people to precisely hold their hand in midair without physical support, and it is also hard to release a virtual object in a precise position because the releasing action often causes the hand to move.
However, we believe that direct hand interaction techniques should be further studied in order to provide more effective and intuitive interfaces for immersive virtual environments. This research studied control and adjustment methods to overcome the weak points of direct hand manipulation. Our previous papers [7],[8] proposed and described enhanced hand manipulation methods using only one hand for adjustment control. These adjustment methods were a release adjustment, position adjustment, viewpoint adjustment, and hand-size adjustment. They are briefly described in Sec. 3. The previous papers also reported the results of experiments using these adjustment methods with a one-handed control. This paper proposes the use of two hands to control adjustments and comparatively evaluates the one-handed (previously proposed) and two-handed (newly proposed) controls through experimentation. The results from the experimental evaluations showed that the methods including two-handed control for making position and viewpoint adjustments (including hand-size adjustment and release adjustment) were the best from all of the methods tested, and these methods were the most preferred in a subjective evaluation.
2 Related Work

Many studies have been done on 3D interaction techniques supporting hand manipulation, such as the silk cursor [11], go-go interaction [10], ray-casting interaction [1], body-relative interaction [5], image plane interaction [9], and scaled manipulation [2]. These use a handheld device, such as a stylus, to manipulate a virtual object. The silk cursor, go-go interaction, and ray-casting interaction are mainly for selecting objects, not manipulating them. In contrast, the target manipulation of our methods is for accurately and efficiently placing a virtual object at a desired position. One form of body-relative interaction includes the automatic scaling of the world. Body-relative interaction uses automatic scaling to manipulate objects outside the user's reach, whereas our methods use automatic scaling for precisely manipulating objects. Although image plane interaction techniques can be used for object selection, object manipulation, and user navigation, the user has to select a mode that is appropriate for his/her purpose. This complicates the use of these techniques for novice users. PRISM [2] uses speed-dependent techniques but controls only the position, whereas our proposed adjustment methods not only control the position, but also the viewpoint, the size, and the release location. Moreover, the adjustment model for PRISM is different from that of the stage model proposed in our study. Guiard [4] presented a theoretical framework for the study of asymmetry in the context of human bimanual action. The use of both hands could allow precise and efficient manipulation of virtual objects. In this paper, we propose and evaluate a two-handed control in which one hand is used for positioning and releasing and the other hand is used for adjustment control; that is, the use of the hands is asymmetric. Our proposed techniques enhance direct hand interaction without a handheld device for precise positioning by the use of both hands. The techniques provide more effective and intuitive interfaces for novice users without the need for a long training time. Moreover, hand gestures with the fingers can easily be incorporated to give
commands to a virtual system employing the proposed techniques. It is difficult to use such hand gestures when holding a device.
3 Adjustment Methods

Two difficulties arise after a virtual object is picked up by hand in an immersive virtual environment. One is the difficulty of moving a virtual object to a precise position (precise positioning), while the other is the difficulty of releasing it at a precise position (precise releasing). The proposed automatic adjustments include a position adjustment, a viewpoint adjustment (which includes the virtual hand-size adjustment we presented in our previous paper), and a release adjustment. The position and viewpoint adjustments adjust the virtual hand position and the viewpoint, respectively, for precise positioning; both are controlled by scale factors. The release adjustment adjusts the release timing for precise release. This section briefly describes the basic mechanisms of these adjustments.

3.1 Position Adjustment

The position adjustment method adjusts the position of the virtual hand to enable precise positioning. The adjusted position Pa is expressed by
$$P_a = P_a' + F_d (P_m - P_a') + F_r (P_m - P_m'),$$
where Pa' is the adjusted position from the last calculation, Pm is the measured hand position, Pm' is the hand position from the last measurement, Fd is a displacement scale factor (or offset recovery scale factor), and Fr is a relative position scale factor. Fr is used to reduce the movement of the hand during precise positioning, while Fd is used to recover the offset between the adjusted and measured positions. The relationships between the measured and adjusted positions are shown in Fig. 1.

3.2 Viewpoint Adjustment

The viewpoint adjustment method is also used for precise positioning. The viewpoint scale factor Fv controls the adjustment. When the viewpoint is not adjusted, Fv equals one. As Fv is decreased, the viewpoint approaches the point where the virtual object is grasped (the grabbing point). This viewpoint movement enlarges the scene containing the virtual object. Figure 2 illustrates the viewpoint adjustment. Enlargement is useful for precise positioning when a user can precisely control his hand. We often use enlargement in our daily lives for precise manipulation: people often use a magnifying glass to manipulate a precision machine such as a mechanical watch or to do delicate embroidery, and a surgeon uses a microscope for microscopic surgery. These examples show that enlargement is useful and important for precise manipulation. The virtual hand size is also controlled by the viewpoint scale factor Fv to keep the apparent size of the hand constant. Using only the viewpoint change enlarges the apparent size of the hand, which can prevent the user from seeing the virtual object; combining the viewpoint and size changes keeps the apparent hand size constant.
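A direct C++ transcription of the position-adjustment update might look like the following sketch (using GLM vectors for brevity); the data layout is illustrative, not taken from the paper, and the viewpoint adjustment would analogously interpolate the camera toward the grabbing point by Fv.

```cpp
#include <glm/glm.hpp>

// Position-adjustment update of Sect. 3.1: Pa = Pa' + Fd*(Pm - Pa') + Fr*(Pm - Pm').
// In the normal stage Fd = 1 and Fr = 0 (the virtual hand follows the measured
// hand); in the precise stage Fd = 0 and Fr = 1/3 (hand motion is scaled down).
struct HandState {
    glm::vec3 adjusted;       // Pa': adjusted position from the last calculation
    glm::vec3 measuredPrev;   // Pm': measured position from the last calculation
};

glm::vec3 adjustPosition(HandState& h, const glm::vec3& measured, float Fd, float Fr) {
    const glm::vec3 pa = h.adjusted
                       + Fd * (measured - h.adjusted)       // recover accumulated offset
                       + Fr * (measured - h.measuredPrev);  // scaled-down relative motion
    h.adjusted = pa;
    h.measuredPrev = measured;
    return pa;
}
```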
Fig. 1. Relationships between measured positions and adjusted positions
Fig. 2. Viewpoint adjustment
3.3 Release Adjustment

In the release adjustment, when a virtual object is quickly released (that is, when the thumb and fingers are quickly opened), the virtual object is returned to the position that seems to be the intended release position. The release adjustment for precise releasing is based on the relative speed between the thumb and the pinching finger. Our previous work showed that, in terms of completion ratios and subjective evaluation, the release adjustment method was clearly beneficial. Please refer to our previous papers for more detail concerning the release adjustment and the results of its evaluation [8]. In this work, the release adjustment was used in all methods.
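The release adjustment can be pictured roughly as in the sketch below; the exact rule is given in the authors' earlier papers rather than here, so the look-back buffer, the speed threshold, and every constant in this sketch are invented placeholders that merely convey the idea of snapping back after a fast release.

```cpp
#include <cstddef>
#include <deque>
#include <glm/glm.hpp>

// Illustration only: if the thumb and finger separate faster than a threshold,
// the object snaps back to where it was a short time before the release gesture.
struct ReleaseAdjuster {
    std::deque<glm::vec3> history;                 // recent grasped-object positions
    static constexpr std::size_t lookBack = 10;    // frames to rewind (assumed)
    static constexpr float fastOpenSpeed = 0.3f;   // m/s opening speed (assumed)

    void recordFrame(const glm::vec3& objectPos) {
        history.push_back(objectPos);
        if (history.size() > lookBack) history.pop_front();
    }
    glm::vec3 positionAtRelease(const glm::vec3& current, float thumbFingerSpeed) const {
        if (thumbFingerSpeed > fastOpenSpeed && !history.empty())
            return history.front();                // fast release: rewind to earlier position
        return current;                            // slow release: keep the current position
    }
};
```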
4 Control Methods

Users need to control the application of the adjustments. The control methods influence the usability of the adjustments and the performance of precise and efficient manipulation. Our previous studies [7],[8] used a speed control method that used only one hand for adjustments; the position and viewpoint adjustments are based on hand speed. One assumption underlying the one-handed control method for precise positioning is that the hand moves slowly when the user wants to precisely manipulate a virtual object. In contrast, this study proposes a distance control method that uses both hands for precise positioning. The distance between the right hand and the left hand controls the adjustments. When the distance between both hands is small, the adjustments are
activated. In other words, bringing the non-dominant hand close to the dominant hand activates the adjustments. This can be considered analogous to using one hand to steady or stop an object held by the other hand.

4.1 Stage Model Using Hysteresis

The subjects in the previous studies on speed control disliked a linear change of the viewpoint or scale factors, because the scene changed frequently or was constantly changing. Therefore, a stage model was used for the adjustments. In the simplest case, two stages were used, referred to as the precise stage and the normal stage. In the precise stage, the adjustments for precise positioning were activated; in the normal stage, these adjustments were deactivated.

A simple way to change the stage would be to use a threshold. Consider the speed control first. When the speed exceeds a threshold, the stage is normal; otherwise, it is precise. This threshold method is simple but not desirable, because the stage could fluctuate rapidly if the speed varies near the threshold. To avoid such rapid fluctuation, a form of hysteresis was introduced, which requires two thresholds. In this context, hysteresis means that the stage does not instantly follow the speed, but depends on its immediate history. When the speed is below the lower threshold, the precise stage is entered. This stage is then maintained until the speed exceeds the higher threshold, at which time the stage is reset to normal. A stage model with hysteresis is illustrated in Figure 3.

The precise stage should be entered only when precise positioning is needed, and it should be maintained until canceling actions become apparent. Therefore, the thresholds were chosen so as to prevent frequent changes in the scale factors. For the one-handed control, the lower (Sp) and higher (Sn) thresholds were set at 1.5 and 300 mm/s, respectively. The log records from our pilot study showed that about 60% of the movements during a precise positioning task were slower than 1.5 mm/s, and when the speed was below Sp, it was rare for precise positioning not to be needed; therefore, 1.5 mm/s was chosen as Sp. Since the speed rarely exceeded 300 mm/s during precise positioning and the occasional movements for hand rests, the precise stage could be turned off after the hand moved faster than 300 mm/s, which was thus chosen as Sn. Similarly, for the two-handed control, the lower (Dp) and higher (Dn) thresholds were set at 300 and 400 mm, respectively. When the distance between both hands is smaller than Dp, the precise stage is entered; when the distance exceeds Dn, the stage is reset to normal. Small fluctuations in hand movement do not affect the stage stability.

4.2 Scale Factors and Parameters

In the precise stage, the scale factor values were Fd = 0, Fr = 1/3, and Fv = 1/3, whereas the values Fd = 1, Fr = 0, and Fv = 1 were used in the normal stage. Figure 4 shows viewpoint changes in the two stages. Some smoothing was adopted in order to avoid the sudden changes that some subjects disliked when the stage was changed. When the viewpoint adjustment was applied, transient animation was used.
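The stage switching itself reduces to a few lines; the sketch below is a generic hysteresis update in C++, with the thresholds passed in so the same function serves the speed-controlled and distance-controlled variants.

```cpp
// Two-stage model with hysteresis (Sect. 4.1).  The control value is the hand
// speed (one-handed control, mm/s) or the inter-hand distance (two-handed
// control, mm).  The stage flips to Precise below the lower threshold and back
// to Normal only above the upper one, so values inside the band cannot make
// the stage flicker.
enum class Stage { Normal, Precise };

Stage updateStage(Stage current, float value, float lower, float upper) {
    if (value < lower) return Stage::Precise;
    if (value > upper) return Stage::Normal;
    return current;   // inside the hysteresis band: keep the current stage
}
```

With the thresholds above this becomes updateStage(stage, handSpeed, 1.5f, 300.0f) for the speed control and updateStage(stage, handDistance, 300.0f, 400.0f) for the distance control; entering the precise stage switches the scale factors to Fd = 0, Fr = 1/3, and Fv = 1/3.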
[Figure content: the scale factors plotted against speed/distance, with the normal stage (Fd = 1, Fr = 0, Fv = 1) above the higher threshold (Sn/Dn) and the precise stage (Fd = 0, Fr = 1/3, Fv = 1/3) below the lower threshold (Sp/Dp).]

Fig. 3. Stage model with hysteresis for speed and distance control

Fig. 4. Normal stage (a) and precise stage with viewpoint adjustment (b)
5 Experiment

We conducted an experiment to evaluate the usefulness of various combinations of the control and adjustment methods for manipulating virtual objects in an immersive environment.

5.1 Environment and Settings

The experiment was conducted in a surround-display virtual environment that uses immersive projection technology. It has a large cubic screen, each face of which is 3 by 3 m. It also has a passive stereo system, which employs circular polarization to provide a stereoscopic view to users. One stereoscopic face was used in the experiment in order to evaluate the adjustment methods in a simple and typical VR environment. The subject stood upright without physical support. Figure 5 shows the experimental settings. A PC-based system was used in the experiment. The system ran on a PC workstation (Dell Precision 530 with dual 2-GHz Pentium 4 Xeon processors and a 3DLabs Wildcat II 5110 graphics board supporting dual displays).
A six-DoF position tracker (Polhemus Fastrak) and sensor gloves (Virtual Technologies CyberGlove) were used to detect the position and motion of the user’s body and hands. The experimental software was developed using the Java programming language, the Java 3D class library, and the it3d class library for interactive 3D applications [6]. The sizes and positions were specified in a coordinate system of the position-tracker space, that is, the physical space. The origin (0,0,0) of the position-tracker space was 1200 mm above the center of the floor screen face. The radius of the control sphere was 15 mm. The target sphere had a radius of 17 mm (Size M) or 16 mm (Size S). My group’s previous studies showed that with sizes larger than 17 mm, adjustments are not usually needed to complete the task. A sphere was chosen as the shape of the control and target objects because orientation was not considered in the experiment. In order to avoid motion sickness in the experiment, a viewpoint camera was placed at the fixed position (0,0,400) (unit: mm) and oriented along the –Z axis direction in the position-tracker space when Fv = 1. In other words, head tracking was not used in the experiment. The angular field of view of the camera was 90°.

5.2 Subjects

Thirty people (21 male, 9 female) took part in the experiment. They were all undergraduate students. They had no experience of using sensor gloves in immersive virtual environments; in other words, they were all novice users. The subjects were paid for their participation in the experiment. They performed the task using the methods in different orders.

5.3 Task

The subjects were asked to repeatedly place a control sphere inside a translucent target sphere within a specific period. The basic task in the experiment is illustrated in Figure 6. The experiment measured the number of completions within the specific period. The period of each trial was one minute. The period during which the virtual object was grasped was counted as part of the trial period. The forefinger and thumb were used to grasp the virtual object; no other fingers were used. The initial positions of the control and target spheres were randomly generated within a cubic space whose diagonal vertices were at (-150, -150, -150) and (150, 150, 150) (unit: mm). When the control sphere was released inside the target sphere and one task was completed, the positions of the control and target spheres were changed for a new task. When an adjustment for precise positioning was applied, the virtual hand was colored light sea green or light slate blue for the speed control and the distance control, respectively, to explicitly show the adjustment status. When precise positioning was applied by both control methods, the hand was colored light steel blue. When the control sphere was placed inside the target sphere, the control sphere turned aqua blue, indicating to the subject that the control sphere was within the target region.
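The paper does not spell out how completion is detected geometrically, but for spheres it reduces to a simple containment test. The sketch below is our own assumption of how such a check could be implemented, using the radii given above.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

// Euclidean distance between two points in position-tracker coordinates (mm).
double distance(const Vec3& a, const Vec3& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// The control sphere lies entirely inside the target sphere when the distance
// between the centers plus the control radius does not exceed the target radius.
// With a 15 mm control sphere and a 16 or 17 mm target, the centers must be
// within 1-2 mm of each other, which is why adjustments for precise positioning
// matter for these target sizes.
bool isInsideTarget(const Vec3& controlCenter, double controlRadius,  // 15 mm
                    const Vec3& targetCenter,  double targetRadius) { // 16 or 17 mm
    return distance(controlCenter, targetCenter) + controlRadius <= targetRadius;
}
```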
[Figure content: the solid-colored control sphere being placed inside the translucent target sphere, with the X, Y, and Z coordinate axes of the position-tracker space.]

Fig. 5. Experimental settings

Fig. 6. Experimental task
5.4 Combination of Control and Adjustment Methods

Important combinations of control and adjustment methods were used for a comparative evaluation. The control methods used for the evaluation were the one-handed (or speed) control method (referred to as one) and the two-handed (or distance) control method (referred to as two). Both control methods could also be used simultaneously in the experiment; this was referred to as one+two. The adjustment methods used were no adjustment for precise positioning or release adjustment only (referred to as base), position adjustments (referred to as pos), and position and view adjustments (referred to as pos+view). The selection of the combinations was based on the results from previous experiments. The following six combinations were used in the experiment: (0) base, (1) one:pos, (2) one:pos+view, (3) two:pos, (4) two:pos+view, and (5) one+two:pos+view.

5.5 Procedure

The functions of the system and the task were explained to the subjects, and each subject was given practice tasks in order to learn how to use the sensor gloves and the control and adjustment methods. These practice sessions were followed by data collection sessions. Using the six methods, the subjects performed the tasks first once with a target of Size M and then twice with a target of Size S. After they had finished testing each method, the subjects were asked to rate that method on a scale of 1 to 5 (where 1 indicated the lowest preference and 5 the highest). They also rated the adjustment methods and the control methods separately.
6 Results

6.1 Performance

Figure 7 shows the average number of completions within the trial period, with standard deviation bars. An analysis of variance (ANOVA) was used to analyze the number of completions for each method. The ANOVA showed that there were
significant differences among the performances of the methods: for Size M, F(5,174) = 19.1, p < 0.00001; for Size S, F(5,354) = 17.7, p < 0.00001. The least significant difference (LSD) multiple comparison test (all p < 0.05) showed that for Size M the differences between all method group means were significant, except for the group pairs (1) one:pos and (2) one:pos+view, (1) one:pos and (3) two:pos, and (2) one:pos+view and (3) two:pos. For Size S, all differences were significant except for the group pairs (1) one:pos and (3) two:pos, (2) one:pos+view and (3) two:pos, and (4) two:pos+view and (5) one+two:pos+view. The position and viewpoint adjustments improved the number of completions. The improvements enabled by the position and viewpoint adjustment methods were clearly demonstrated for small targets. This is consistent with our previous results. The results showed that the position and viewpoint adjustments including the two-handed control, that is, (4) two:pos+view and (5) one+two:pos+view, were the best of all the methods tested.
[Figure content: bar chart of the average number of completions (scale 0-6) for Size M and Size S under the methods (0) base, (1) one:pos, (2) one:pos+view, (3) two:pos, (4) two:pos+view, and (5) one+two:pos+view.]

Fig. 7. Average No. of completions
6.2 Preference

The subjective ratings are shown in Figure 8, with standard deviation bars. The methods that used adjustments for precise positioning were rated higher than the (0) base method. The (4) two:pos+view method was found to be the best, and the (5) one+two:pos+view method was regarded as second best. These results support the above-mentioned experimental performance evaluation. ANOVA was applied to analyze the ratings of the methods. It showed that there were significant differences among the ratings of the methods: F(5,174) = 19.1, p < 0.00001. Based on the LSD multiple comparison test, the differences of the means in all group pairs were significant, except for the group pairs (2) one:pos+view and (3) two:pos and (4) two:pos+view and (5) one+two:pos+view. Figure 9 shows the average ratings of the subjective evaluations of the respective adjustment methods and the respective control methods. The use of both the position and viewpoint adjustments was preferred to the base or position-only adjustment. The use of both control methods was not necessarily preferred over the single use of the two-handed control method.
[Figure content: bar chart of the average subjective ratings (scale 1-5) of the methods (0) base, (1) one:pos, (2) one:pos+view, (3) two:pos, (4) two:pos+view, and (5) one+two:pos+view.]

Fig. 8. Average ratings of subjective preference of methods (1 = lowest preference, 5 = highest preference)
[Figure content: bar charts of the average ratings (scale 1-5) for (a) the adjustment methods A. base, B. pos, C. pos+view and (b) the control methods X. one, Y. two, Z. one+two.]

Fig. 9. Average ratings of subjective preference of methods (1 = lowest preference, 5 = highest preference)
7 Discussion and Future Work

The option of using both the one-handed and two-handed control methods improved the subjects' freedom of choice but did not necessarily improve the usability. The use of both control methods did not significantly improve the performance for small targets, and their combined use was not preferred over the single use of the two-handed control. When both control methods were used simultaneously, unintended adjustments were observed in some cases. For example, one subject tried to control the adjustments using both hands. He brought his hands close together and adjusted the position, but then wanted to turn off the viewpoint adjustment to check the overall situation and therefore moved his hands apart. However, the viewpoint adjustment was not turned off, because both hands were moving slowly and the adjustment was still activated by the one-handed control. In this case, he spent some time trying to understand the situation and had to move his hands faster. This is a negative effect of using both control methods, and further study is needed to reduce it. In this experiment, we investigated the effect of the combined adjustment and control methods on only the translational positions of the virtual objects.
We now plan to study the effect of automatic adjustments on the rotation of virtual objects using two hands.
8 Conclusion

The proposed control methods for automatic adjustments were experimentally evaluated. The results showed that, in terms of the number of completions and the subjective evaluation, the two-handed control methods for the position and viewpoint adjustments were the best of all the methods tested. Moreover, the results showed that the simultaneous use of both one-handed and two-handed controls did not necessarily improve the usability.
References

1. Bowman, D., Hodges, L.: An evaluation of techniques for grabbing and manipulating remote objects in immersive virtual environments. In: Symposium on Interactive 3D Graphics, pp. 35–38. ACM, New York (1997)
2. Frees, S., Kessler, G.D.: Precise and Rapid Interaction through Scaled Manipulation in Immersive Virtual Environments. In: Proc. of IEEE Virtual Reality 2005, pp. 99–106 (2005)
3. Fröhlich, B., Plate, J.: The cubic mouse: a new device for three-dimensional input. In: Proc. of the CHI 2000 Conference on Human Factors in Computing Systems, pp. 526–531 (2000)
4. Guiard, Y.: Asymmetric Division of Labor in Human Skilled Bimanual Action: The Kinematic Chain as a Model. Journal of Motor Behavior 19, 486–517 (1987)
5. Mine, M., Brooks, F., Sequin, C.: Moving objects in space: exploiting proprioception in virtual-environment interaction. In: SIGGRAPH 1997, pp. 19–26. ACM, New York (1997)
6. Osawa, N., Asai, K., Saito, F.: An Interactive Toolkit Library for 3D Applications: it3d. In: Proceedings of 8th Eurographics Workshop on Virtual Environments, pp. 149–157 (2002)
7. Osawa, N.: Enhanced Hand Manipulation for Efficient and Precise Positioning and Release. In: Proceedings of 9th Int. Workshop on Immersive Projection Technology, 11th Eurographics Workshop on Virtual Environments, Eurographics, pp. 221–222 (2005)
8. Osawa, N.: Automatic adjustments for efficient and precise positioning and release of virtual objects. In: Proceedings of ACM SIGGRAPH International Conference on Virtual Reality Continuum and Its Applications (VRCIA 2006), pp. 121–128 (2006)
9. Pierce, J., et al.: Image plane interaction techniques in 3D immersive environments. In: Symposium on Interactive 3D Graphics, pp. 39–43. ACM, New York (1997)
10. Poupyrev, I., Billinghurst, M., Weghorst, S., Ichikawa, T.: Go-Go Interaction Technique: Non-Linear Mapping for Direct Manipulation in VR. In: ACM UIST 1996, Seattle, WA, pp. 79–80 (1996)
11. Zhai, S., et al.: The Partial Occlusion Effect: Utilizing Semi-transparency in 3D Human Computer Interaction. ACM Transactions on Computer Human Interaction 3(3), 254–284 (1996)
Automotive Spray Paint Simulation
Jonathan Konieczny1, John Heckman2, Gary Meyer1, Mark Manyen2, Marty Rabens2, and Clement Shimizu1
1 Dept. of Computer Science and Engineering, University of Minnesota
[email protected]
2 Johnson Virtual Reality Lab, Pine Technical College
Abstract. A system is introduced for the simulation of spray painting. Head mounted display goggles are combined with a tracking system to allow users to paint a virtual surface with a spray gun. Ray tracing is used to simulate droplets landing on the surface of the object, allowing arbitrary shapes and spray gun patterns to be used. This system is combined with previous research on spray gun characteristics to provide a realistic simulation of the spray paint including the effects of viscosity, air pressure, and paint pressure. The simulation provides two different output modes: a non-photorealistic display that gives a visual representation of how much paint has landed on the surface, and a photorealistic simulation of how the paint would actually look on the object once it has dried. Useful feedback values such as overspray are given. Experiments were performed to validate the system.
1 Introduction
Training spray painters to apply modern paints can be an expensive process. Paints can vary significantly in how they must be applied to a surface, forcing painters to vary spray gun settings, speed of spray gun movement, and distance of the gun from the surface of the object. Therefore, training new painters can be costly in both time spent training the painter and in the amount of paint used. Even experienced painters may need to re-train for newly formulated paints that require careful application techniques to achieve proper results. When working with real paints, this training must be performed in an expensive spray booth with both the instructor and the pupil wearing protective clothing and bulky respirator masks. The goal of the virtual reality system described in this paper is to aid in training painters to use spray paints, thereby reducing the amount of paint and time wasted in training. In addition, this system can be used by paint designers to determine how difficult a new paint would be to spray, without having to actually manufacture and test the paint. The system provides users with many
useful features, including: photorealistic and non-photorealistic visualization of paint thickness, numeric feedback on overspray and other relevant variables, customization of spray gun settings, and a realistic spray painting environment. In addition, the system has been validated with user testing on real spray painters.
2 Relevant Work
The two most relevant pieces of research describe basic spray paint simulation for the ship building industry [Yang et al. 2007] [Kim et al. 2007]. Yang et al. employ a tracked spray gun and head tracker to place the user in a virtual environment. The user then sprays the virtual object with the spray gun and gets feedback on the resulting paint thickness. Kim et al. use a spray paint simulation that employs ray casting and a flood fill algorithm to fill the texture pixels near the point where the ray strikes the surface. Our algorithm makes use of pre-computation of texture density to perform “splatting” in constant time (see Section 3.1), and the approach described in this paper supports arbitrary object shapes. The research and resulting system described in this paper go beyond the above systems in several ways. First, a realistic paint color model was added, allowing users to view a photorealistic simulation of the resulting paint job, including a realistic lighting environment, rather than just a paint density texture map. Second, parameters that can change how paint must be applied, such as the air/paint ratio and paint viscosity, are modeled in the simulation. Finally, user testing with the system was performed to both improve it and validate it as a training tool. The photorealistic paint simulation used in the system was originally developed by Shimizu et al. [1]. The technique for paint simulation given in that research was altered to work with the other portions of the spray painting simulation (see Section 3.3). The method for capturing environment map lighting was first created by Debevec et al. [2], in which multiple pictures are taken at varying exposures to capture a high dynamic range photograph of the surrounding environment.
3 Setup and Spray Paint Simulation
Figure 1 shows the critical components of the virtual spray paint system. A tracked head mounted display allows the user to navigate around a virtual environment. The user holds a tracked “spray gun” that is used to spray paint objects placed in the environment. The system currently works with any head mounted display and either a magnetic or optical tracking device. An nVis 1280x1024 resolution head visor along with a HiBall optical tracker was used for testing the system.
3.1 Simulation of Paint Particles
A natural method to simulate spray painting is ray casting, because casting a ray toward a virtual surface is very similar to a paint particle striking a real surface. However, since any drop below a real time frame rate could result in improper training, calculating a ray cast for every paint particle is computationally infeasible.
Fig. 1. Left: Photorealistic rendering (as seen in headset) of directionally diffuse paint on a car hood. Note that some minor artifacts can be seen from too few rays being used to simulate the gun. Right: The final result of the painted car hood after a gloss coat has been applied.
Fortunately, a good compromise between realism and rendering speed can be achieved by firing fewer rays and having each ray that strikes the surface spread paint over a reasonable area. Thus, each particle fired from the spray gun is intersected with the virtual object using a ray cast calculation and “splats” onto the surface, much as a real particle of paint would do. Varying the size of the splat allows more or fewer rays to be used, allowing a balance between realism and rendering speed. The first step in the spray simulation is to sample the mesh at load time to determine the uv density: the area of the uv triangle (determined by the texture coordinates and texture size) divided by the area of the 3D triangle (determined by the 3D position coordinates). The uv density of each triangle is then stored for later use in paint “splatting.” When the spray gun’s trigger is pressed, a number of rays are generated, each with its origin at the spray gun’s nozzle tip and a direction chosen within the shape of the spray cone. Each ray is tested against the virtual object to determine the intersection location, both in 3D space and in uv texture space using barycentric coordinates [3]. After the precise intersection point has been determined, the paint density on the affected portions of the object must be updated. The paint density is stored as a texture map across the surface of the object. In addition to the precise texture coordinate that each ray hits, a “splat” is performed to affect nearby texels as well, based on the pre-computed texture density described above. The splat size is based on a world coordinate area, which is then translated to a size in texels based on the pre-computed uv density (rounded to the nearest texel). Once it has been determined which texels in the density map should be updated, the precise amount to increase the value of each texel must be calculated. This quantity is the amount of paint represented by the ray multiplied by the
percentage of the total splat area that the texel represents. The amount of paint each ray represents is based on many factors, including the total number of rays being cast, the characteristics of the gun being used, and the distance from the gun to the object (paint is lost as particles travel from the gun to the object). See Section 4 for details on the effects of distance and gun settings on the amount of paint that reaches the object surface. In the current system, rays are cast 30 times per second, and both the splat size and the number of rays to be cast are user-set parameters. If too few rays are cast and/or the splat size is too small, the visual effect of the spray going onto the surface of the object can become “splotchy.” This can be seen in Figure 1. The exact number of rays and splat sizes required to prevent this appearance varies with the size of the area the gun is spraying at any given moment, which is a function of the spray gun settings and the distance of the gun from the surface. The number of rays that can be cast per frame while maintaining 30 frames per second varies with the number of polygons in the paintable object. In practice, an object with 10,000 polygons can be run at about 500 rays per frame on a Pentium 4 2.8 GHz single-core processor, while a 10-polygon flat panel model will run at about 2000 rays per frame. The splat size is then scaled appropriately to generate an even appearance on the surface. Generally, the splat size can be kept to just one neighboring texel with acceptable visual results. However, larger splats may be necessary for high polygon count models (high-cost ray casting) or large textures (high uv density). Using the above approach, a density map is built up on the virtual object representing the thickness of the paint at each point on that object. This can then be used to give feedback to the user on how well the painting has been performed.
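A minimal sketch of the deposition step is given below. The square splat footprint, the even per-texel weighting, and all names are our own simplifications for illustration; the paper only specifies that each texel receives the ray's paint multiplied by the texel's share of the splat area.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Paint thickness stored as a texture map over the surface (Section 3.1).
struct DensityMap {
    int width, height;
    std::vector<float> thickness;                 // one value per texel
    float& at(int u, int v) { return thickness[v * width + u]; }
};

// uvDensity is precomputed per triangle at load time:
// (triangle area in texels) / (triangle area in world units).
void splat(DensityMap& map, int hitU, int hitV,  // texel hit by the ray
           float paintPerRay,                    // paint carried by this ray
           float splatAreaWorld,                 // splat size in world coordinates
           float uvDensity)                      // texels per unit of world area
{
    // Translate the world-space splat area into a radius in texels.
    const float kPi = 3.14159265f;
    float splatAreaTexels = splatAreaWorld * uvDensity;
    int radius = std::max(1, (int)std::lround(std::sqrt(splatAreaTexels / kPi)));

    // Distribute the ray's paint evenly over the covered texels.
    int side = 2 * radius + 1;
    float perTexel = paintPerRay / (side * side);
    for (int dv = -radius; dv <= radius; ++dv)
        for (int du = -radius; du <= radius; ++du) {
            int u = hitU + du, v = hitV + dv;
            if (u >= 0 && u < map.width && v >= 0 && v < map.height)
                map.at(u, v) += perTexel;
        }
}
```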
3.2 Non-photorealistic Display Algorithm
The first method of user feedback is a non-photorealistic (NPR) display. This method takes the thickness data from the texture map and visualizes it in a manner that allows the user to immediately judge how thickly and how uniformly the paint has been applied. This is an excellent way to provide training information to new painters, or to discover defects in an existing paint procedure. The algorithm for performing this is relatively simple. A 1D texture of colors is created and passed into the shader. The thickness data stored by the paint particle simulator described in Section 3.1 is used to index into the 1D texture and retrieve the proper color for that texture pixel. The 1D texture currently used in the system is a common cold-hot color scale, ranging from blue to yellow to red as the paint becomes thicker. Areas that are light blue have too little paint, deep blue the correct amount, yellow warns that the paint is about to become too thick, and finally red indicates that too much paint has been applied. The rate at which the thickness moves through these colors can be controlled by a script, allowing users to easily set the proper paint thickness for a particular paint being simulated. See Figure 2 for an example of the NPR display algorithm at work.
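Conceptually, the lookup amounts to a simple thickness-to-color mapping; the following C++ sketch uses hypothetical thresholds (in the real system the breakpoints come from the per-paint script, and the lookup runs as a 1D texture fetch in a shader).

```cpp
struct RGB { float r, g, b; };

// Cold-hot scale of the NPR display: light blue (too little paint), deep blue
// (correct amount), yellow (about to become too thick), red (too thick).
// The threshold values passed in are illustrative only.
RGB thicknessToColor(float thickness, float correctMin, float correctMax,
                     float warnMax)
{
    if (thickness < correctMin) return {0.6f, 0.8f, 1.0f};  // light blue
    if (thickness < correctMax) return {0.0f, 0.0f, 0.8f};  // deep blue
    if (thickness < warnMax)    return {1.0f, 1.0f, 0.0f};  // yellow
    return {1.0f, 0.0f, 0.0f};                              // red
}
```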
Fig. 2. An NPR rendering with a shape that has been painted using a range of thicknesses. A cold-hot color visualization scheme has been used to show the user the thickness of the paint. Here, the white region (normally yellow) separates the region of too thick (light, normally red) from properly painted (dark, normally blue).
3.3 Photorealistic Display Algorithm
In addition to the NPR algorithm, a photorealistic rendering algorithm was implemented. This algorithm is a modification of a metallic car paint simulator described in [1]. The metallic paint simulator allows a user to design a paint color and then displays it on a model using environment map based lighting. For the spray painting simulation, a couple of modifications were made. First, the gloss portion of the simulation is separated from the directionally diffuse color. This allows a user of the virtual system to realistically spray the diffuse color of the metallic paint before applying the final gloss coat, just as a real spray painter does. Second, the simulation was modified to permit the paint to become “lighter” or “darker” based on the thickness of the paint (Beer’s law). This allows the paint to appear more realistic as it is being applied in real time. A similar effect can be achieved with the gloss, allowing the surface to appear more or less glossy based on how much gloss has been applied. Figure 1 shows this rendering with both a partially complete diffuse coating and a fully painted object. Another important aspect of displaying a realistic simulation to a painter is to place them in a familiar environment. Modern paint shops have paint booths designed to give the painter a lighting setup best suited to showing any defects in the paint job. Therefore, we have taken care to capture a real paint booth environment using high dynamic range photographs, which is used as the environment in which to paint (see Figure 3). The painted object is also properly lit using this environment [2]. The system also allows users to input their own lighting environments if they wish.
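The thickness-dependent darkening can be modeled with a Beer-Lambert style attenuation; the sketch below shows one plausible form, where the absorption coefficient is a per-paint constant. Both the coefficient and the exact way the system combines this factor with the shading are our assumptions.

```cpp
#include <cmath>

// Transmitted intensity falls off exponentially with the optical path length
// through the paint layer (Beer's law): I = I0 * exp(-absorption * thickness).
float attenuateByThickness(float baseIntensity, float thickness, float absorption)
{
    return baseIntensity * std::exp(-absorption * thickness);
}
```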
Fig. 3. The model of the spray booth. Both a model and environment map have been constructed. This allows a painter to use the virtual system in an environment that is familiar to him/her.
4 Spray Paint Parameters
Simply allowing a user to spray a surface with a virtual spray gun and observe the resulting paint thickness leaves out a critical fact: not all paints spray onto a surface in the same way. Changing the spray gun can also have a dramatic effect on the application of the paint. Factors such as air pressure, paint pressure, and viscosity of the paint must be taken into account when determining the final appearance of the painted object. For instance, paints with more solid particles (higher viscosity) tend to travel further, resulting in more paint landing on the target compared to a paint with lower viscosity. However, a much higher paint pressure must be applied to achieve the same flow rate with the higher viscosity paint.
Table 1. An excerpt from [Kwok 1991] showing some of the variables that alter the amount of spray deposition that lands on the target. In all cases, the paint flow rate was kept constant at approximately 275 cc/min.

Variable            Value   Paint Deposition (gm)   Overspray (%)
A/P Ratio           0.92    4.14                    22.32
A/P Ratio           1.49    3.54                    31.69
A/P Ratio           2.18    3.19                    39.69
Viscosity (cstk)    57      3.54                    31.69
Viscosity (cstk)    106     4.32                    25.44
Distance (inches)   7.00    4.59                    21.93
Distance (inches)   10.00   3.54                    31.69
Distance (inches)   14.00   2.75                    45.99
Fig. 4. Top: An image of a spray object painted with a pre-generated replay. Bottom: The same object painted with the same replay data, but with 75 cstk lower viscosity as well as 0.8 higher A/P ratio. The result is that the object is insufficiently painted due to the higher overspray caused by the parameter changes.
The simulation presented in this paper makes use of research performed by Kwok [4], who carried out trials of spray painting using differing paint spray gun characteristics. The resulting distribution of paint on a target surface was carefully measured for a variety of variables. By using the results of this study, variables have been added to the simulation: viscosity, A/P ratio (the ratio of air pressure to paint pressure), target distance, paint flow rate, and spray gun cone size can all be controlled by the user with realistic results. For instance, the amount of overspray (paint that misses the target, either due to hitting something else or to evaporation) varies approximately linearly with the distance of the gun from the target. Table 1 shows the effects of a few of the more important variables that are used in the simulation. Figure 4 shows how varying the parameters of the spray paint can affect the final visual appearance of the painted object. The use of these variables allows the simulation to be tailored to a particular paint with little effort. A new paint’s characteristics can simply be input to the simulation, and painters can practice painting without wasting large quantities of potentially expensive paint. One extremely important aspect of spray painting that these variables affect is overspray. Overspray is undesirable for a number of reasons. First, it is a waste of paint, costing the paint company money in materials. Second, stray paint particles can potentially fall on portions of the object or work area that will have to be cleaned later.
Finally, overspray lost into the air can become a health and environmental risk. However, many of the parameters that reduce overspray may also make a paint more difficult to spray. During user testing, when spray painters were asked to make adjustments to the spray gun until it sprayed correctly, they tended to adjust the settings in directions that caused greater overspray. A strong advantage of using the virtual training system is that overspray is accurately calculated (utilizing the research performed by Kwok) and displayed back to the user at all times. Therefore, this system provides an effective method for evaluating new spray paints and for reaching a good compromise between ease of use and overspray.
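As an illustration of how tabulated measurements such as those in Table 1 can drive the simulation, the sketch below interpolates overspray from the three distance rows. Treating the data as piecewise linear is our simplification; the paper only states that overspray varies approximately linearly with distance.

```cpp
#include <cstddef>
#include <vector>

// Overspray (%) as a function of gun-to-target distance (inches), interpolated
// from the distance rows of Table 1 (Kwok 1991).
struct Sample { double distanceInches, oversprayPercent; };

double oversprayAtDistance(double d)
{
    static const std::vector<Sample> table = {
        {7.0, 21.93}, {10.0, 31.69}, {14.0, 45.99}
    };
    if (d <= table.front().distanceInches) return table.front().oversprayPercent;
    if (d >= table.back().distanceInches)  return table.back().oversprayPercent;
    for (std::size_t i = 1; i < table.size(); ++i)
        if (d <= table[i].distanceInches) {
            double t = (d - table[i - 1].distanceInches) /
                       (table[i].distanceInches - table[i - 1].distanceInches);
            return table[i - 1].oversprayPercent +
                   t * (table[i].oversprayPercent - table[i - 1].oversprayPercent);
        }
    return table.back().oversprayPercent;  // not reached
}
```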
5 User Studies
The system has been tested with both controlled experimentation and field testing in actual spray paint companies. The controlled experiments consisted of three tests. In all three experiments, each participant was allowed as much time as they wanted to familiarize themselves with the virtual environment, and they were allowed to paint a few test objects before starting the actual experiment. Tests were limited to only a few professional painter participants, as getting enough professional painters for a full statistical study was infeasible. These tests do, however, provide basic verification that the system performs in a similar manner to real spray painting. The purpose of the first test was to confirm that an expert spray job on a simple flat panel is similar in both time and technique regardless of whether the painter is using the virtual system or spray painting with real paint. To begin the experiment, an expert spray painter was tracked painting a 25x40 inch panel with real paint at a spray paint facility. Then, expert and novice spray painters were tracked painting the same panel using the system (neither painter was the expert who applied the real paint job to the panel). Table 2 summarizes the results of this test. Both expert painters painted the object in a very similar manner. The virtual spray painter took only slightly longer with a couple fewer passes, likely due to his gun tip being a bit further from the panel than that of the expert using real paint, which also accounts for the fewer passes being made.

Table 2. The first experiment: An expert was tracked spray painting a panel. Then, the same setup was recreated virtually and painted by another (different) expert as well as a novice using the system.

Variable               Expert1 (Real Paint)   Expert2 (Virtual Paint)   Novice (Virtual Paint)
Gun Dist (1st Coat)    5 in.                  6 in.                     6-10 in.
Gun Dist (2nd Coat)    6 in.                  7 in.                     N/A
# Passes (1st Coat)    11                     10                        5
# Passes (2nd Coat)    8                      6                         0
Time (1st Coat)        33 secs                38 secs                   50 secs
Time (2nd Coat)        16 secs                13 secs                   N/A
Correct Coverage (%)   100.0                  97.4                      79.9
Table 3. The second experiment. Two painters were asked to adjust the settings of the virtual spray gun back to their nominal settings (100%) after they had been altered. The painter adjustments are shown in the order in which the painters made them. Painters generally made adjustments in 10-15% intervals, so any adjustment that ended between 85% and 115% was considered to be “close enough” to the original setting. This means all but one of the experiments ended with the painter properly adjusting the gun.

Starting Settings               Painter1 and Painter2 Adjustments
60.0% A/P, 90.0% Flow Rate      Flow→130.0%, A/P→100%, Flow→100%
120.0% A/P, 140.0% Flow Rate    A/P→105%, Flow→90.0%, Flow→110%
70.0% A/P, 110.0% Flow Rate     Flow→125.0%, Flow→95.0%, A/P→100.0%, A/P→100.0%, Flow→105.0%
Table 4. The third experiment. Two shapes were painted by both an expert painter and a novice, and their performance recorded.

Variable                 Expert                         Novice
Car Hood % Correct       99.75%                         60.0%
Car Hood Time            1 min 21 secs                  1 min 21 secs
Car Hood Overspray       52%                            40%
Car Hood Gun Dist.       10-11 inches                   6-10 inches
Motorcycle % Correct     98.0%                          71%
Motorcycle Time          2 min 8 secs                   1 min 58 secs
Motorcycle Overspray     44%                            57%
Motorcycle Gun Dist.     10-11 inches, 12-14 inches     (bottom) 4-9 inches (top)
In addition, both expert spray painters outperformed the novice spray painter. In the second experiment, the parameters of the virtual spray gun (air/paint pressure ratio and flow rate) were adjusted so that they were different from the nominal settings. Two spray painters (who both had some knowledge of how to correctly set up a spray gun) were asked to adjust the virtual gun back to the original settings, using only the performance of the virtual gun on the surface of the panel as a guide. The painters made alterations to the gun settings by asking the experimenter to make adjustments to the gun (for instance, “lower flow rate by 15%”); the painters themselves could not see the current settings. Each adjustment session ended when the painter felt that the gun was “approximately” the same as its original settings. Table 3 summarizes the results of this test: both painters were quite accurate in diagnosing what parameters had changed in the gun and in adjusting the virtual gun back to its original settings. This demonstrates that the alterations made to the spray gun settings in the virtual simulation were accurate enough to allow spray painters to properly evaluate and adjust the virtual spray gun just as they would a real gun. In the final experiment, an expert spray painter’s performance using the virtual system was compared to that of a novice spray painter. After familiarizing themselves with the virtual spray system, each was asked to paint two objects: a car hood and a motorcycle midsection.
Spray time, distance from the spray gun to the object, percentage of correct coverage, and overspray were all calculated. Each painter was asked to paint the object as completely as possible. Table 4 summarizes the results of this test. As expected, the expert spray painter performed significantly better than the novice, showing that someone more skilled as a real spray painter also performs better using the virtual system. Of particular note is that the expert was able to perform a rapid, almost flawless paint job using the virtual system after familiarizing himself with it for around half an hour. In addition to the controlled testing, the system has been shown to a number of auto refinish and paint companies since its creation. User feedback was positive. A manufacturing company incorporated the system into its training program, with positive results. After the system was introduced, the painters reported reduced error levels and fewer rejected paint jobs. In addition, use of the system during painter training improved skill acquisition.
6 Conclusions
This paper has presented a spray paint simulation system. The primary purpose of this system is to train spray painters to use different paints and spray guns without wasting valuable paint. At the same time, the system gives very specific and helpful feedback about their performance. In addition to training spray painters, the system can also be used to evaluate the properties of new paint formulas without the need to actually manufacture the paint. Finally, user testing has been employed to verify the system’s usefulness as a training tool.
Acknowledgements This work was done as a part of the Digital Technology Center at the University of Minnesota. We would like to thank DuPont, Lehmans Garage, and Heat-n-Glo Industries for providing helpful advice, feedback, and their painters’ time to test the system. This work was partially funded by NSF Grant IIP-0438693.
References

1. Shimizu, C., Meyer, G.W., Wingard, J.P.: Interactive goniochromatic color design. In: Color Imaging Conference, pp. 16–22 (2003)
2. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: SIGGRAPH 1997: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 369–378. ACM Press/Addison-Wesley Publishing Co., New York (1997)
3. Weisstein, E.W.: Barycentric coordinates. MathWorld: A Wolfram Web Resource, http://mathworld.wolfram.com/BarycentricCoordinates.html
4. Kwok, K.C.: A Fundamental Study of Air Spray Painting. PhD thesis, University of Minnesota (1991)
5. Yang, U., Lee, G., Shin, S., Hwang, S., Son, W.: Virtual reality based paint spray training system. In: Virtual Reality Conference, pp. 289–290 (2007)
6. Kim, D., Yoon, Y., Hwang, S., Lee, G., Park, J.: Visualizing spray paint deposition in VR training. In: Virtual Reality Conference, pp. 307–308 (2007)
Using Augmented Reality and Interactive Simulations to Realize Hybrid Prototypes

Florian Niebling, Rita Griesser, and Uwe Woessner

High Performance Computing Center Stuttgart (HLRS)
Abstract. Engineers and designers in various product development fields show an increasing interest in rapid prototyping techniques that help them optimize the design process of their products. In this work we present an Augmented Reality (AR) application with a model-size water turbine in order to demonstrate how rapid prototyping with a hybrid prototype, simulation data of water flow characteristics, and an optical AR tool can be realized. The application integrates interactive simulation, a tangible user interface, and several interaction concepts for 3D CFD. Due to the intuitive and automated workflow as well as seamless process iterations, the application is easy to use for users without expert knowledge in the field of parallel simulations. Our approach points out the main benefits and problems of AR in rapid prototyping and thus provides an informative basis for future research and optimizations to offer a seamless and automated workflow.
1 Introduction
Engineers in different development fields nowadays face real challenges: the complexity of products and quality standards increase, while expenditure of time and money is required to decrease at the same time. In order to meet these challenges, various rapid prototyping techniques have been developed. For modelling the physical prototype, methods like stereolithography, selective laser sintering or 3D printing are used [1]. Thus, it is possible to modify characteristics like size, shape and surface of the object efficiently. Recently, virtual prototypes have become a popular and commonly used technique in product development and optimization. They include a graphical computer representation of the geometry as well as characteristics of the object. This allows for running computations and simulations on virtual objects while the engineer can modify the virtual model and adjust the parameters as needed. In this way, the virtual design and possible construction errors can be corrected before the physical prototype is built. By superimposing the physical and virtual prototypes, modifications of components can be directly compared and evaluated in an easy manner. So far, the combination of physical and virtual prototyping has been integrated into the product development process only to a minor extent. Yet the engineers’ interest in these hybrid prototypes is growing, and some are already used in operative processes in industry.
Virtual Reality (VR) techniques today are employed in early phases of the development process to evaluate the draft of the future product. Engineers benefit from VR techniques as they allow them to visualize the geometry of the object as well as the huge amount of simulation data in a demonstrative and intuitive way. Thus, managers and specialists from other workgroups who are not familiar with the technical details are also able to interpret and use VR techniques. Nevertheless, not all aspects can be captured and visualized sufficiently: properties like shape, distance, optical appearance and haptics of materials, for example, can only be perceived properly on physical objects. In contrast, Augmented Reality (AR) techniques provide essential advantages by overlaying the physical prototype with virtual content. One of the major benefits of AR applications is the simultaneous judgment of multiple parameters. This becomes relevant when, for example, the object undergoes strong deformations or material wear. Here, AR techniques can help to measure the deviation from the original state. Another major benefit is the interactivity provided by tangible interfaces and interactive simulation. This is particularly relevant when the objects change their position or have movable components. The user can then modify parameters like the shape, position and orientation of the object or its components on the spot and see the simulation results for the new configuration at once. The overall aim is to develop AR applications that do not require special expert know-how but are easy and effective to use. In order to meet these requirements, several main process iterations have to be completed: grid generation, domain decomposition, simulations and calculations, post-processing for AR visualization, and the setup of a tangible user interface. The quality of AR applications strongly depends on the performance of these iterations. Currently, an important development area for electric power generation is the optimization of water power plants. Due to the field of operation, water power plants have to be designed individually, and the design processes require the use of very complex techniques. In order to increase the efficiency of hydraulic turbines, several approaches are being researched today. One important aspect is the investigation of the optimal design of the turbine runner and the optimal blade angles for different operating points. In this work, an AR application with a model-size Kaplan turbine is presented to demonstrate how rapid prototyping with a hybrid prototype and interactive simulation can be realized.
2 Related Work
In 1994, Milgram et al. gave a definition of a continuum of real-to-virtual environments, the two main parts of which they refer to as Mixed Reality (MR). The part where virtual elements are added to the real content is defined as Augmented Reality (AR), the other part where the virtual environment is complemented with real objects is called Augmented Virtuality (AV) [2]. Verlinden et al. [3] improve on the classification of AR systems given by Milgram et al. 1994 [4]. They categorize applications such as the one presented here as ”VideoMixing AR”.
In his survey of 1997, Ronald T. Azuma describes AR as a variation of virtual environments and points out its supplementary function in various application fields like medical visualization, maintenance and repair, annotation, robot path planning and entertainment [5]. While some AR approaches aim at generating virtual objects so realistic that they can hardly be distinguished from the real environment, the focus in product development processes today is rather put on the ease of use of AR systems. Azuma mentions various basic challenges that have to be fulfilled in order to obtain a satisfying augmentation. Since then, a lot of research concerning optics and resolution, accuracy, registration of real and virtual objects, environmental and lighting conditions, as well as marker tracking has been done. In his survey of 2001, Azuma notes rapid technological advancements during the preceding years. This includes issues like tracking approaches, calibration, latency, displays and visualization problems [6]. Augmented reality techniques have been used in production, e.g., for visualization of air flow in car and airplane cabins using CFD simulations [7]. Most AR applications make use of precomputed simulation data for visualization, although there has been some work by Schmalstieg et al. to couple AR and online simulations in the Studierstube project [8]. Various visualization environments have been used for computational simulation steering. Uintah/SCIRun [9], AVS [10], CUMULVS [11] and COVISE [12] provide integration of interactive visualization into the simulation workflow. COVISE has also been used to integrate tangible interfaces [13] to make parallel simulation on remote supercomputer resources accessible not only to the simulation expert but also to other engineers involved in product design.
3 An Application: Design of a Kaplan Turbine
Water power plants in flowing waters are usually subject to extremely high variations of flow and head, which often lead to pressure drops and finally to cavitation. Kaplan turbines are special water turbines that have adjustable rotor blades and thus can be operated efficiently even when flow conditions vary. When optimizing these hydraulic turbines, characteristics of the water flow like the pressure and velocity distribution have to be investigated [14]. In this work we present an AR application based on interactive simulation in real time and a tangible user interface. As a concrete example object we use a model-size Kaplan turbine 27 cm in height. The propeller-shaped runner has four adjustable rotor blades that can be turned around their anchorage. The tangible user interface is realized with the optical marker tracking system ARToolKit [15]. Two pattern markers were used: the first marker, 48 mm in size, was attached to the upper part of the model turbine in order to determine the position and orientation of the model; the second marker, 27 mm in size, was fixed at the edge of a turbine blade in order to determine the rotation angle of the blade. In our setup, the pattern markers were viewed from a distance of about 50 cm. The video camera we used was an HD camera with a resolution of 1400x1050 pixels.
4 Rapid Prototyping and Interactive Simulations
Prototypes are used in most phases of product development to validate characteristic properties of the product or several of its parts. Since the development of physical prototypes is time consuming and expensive, virtual prototypes are used to replace them, particularly in earlier phases of development. This allows for faster changes in the initial product design and therefore an improved development cycle. To be able to evaluate the properties of a virtual prototype, its behavior has to be simulated numerically. In turbine design, computational fluid dynamics (CFD) simulations are used to optimize the machine for different operating points. The workflow of a simulation cycle consists of geometry generation, grid generation, domain decomposition, post-processing and simulation. We will outline this workflow in the following sections. We have integrated the workflow into the dataflow-oriented Collaborative Visualization and Simulation Environment (COVISE), allowing for presentation of the simulation results in immersive and augmented reality environments, and collaborative product development through the use of hybrid prototypes.
4.1 Hybrid Prototypes
Both physical and virtual prototypes are not always able to represent the finished product satisfactorily. Physical prototypes can diverge from the behaviour of the finished product because of, e.g., different processes used in the development, different materials, or the high amount of manual work involved. Some parameters can only be determined numerically, because a physical experiment would be too time consuming, expensive or dangerous. Virtual prototypes, too, are not able to exactly represent reality. The design and simulation are based on simplified physical models, geometry often has to be approximated by polygonal meshes, and properties of certain materials may not even be known. To overcome these limitations, hybrid prototypes aim at combining the use of physical and virtual prototypes. They enable the engineer to compare simulation and experimental results, and to optimize the design of the virtual and physical prototypes by evaluating the results of both processes together. Hybrid prototypes integrate geometry and behaviour in a computer representation while allowing a user to interact with a physical model and possibly evaluate the behaviour of the prototype in a test environment. The physical representation of objects and physical feedback are combined with computer-generated information to better analyze the behaviour or properties of the product.
4.2 Grid Generation
The computational mesh for the numerical flow simulation of the turbine runner is an unstructured grid based on hexahedral elements. The grid is generated automatically by a custom COVISE module, ”AxialRunner”, which allows for the parametrized design of axial flow turbine runners.
In addition to the computational grid, the module generates the boundary conditions used in the CFD simulation, as well as polygons representing the surface of the shrouds and the hub of the turbine runner. This virtual object can then be used for realistic occlusion effects of the turbine model with respect to the post-processed simulation results, and for simultaneous visualization of the simulation results in immersive environments. FENFLOSS, the CFD simulation code used in this project, has been extended to be able to read unstructured grids in COVISE format. The size of the computational grid can range from approximately 100,000 elements for interactive response times to millions of elements for a more accurate, but more time-consuming, calculation, depending on the computing resources available.
4.3 Domain Decomposition
The computational grid is split into several parts for parallel processing of the simulation on remote computing resources. Generally, the number of partitions of the mesh should be the same as the number of compute units (e.g. CPU cores) available. Because communication between the different processes used in the simulation is expensive, even on high-bandwidth and low-latency networks such as InfiniBand, the computational grid must not be subdivided into too many partitions. We implemented a COVISE module that uses METIS [16] as a library for the partitioning of the mesh. METIS provides a fast and stable solution for domain decomposition and is simple and practical to use. Figure 1 shows a dataset split into four domains with interface elements (yellow cells) between these domains. Because of the high overhead of communication between the cluster nodes during the simulation, there is no advantage in subdividing the relatively small computational grid used in this interactive simulation into more than four partitions. To support more detailed simulations using fine-grained computational meshes, the domain decomposition module is parametrizable to split the computational grid into domains suitable for the available resources.
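The paper does not show the partitioning call itself; the sketch below assumes the METIS 5 graph-partitioning API and a mesh connectivity that has already been converted to a CSR adjacency structure, which is one common way such a module could invoke the library.

```cpp
#include <metis.h>
#include <vector>

// Partition the connectivity graph of the computational grid into nparts
// domains (e.g. four, matching the number of compute cores used here).
std::vector<idx_t> partitionGrid(std::vector<idx_t>& xadj,    // CSR row offsets
                                 std::vector<idx_t>& adjncy,  // CSR adjacency list
                                 idx_t nparts)
{
    idx_t nvtxs = static_cast<idx_t>(xadj.size()) - 1;  // number of cells
    idx_t ncon = 1;                                     // one balance constraint
    idx_t objval = 0;                                   // edge cut on return
    std::vector<idx_t> part(nvtxs);                     // partition id per cell

    METIS_PartGraphKway(&nvtxs, &ncon, xadj.data(), adjncy.data(),
                        nullptr, nullptr, nullptr,      // no vertex/edge weights
                        &nparts, nullptr, nullptr, nullptr,
                        &objval, part.data());
    return part;
}
```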
4.4 Simulation
FENFLOSS [17] is developed at the Institute of Fluid Mechanics and Hydraulic Machinery (IHS) at the University of Stuttgart. FENFLOSS can be used for the simulation of incompressible flows and uses Reynolds-averaged Navier-Stokes equations on unstructured grids. The simulation code can be applied to laminar and turbulent flows. The turbulence models used are turbulent mixing length models as well as various k-ε models, including nonlinear k-ε models and algebraic Reynolds stress models. The solver works for 2D or 3D geometries, which can be fixed or rotating, and for either steady-state or unsteady problems. FENFLOSS can also handle moving grids (rotor-stator interactions) and contains methods to calculate free surface flows. It can be used on massively parallel computer platforms and is optimized for vector processors, e.g. the NEC SX-8. The program employs a segregated solution algorithm using a pressure correction.
Fig. 1. Decomposition of the computational mesh into four parts
The parallelization takes place in the solver, which uses BiCGStab2 including ILU pre-conditioning. Coupling of fixed and moving grids is accomplished by using integrated dynamic boundary conditions. FENFLOSS provides a user subroutine API, which can be used to call user-supplied functions at different places in the solver. We used the subroutine API to handle data transfer between the simulation and the visualization environment COVISE. After each time step, the computed pressure and velocity fields are sent to COVISE, where the data can be further post-processed and visualized. The coupling of the simulation and the visualization environment is implemented by using a socket connection between the solver API and a custom COVISE simulation module. The COVISE module implements the ”coSimLib” interface, which was designed to provide an abstraction layer for simulation coupling. The simulation module integrates transparently into the workflow by providing the computed values, scalar and vector data, in COVISE data format for further post-processing. Parts of the simulation results, e.g. the pressure and velocity values at the outlet of the turbine runner, can also be used as boundary conditions for other, possibly coupled, simulations. The computation of the flow in the wicket gate and the draft tube was not part of this simulation.
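As a generic illustration of the per-time-step hand-off (this is not the actual coSimLib protocol; the framing and the helper names are assumptions made for the sketch), the solver-side callback could push each field over a plain TCP socket roughly as follows.

```cpp
#include <arpa/inet.h>
#include <sys/socket.h>
#include <cstdint>
#include <vector>

// Send one field (e.g. pressure or one velocity component at the grid nodes)
// as a length-prefixed block of raw floats. The framing is purely illustrative.
bool sendField(int sock, const std::vector<float>& values)
{
    uint32_t count = htonl(static_cast<uint32_t>(values.size()));
    if (send(sock, &count, sizeof(count), 0) != sizeof(count)) return false;
    const char* data = reinterpret_cast<const char*>(values.data());
    size_t bytes = values.size() * sizeof(float);
    size_t sent = 0;
    while (sent < bytes) {
        ssize_t n = send(sock, data + sent, bytes - sent, 0);
        if (n <= 0) return false;
        sent += static_cast<size_t>(n);
    }
    return true;
}

// Hypothetical hook called from the solver's user-subroutine API after each
// time step; the visualization side would read the fields and convert them
// into COVISE data objects.
void onTimeStepFinished(int sock, const std::vector<float>& pressure,
                        const std::vector<float>& velocity)
{
    sendField(sock, pressure);
    sendField(sock, velocity);
}
```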
4.5 Augmented Reality and Computational Steering
In our setup, a high resolution Head Mounted Display (HMD) is used to explore the physical model of the Kaplan turbine. The camera images of the HMD are captured and analyzed by a modified version of ARToolKit [15], which analyzes the images at around 10 frames per second on a 1.6 GHz Pentium M. It returns the position and orientation, in six degrees of freedom, of all markers which are completely visible.
Fig. 2. Post-processed simulation results in an augmented reality environment. The angle of the blades used for the two simulations differs by 6 degrees.
By using the position and orientation of the object marker, the post-processed simulation data can be overlaid on the images taken by the cameras. By simply rendering the virtual model of the Kaplan turbine generated by the ”AxialRunner” module to the z-buffer, we achieve realistic occlusion between the physical model and the post-processed simulation data. The second marker, which is attached to one of the turbine blades, is used to obtain the angle of the turbine blades from the tangible interface. This angle serves as an input parameter for the ”AxialRunner” grid generation module. By changing an input parameter, e.g. the angle on the physical model, the engineer can influence a parallel flow simulation of the turbine runner and explore the resulting data. When an updated computational mesh is generated by the grid generator, the CFD simulation is restarted. Because the grid does not differ too much between simulations, and the number of nodes as well as their connectivity stays the same, the computed results of the previous iteration can be used as a starting point for the newly set up simulation. This allows for faster convergence of the numerical simulation and therefore an improved perception of interactivity by the user. The other modules used in the workflow are standard COVISE modules used for data analysis and visualization, e.g. the Tracer, Cutting Surface and Colormap modules. A visualization of the resulting post-processed data, combined with the physical prototype of the turbine runner as seen through the HMD, is presented in Figure 2. By using COVISE, workgroups can share a simulation workflow in a collaborative session and explore the results together in multiple different, even spatially distributed, environments. By using tangible interfaces, a more natural mode of operation is possible compared to a traditional user interface in which, for example, coordinates must be typed in to move objects.
perception of the model in comparison to a simple monitor image is another important advantage.
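The overall interaction described in this section can be summarized as a simple steering loop. The sketch below is schematic: the function names for the grid generator, the solver restart, and the marker read-out are placeholders, not the actual AxialRunner, FENFLOSS, or ARToolKit interfaces.

    #include <cmath>

    struct Grid     { /* computational mesh; node count and connectivity are fixed */ };
    struct Solution { /* pressure and velocity fields on the grid nodes */ };

    Grid     generateGrid(double bladeAngleDeg);                       // grid generator (assumed)
    Solution runSimulation(const Grid& g, const Solution* warmStart);  // CFD solver (assumed)
    double   readBladeAngleFromMarker();                               // tangible interface (assumed)
    void     visualize(const Solution& s);                             // post-processing (assumed)

    void steeringLoop(double angleThresholdDeg) {
        double   lastAngle = readBladeAngleFromMarker();
        Solution current   = runSimulation(generateGrid(lastAngle), nullptr);
        for (;;) {
            double angle = readBladeAngleFromMarker();
            if (std::fabs(angle - lastAngle) > angleThresholdDeg) {
                // Warm start: node count and connectivity are unchanged, so the
                // previous result serves as the initial guess for faster convergence.
                current   = runSimulation(generateGrid(angle), &current);
                lastAngle = angle;
            }
            visualize(current);
        }
    }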
5
Discussion
The superimposition of real and virtual objects provides useful information for a variety of purposes. Hybrid prototypes can be used to evaluate simulation data with respect to measurements. In turbine design, verifying the simulation of stationary parts, such as the wicket gate and the draft tube, can easily be done in a test facility. To verify moving parts such as the turbine runner, the test facility has to be equipped with a stroboscope, which makes the runner appear stationary for a given constant rotational speed. Physical prototypes are also quite often used for teaching purposes. We believe that the comprehension of complex systems can be strongly improved by using hybrid prototypes that include coupled online simulations. For production use in rapid prototyping, augmented reality applications are often too inaccurate and require a lot of time to set up for each changing dataset. Camera tracking using ARToolKit and pattern markers works quite well for tracking the prototype. However, it is not precise enough for input parameters where small changes lead to large computational expenses, such as the blade angle in the application outlined above. This shows the need to provide additional sensors for user input. When a hybrid prototype is to be used in collaborative sessions, there has to be a way to provide feedback to the tangible interface. If a parameter is changed using the virtual representation of the object, the respective parameter in the tangible interface should be changed as well. These parts of the tangible interface would have to be developed specifically for each application, leading to increased development time.
6
Conclusion
In this work we have presented an AR application including an interactive simulation and a tangible user interface in order to demonstrate how rapid prototyping can be supported by hybrid prototypes. Today, even challenging simulation tasks like 3D CFD can be solved in acceptable time and with reasonable effort. Our approach with a model-size water turbine shows that the superimposition of virtual simulation data on a real prototype helps to understand and interpret complex relationships. A set of user interfaces for the modification of input parameters, the preparation of visualization modules, and the orientation detection of movable components allows us to control the application. With these user-friendly interfaces and seamless automated workflows, even engineers and designers without expert knowledge in simulation are able to optimize solutions. The automated process chain of interactive simulation includes grid generation, domain decomposition, the simulation of incompressible flows, a tangible user interface, as well as post-processing and visualization.
We have identified various challenges that have to be met. Although the tracking of objects using ARToolKit is good enough for superimposing data with the physical prototype, it is not exact enough to represent input parameters for online simulations. This increases development time, because the easy-to-set-up method of obtaining input parameters via additional markers has to be replaced with application-specific sensors that provide accurate values. In the future, we would like to integrate other parts of the turbine design into our workflow. This includes the wicket gate and the draft tube, as well as a coupled online simulation that allows for interaction and back coupling of the different parts. Additional sensors for collecting the angle of the shrouds in the wicket gate and the turbine blades in the runner could be integrated into the tangible interfaces.
References 1. Kai, C.C., Fai, L.K.: Rapid Prototyping: Principles & Applications in Manufacturing. John Wiley and Sons, New York (1998) 2. Milgram, P., Takemura, H., Utsumi, A., Kishino, F.: Augmented reality: A class of displays on the reality-virtuality continuum. In: Telemanipulator and Telepresence Technologies, vol. 2351 (1994) 3. Verlinden, J., Horvath, I., de Smit, A.: Case-based exploration of the augmented prototyping dialogue to support design. In: Proceedings of TMCE, pp. 245–254 (2004) 4. Milgram, P., Kishino, F.: A taxonomy of mixed reality visual display. Inst. of Electronics, Information and Communication Engineers (IEICE) Trans. Information and Systems E77-D(12), 1321–1329 (1994) 5. Azuma, R.: A survey of augmented reality. In: Proceedings of Computer Graphics (SIGGRAPH 1995), pp. 1–38 (1995) 6. Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., MacIntyre, B.: Recent advances in augmented reality. IEEE Computer Graphics and Applications 21, 34–47 (2001) 7. Regenbrecht, H., Baratoff, G., Wilke, W.: Augmented reality projects in automotive and aerospace industry. IEEE Computer Graphics and Applications (2005) 8. Schmalsteig, D., Fuhrmann, A., Szalavari, Z., Gervautz, M.: Studierstube - an environment for collaboration in augmented reality. In: Proceedings of Collaborative Virtual Environments, pp. 37–49 (1996) 9. de St. Germain, J.D., Parker, S.G., McCorquodale, J., Johnson, C.R.: Uintah: A massively parallel problem solving environment. In: HPDC, pp. 33–42 (2000) 10. Vaziri, A., Kremenetsky, M.: Visualization and tracking of parallel CFD simulations. In: Proceedings of HPC 1995, Society of Computer Simulation (1995) 11. Kohl, J., Papadopoulos, P., Geist, G.: Cumulvs: Collaborative infrastructure for developing distributed simulations. In: Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis (1997) 12. Lang, U., Woessner, U.: Virtual and augmented reality developments for engineering applications. In: Proceedings of the European Congress on Computational Methods in Applied Sciences and Engineering ECCOMAS (2004) 13. Woessner, U.: Arctis: augmented reality collaborative tangible interactive simulation. In: SC 2006, p. 304. ACM, New York (2006)
14. Lippold, F., Ogor, I.B.: Fluid-structure interaction: Simulation of a tidal current turbine. In: High Performance Computing on Vector Systems 2007, pp. 137–143 (2008) 15. Kato, H., Billinghurst, M.: ARToolKit (2001), http://www.hitl.washington.edu/research/shared space/download/ 16. Karypis, G., Kumar, V.: MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 2.0 (1995) 17. Ruprecht, A., Bauer, C., Gentner, C., Lein, G.: Parallel computation of stator-rotor interaction in an axial turbine. In: ASME PVP Conference, CFD Symposium, Boston (1999)
Immersive Simulator for Fluvial Combat Training Diego A. Hincapié Ossa, Sergio A. Ordóñez Medina, Carlos Francisco Rodríguez, and José Tiberio Hernández Universidad de los Andes, Bogotá, Colombia
[email protected],
[email protected] http://www.uniandes.edu.co
Abstract. This paper presents the development of a simulator whose objective is to train Colombian Navy soldiers. The device has three subsystems: a mobile platform, a graphical interface, and a shooting device. The first was constructed by connecting two linear actuators to a seat-shaped, single-user platform with two rotations about the horizontal axes. These actuators are driven by servomotors connected to a motion controller. The graphical interface provides the visualization of a realistic three-dimensional world composed of a river, a firearm on a moving boat, targets, and natural elements along the riversides. The software offers an optional stereoscopic view and captures the shots provided by the third subsystem. The shooting device is the result of combining encoders with a cardan-type joint; it allows the aim to move in elevation and azimuth coordinates. Finally, preliminary tests of its potential use were conducted with satisfactory results. Keywords: Immersive Simulation, Fluvial, Assault, Dynamic Simulation, Firearm, Shooting Training.
1
Introduction
Learning is an activity inherent to human beings, and with the development of new technologies, learning or training is necessary for their effective use; on the other hand, technology can be used to improve the way in which humans learn or are trained. Several activities require hard and constant dedication to attain ability and experience in their execution. However, training in some activities turns out to be expensive and impractical. This is the motivation for the development of new learning and training methods. It is pertinent to introduce some aspects of the social context in which this project is developed: there has been a military conflict in Colombia for almost half a century, and advances in security technologies are necessary. This work focuses on the search for an efficient, economical and practical way for Colombian soldiers to obtain the proper conditioning for fluvial assault. Specifically, this project is developed to help meet the clear need of training Colombian soldiers in the specific activity of shooting from vehicles in motion.
This paper first reviews current and recent related work. Next, the problem and requirements of this project are defined and its development process is described. Then, the final characteristics of the prototype subsystems are presented and their performance is evaluated with satisfactory results. Finally, the conclusions of the design, construction and evaluation processes are presented and possible further developments are proposed.
2
State of the Art
The state of the art is reviewed in this section; it must be noted that current advances in military technology, because of their nature, are neither commercial nor publicly available. Some technical information can be found in published patents [7] [9] [13] [10] [12]. Most of the methods are based on the manipulation of laser beams.
1. LONG RANGE SHOOTING SIMULATION: A simulator with variable caliber (AR10 .308 caliber rifle, Windrunner .338 and Windrunner .50) and adjustable sizes of static targets in diverse scenarios. It is a precision shooting simulator that takes into account environmental variables such as wind speed and temperature effects on the bullet. Visually it presents a telescopic aim and sound effects [4].
2. SHOTPRO 2000: A three-projection simulator with a screen angle range of 180 degrees. Its projected scenarios are prerecorded and present animated targets with realistic movements. The shooting detection system is based on recognizing the position of a laser emitted by the simulated weapon [5].
3. EUROSIMULATOR: A shooting simulator available on the Web. The user can choose between a revolver, a pistol, and a rifle. The simulation can be performed in diverse environments within a closed shooting stage. The shooter can manipulate some aiming options such as the oscillation of the holding hand [1].
4. MARKSMAN TRAINING SYSTEMS ST-2: The first hunting simulator. It has realistic prerecorded scenarios with efficient feedback, including the firearm position and a backward hitting system. It offers several hunting possibilities with a shotgun (rabbit, duck and pheasant) and a rifle (moose, deer, wild boar and bear). The system has a patented shooting recognition technique and adjustable target speed and trajectory [2].
5. NOPTEL 2000: A training simulator for soldiers, policemen and sportsmen. It offers two training modes, basic indoor shooting and an outdoor assault environment. To recognize the firearm's orientation, a device is mounted on modified real weapons. In the indoor mode this device has a wired connection to a computer that stores the timing and result information; the connection is wireless in the outdoor mode. The shooting recognition is based on a reflective laser target. It includes a backward hitting system built with compressed-air capsules [3].
3 Project Definitions
3.1 General Problem
Shooting training is a very expensive activity for the Colombian Armed Forces; in the case of the Colombian Navy it is indispensable to train soldiers in a fluvial environment, which is extremely impractical because of fuel and ammunition costs. Developing a simulator for this purpose is the main objective of this project. It is necessary to create a virtual environment in which the user can become immersed by deceiving the senses with visual and perceptual stimuli; the main feature of this work is the integration of a physical motion simulator with a proper image projection and a performance measurement system.
3.2 Requirements
Three different subsystems are necessary to solve the problems above. Their requirements are described below.
1. A mobile platform that simulates the movement of a ship on a river. The mobile platform must be able to move in two degrees of freedom: two rotations about the horizontal axes. It must be implemented by arranging linear actuators on a supporting structure. The actuators' behavior can be driven by a motion controller. Another required characteristic is synchronization with the other subsystems to accomplish immersion. The motion of the platform must be programmed, executed and tested; the possibility of executing several modes of movement is desirable.
2. A graphical interface that imitates the desired scenario through the projection of a realistic virtual world. The graphical software must have a virtual world model composed of the most important objects of the real environment with which the soldiers interact. This graphical software must project the virtual world realistically for user visualization. The software must be able to calculate the trajectory of a shot with the information provided by the physical device and must have a score system that allows the monitoring of a user's training process. The projected visual environment must simulate the same motorboat motion executed by the mobile platform. Several features are desirable: targets in the virtual world should have different sizes, movements and orientations; the software may have different modes of execution, for example automatic or manual weapons could be implemented; and the software must have the option to display a stereoscopic view to gain immersion.
3. A shooting device which allows the user's interaction with the virtual world. The shooting device must be able to move in two degrees of freedom: elevation and azimuth. Also, a trigger must be implemented to generate the information of a shooting event; the information of the orientation and shooting parameters must be sent in real time to the graphical software to guarantee
synchrony. The shooting device must be constructed to withstand the movement conditions of the platform and must have a firearm-like appearance for aiming in the shooting action.
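As an illustration of requirement 3, the information exchanged between the shooting device and the graphical software can be as compact as the following record; the field names are our own and not part of the original specification.

    #include <cstdint>

    // Hypothetical state record sent to the graphical software at a fixed rate.
    struct ShootingDeviceState {
        float         elevationDeg;    // aim elevation from the elevation encoder
        float         azimuthDeg;      // aim azimuth from the azimuth encoder
        bool          triggerPressed;  // push-switch state (held while firing)
        std::uint64_t timestampMs;     // for synchronization with the platform motion
    };
    // A shot is registered on the rising edge of triggerPressed, using the
    // elevation/azimuth pair sampled at that instant to compute the trajectory.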
4
Project Development
The solution presented in this document was initially developed by strictly respecting the conditions given in previous projects. The mobile platform was developed by Vivas [15]; it is a Joyrider-type structure. Subsequently, a first version of a motion control was implemented for the platform by Cárdenas [6]. In Ordóñez's project [11] two subsystems of the prototype were developed. The graphical software was completely designed, implemented and tested. A first prototype of the shooting device was conceived, constructed and evaluated, including the backward hitting assembly. Two versions of the synchronization of the previous two subsystems were completed and compared. The first was based on a voltage transduction from a variable resistor (linear potentiometer with rotational movement) and an acquisition board. The second and selected option was based on positional encoder transduction. The prototype integration was continued in Hincapié's project [8], where the mobile platform subsystem was redesigned and completed. A new motion control system was implemented and the movement simulation programs were developed. Some demonstrations of the interaction of the visual interface and the preliminary shooting device mounted on the mobile platform were performed. According to the results, a new shooting device was designed and constructed to gain robustness, keeping the transduction system. The boat's movement was taken into account in the coordination with the graphical software, achieving synchronization between the mobile platform and the virtual world events. Functionality tests were developed, and preliminary tests of the simulator's potential use as a training machine were conducted.
5 Results
5.1 Prototype Description
Mobile Platform Characteristics. The description of this subsystem is broken down next. The mobile platform subsystem was developed over a PVC structure and is capable of performing the two desired degrees of freedom. The whole motion of the platform can be described as two successive independent rotations: a reference frame A rolls over a frame fixed to the ground, and a second frame B rotates around frame A (Fig. 1). A preliminary generic Joyrider structure [15] was modified to include new actuators attached to the controllers; the structure is robust enough for the simulator prototype's purposes. The whole dynamic system produces a satisfactory motion.
Fig. 1. (a) Mobile platform structure and degrees of freedom. (b) α: firearm latitude; β: firearm azimuth; θ: ship roll; φ: ship pitch.
Several kinds of movements were programmed. Two of them can be used in this prototype of a fluvial vehicle simulator: oscillatory periodic and random motions. Oscillatory periodic motion can be run in synchrony with the virtual world projection, while oscillatory random motion can be used to explore the human response to a non-predictable kind of motion. The motion programs are the result of combining harmonic functions in the two degrees of freedom that are being used. The motion is preprogrammed and must be enhanced in future work.
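A minimal sketch of such a preprogrammed motion profile is given below; the amplitudes, frequencies, and phases are invented for illustration and would in practice be tuned to the actuators and to the boat motion being imitated.

    #include <cmath>

    struct PlatformPose { double rollDeg; double pitchDeg; };

    // Oscillatory periodic motion: a sum of harmonics per degree of freedom.
    PlatformPose oscillatoryMotion(double tSeconds) {
        const double twoPi = 2.0 * 3.14159265358979;
        double roll  = 4.0 * std::sin(twoPi * 0.20 * tSeconds)
                     + 1.5 * std::sin(twoPi * 0.55 * tSeconds + 0.8);
        double pitch = 3.0 * std::sin(twoPi * 0.15 * tSeconds + 1.1)
                     + 1.0 * std::sin(twoPi * 0.45 * tSeconds);
        return { roll, pitch };  // converted to actuator strokes by the motion controller
    }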
Graphical Software Characteristics. The software subsystem presents the following features (Fig. 2):
– A single static or dynamic user moving along a rectilinear river.
– A shot trajectory calculation, marking the correct hits on the targets.
– An accumulated and partial score system.
– Static or dynamic classical targets with different sizes, movements and orientations.
– Acceptable graphical quality displaying a river, its banks, some natural elements, shooting targets and a 0.50 M-2HB firearm.
– Random generation of the terrain, the target disposition and the natural element configuration.
– Three static perspective projections (front, left and right), a dynamic perspective projection moving with the firearm azimuth angle, three static orthogonal projections (frontal, lateral and superior), and a dynamic orthogonal projection with trackball and zoom options.
– Phong illumination, and flat or Gouraud shading.
– Stereoscopic projection with the option of a variable distance between the eyes.
– Two firearm options: manual or automatic shooting.
– A joystick calibration system for the shooting device, which records in a text file the degrees per pixel in both the azimuth and elevation coordinates.
Fig. 2. (a) Graphical software visualization. (b) Stereoscopic software visualization.
Shooting Device Characteristics. This subsystem was developed in two phases, and a backward hitting system was constructed and tested.
a. Static Shooting Devices over Tripod: Two versions of this device were implemented (Fig. 3). Both present the following operational characteristics: in order to support the manual and automatic shooting modes in the software, a push switch was used as the trigger so that a continuous signal is transmitted only while it is being pressed. With respect to the modes of motion, the two desired rotations are made possible by the tripod's positioning mechanism.
– The first device provides information in both coordinates through linear potentiometers with rotational movement. The two rotations change the potentiometers' resistance and, consequently, the voltage values. In real time, these values are read by the acquisition board and transferred to the visualization software through a dynamic library. This library was programmed in C++ and interfaced to Java via JNI. This device presents a considerable noise error, corresponding to a constant oscillation of two degrees in the acquisition; furthermore, it has a complicated digital connection. Its calibration depended on the potentiometers' resistance values.
– The second version was selected because of its performance; it takes advantage of two positional encoders with rotational movement. These electronic devices perform the analog-to-digital transduction directly in a very practical way. The resolution of the device was satisfactory: 0.8 degrees per pixel for the azimuth encoder and 0.5 degrees per pixel for the elevation encoder. Other advantages are its simple digital connection and its ease of calibration.
b. Mobile Shooting Device over Platform: It was developed by mounting position encoders on a cardan-type joint that gives the firearm two degrees of freedom (Fig. 4 (a)). It presents a highly reliable type of motion; the elevation and azimuth rotations are possible. Also, the new structure is robust enough to withstand the loads applied by user manipulation. On the other hand, the trigger
Fig. 3. (a) First approach to shooting device. (b) Selected standing shooting device.
Fig. 4. (a) Final shooting device over the Joyrider. (b) Backward hitting assembly.
was implemented so that shooting event information can be produced. The device communicates the position and shooting event information to the graphical software, so the shooting device is coordinated with the visual projection. With respect to the desirable objectives, the device has a cannon-like appearance and its ergonomics are acceptable.
c. Backward Hitting Assembly: Haptic feedback can be as useful as visual or auditory feedback, but it is not widely used. In this project a backward hitting system for the physical device was implemented to improve the sensory experience and gain immersion. It was developed using compressed air, but the results are to be improved in future work.
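To make the encoder calibration reported above concrete, the mapping from the encoder angles to a screen position can be sketched as follows; the function and parameter names are illustrative and the screen-centre handling is simplified.

    struct AimPixel { int x; int y; };

    AimPixel aimToPixel(double azimuthDeg, double elevationDeg,
                        int screenCenterX, int screenCenterY) {
        const double degPerPixelAz = 0.8;   // azimuth encoder calibration
        const double degPerPixelEl = 0.5;   // elevation encoder calibration
        return { screenCenterX + static_cast<int>(azimuthDeg   / degPerPixelAz),
                 screenCenterY - static_cast<int>(elevationDeg / degPerPixelEl) };
    }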
5.2 User Test Results
After implementing the three subsystems of the prototype and installing them together as a single integrated machine, some qualitative tests were done to validate its functionality. Furthermore, some users tested the apparatus in a specific order to discern patterns that allow estimating the potential of the simulator for its ultimate training goal. The test conditions consisted of running periodic motions at various levels of the whole integrated system.
Fig. 5. (a) User tests verifying difficulty levels. (b) User tests verifying potential as a training tool.
Fig. 6. (a) Piraña military boat of the Colombian Navy. (b) Complete system assembled.
Random motion was tested only to identify the user's sensation of movement, without trying to integrate it with the visual projection; it was evaluated at various levels of intensity. The results are presented next; in them, each user's scores are normalized by that user's mean. A Student's t-distribution hypothesis test was performed. Twelve individuals each performed for two minutes in the simulator prototype in each run. In Fig. 5 (a) the results of the quantitative functionality tests are shown. They consisted of comparing increasing difficulty levels for the same user. The hypothesis to be proved was that the mean normalized score decreased when the difficulty level increased. In Fig. 5 (b) the results of the potential training tests are presented. In this case the hypothesis to be tested was that the mean normalized score increased when the same user conducted the test for a second and third attempt.
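For reference, the statistic behind these hypothesis tests can be computed as sketched below (a paired, one-sided t test on the per-user differences of normalized scores; this is our reading of the procedure, not code from the original project).

    #include <cmath>
    #include <vector>

    // diffs: per-user difference of normalized scores between two conditions.
    double pairedTStatistic(const std::vector<double>& diffs) {
        const double n = static_cast<double>(diffs.size());
        double mean = 0.0;
        for (double d : diffs) mean += d;
        mean /= n;
        double var = 0.0;
        for (double d : diffs) var += (d - mean) * (d - mean);
        var /= (n - 1.0);
        return mean / std::sqrt(var / n);   // compare against t(n-1) quantiles
    }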
6
Conclusions
It has been demonstrated that simulation prototypes like the one developed in this project work correctly using currently available technology. For the simulator
implementation, three subsystems (a mobile platform, graphical software, and a shooting device) were developed and integrated with acceptable synchrony. Acceptable immersion was reached because of the realism of the visualization, the stereoscopic vision, the physical weapon manipulation, the backward hitting assembly, and the mobile platform movement. The prototype's functionality was tested, as the difficulty levels were shown to be coherent. Other test results demonstrated that a person can be trained to use the apparatus. The prototype that resulted from this project fulfills its initial requirements, and it is the first approximation to a complete simulator of these characteristics.
7
Further Work
– The visualization of the graphical software must be improved by including more real objects, developing lighter geometric models, and adding special effects.
– More complex rivers can be developed; the possibility of modeling real Colombian rivers is desirable.
– Non-predetermined motion must be developed. This can be done if real-time signal communication between the virtual world and the motion controller equipment is established.
– A higher level of programming could be developed to transform the movement platform into a manageable computational device.
– A new motion-capture design must be developed to increase the robustness of the weapon signaling. The current robust structure can be used.
– The ergonomics of the firearm must be improved.
– The backward hitting of the simulation weapon could be improved by increasing the force magnitudes to gain haptic realism.
– For the development of a complete simulator, structural and environmental characteristics must be evaluated by an expert; a trained person must generate instructions to continue the simulator development. With the expert's intervention, the complete simulator will be intended to imitate real conditions of fluvial assault, for example by developing new firearm and motorboat appearance features.
– The motion production must be validated against a real ship; an expert must test the perception of orientation and acceleration variables.
– Training routines must be developed for a more formal validation of the simulator's use for training. This can be done by comparing the real shooting performance of a person trained in the simulator.
– More real phenomena can be taken into account; for example, the interaction of the water with the motorboat can be considered.
References 1. Eurosimulator, http://www.hlberg.dk/eurosimulator 2. Marksman training systems, http://www.marksman.se 3. Noptel (2000), http://www.noptel.fi/eng/mil/index.php
4. Shooter ready, http://www.shooterready.com/lrs.html 5. Shotpro shooting simulator (2000), http://www.trojansim.com/shotpro2000.html 6. Cárdenas, J.: Diseño e implementación del sistema de control para un simulador de vehículo terrestre, Bogotá, Colombia. Universidad de los Andes (2007) 7. Clark, J., Kendir, T., Shechter, M., Rosa, S.: Firearm laser training system and method employing an actuable target assembly, USA (June 2003) Patent 6575753 8. Hincapié, D.: Prototype of a dynamic immersive fluvial assault simulator, Bogotá, Colombia. Universidad de los Andes (June 2008) 9. Lvovskiy, M.: Training simulator for sharp shooting, USA (September 2005) Patent 6942486 10. Morley, R., Buick, J.: Projected imaged weapon training apparatus, USA (July 1987) Patent 4680012 11. Ordóñez, S.: Simulador inmersivo de ataque con arma de fuego unipersonal en un ambiente fluvial, Bogotá, Colombia. Universidad de los Andes (December 2007) 12. Powell, R., Jacobsen, W.: Laser weapon simulator apparatus with firing detection system, USA (January 1997) Patent 5591032 13. Suzuki, K.: Shooting game machine, USA (November 1994) Patent 5366229 14. Tseng, H.-L., Fong, I.-K.: Implementation of a driving simulator based on a Stewart platform and computer graphics technologies. Asian Journal of Control 2(2) (June 2000) 15. Vivas, J.: Plataforma para simuladores dinámicos, Bogotá, Colombia. Universidad de los Andes (2006)
A Low-Cost, Linux-Based Virtual Environment for Visualizing Vascular Structures Thomas Wischgoll Computer Science and Engineering, Wright State University
Abstract. The analysis of morphometric data of the vasculature of any organ requires appropriate visualization methods due to the vast number of vessels that can be present in such data. In addition, the geometric properties of vessel segments, i.e., being rather long and thin, can make it difficult to judge relative position, despite depth cues such as proper lighting and shading of the vessels. Virtual environments that provide true 3-D visualization of the data can help enhance the visual perception. Ideally, the system should be relatively cost-effective. Hence, this paper describes a Linux-based virtual environment that utilizes a 50-inch plasma screen as its main display. The overall cost of the entire system is less than $3,500, which is considerably less than other commercial systems. The system was successfully used for visualizing vascular data sets, providing true three-dimensional perception of the morphometric data.
1
Introduction
The analysis of spatial perfusion of any organ requires detailed morphometry on the geometry (diameters, lengths, number of vessels, etc.) and branching pattern (3-D angles, connectivity of vessels, etc.). Accurate methodologies for extracting this morphometry from volumetric data such as CT scans are becoming available nowadays [1]. The resolution of the scans can range from a little less than a millimeter down to just a few micrometers. Especially the latter can result in a vast number of vessel segments that can be extracted. Once extracted, this morphometry needs to be visualized in order to be analyzed properly. Based on a geometric reconstruction, an accurate visualization of the vasculature can be derived. Additional morphometric and statistical information can be incorporated into the visualization as well to enhance its informational value. However, it is often difficult for a user to grasp the geometric configuration of the vasculature due to the high number of relatively thin and long objects presented by the individual vessel segments. Despite depth cues, such as proper lighting and shading of the vessels, it is often not easy to identify which vessel is in front and which one is in the back. A true 3-D visualization can help improve the visual perception. For example, Barco's CADWall, a large projection display capable of stereoscopic rendering based on a polarized projection system, can be used. Figure 1 shows the visualization of a large-scale vasculature
Fig. 1. Large-scale vasculature displayed on Barco’s CADWall in daytaOhio’s Appenzeller Visualization Laboratory at Wright State University
including vessels from the most proximal vessel down to the capillary level using this system. Unfortunately, such a visualization system is prohibitive in most projects due to the high cost involved. Hence, there is a need for visualization systems that are capable of stereoscopic 3-D rendering at a significantly lower cost. Therefore, this paper describes a virtual environment that utilizes non-traditional display technology, which is capable of running any OpenGL-based application that uses quad-buffered rendering to create a virtual environment. The described system is based on Linux to provide a versatile operating system environment. The system is also technically capable of running a Windows-based operating system. Despite the 50 inch large plasma screen being used as its main display, the entire system is available for less than $3,500 which makes it a very low-cost, yet powerful visualization system. The structure of this article is as follows: initially, work related to this article is discussed. Subsequently, the low-cost, Linux-based virtual environment is described. Finally, conclusions and future work are presented.
2
Related Work
Virtual environments consist of two major components: first, display technology is required that allows a user to view in 3-D. Second, 3-D suitable input devices are required that do not bind the user to a certain location but instead provide maximal freedom of movement to the user. There are a few different technologies
typically used for displaying the visualization. Head-mounted displays [2,3,4] consist of two small screens mounted into a device that the user wears similar to a helmet such that the two screens are placed in front of the user’s eyes. Since the device is equipped with two individual screens, different images for the left and right eye can be easily displayed resulting in stereoscopy experienced by the user. Typically, head-mounted displays have a resolution of only 800 by 600 pixels. Higher resolution head-mounted displays are available but significantly more expensive. One advantage of head-mounted displays is that some can be used as see-through devices for augmented reality systems [5]. Other display types [6,7,8] rely on glasses that hide the left image from the right eye and vice versa. This allows for a majority of display types to be used. Often large projection walls are used which can be configured as large wall-type displays or a CAVE-like environment. Two different types of glasses are used in combination with these displays: active and passive. With passive glasses, polarization is used to ensure that the left image can only be seen by the left eye. This requires two projectors where polarization filters with different polarization are placed in front of each projector. The glasses then only let light pass with the correct polarization so that each eye only sees the image generated by one of the projectors. Nowadays, even some TFT-based monitors are becoming available that work with passive polarization glasses. Active stereo glasses work similar to TFT screens. Polarization filters can be activated that block all incoming light. There is one filter for each eye which can be activated individually. The filters then need to be synchronized with the display in such a way that ensures that the right image is only visible to the right eye and vice versa. Typically, the visualization system displays images for the left and right eye in an alternating fashion and activates and deactivates the filter in the glasses for the left and right eye according to which eye the current image corresponds to. The advantage of this type of glasses is that they work with various different display types, such as projection displays, CRT screens, or plasma displays. However, they typically do not work with TFT screens since they, too, use polarization filters for displaying an image so that the active stereo glasses filter out the light entirely all the time. Recently, auto-stereoscopic displays were developed that are available at a reasonable price. The advantage of this type of display is that it does not require the user to wear any glasses. Typically, barrier screens are used so that the light of half of the pixels gets directed more towards the left and the other half more towards the right. This way, one half of the pixels are only visible by one eye, whereas the other half can be seen only by the other eye, assuming the user is located somewhat centered in front of the display. As input devices, different wand or stylus devices are typically used. Often times, these are tracked either magnetically or optically to determine their position in 3-D space without the need of any cabling. More recently, standard gaming devices are utilized in virtual environments as well which are wirelessly connected to the computer. Wischgoll et al. [27] discuss the advantages of game controllers for navigation within virtual environments. Dang et al. [28] studied the usability
of various interaction interfaces, such as voice, wand, pen, and sketch interfaces. Klochek et al. [29] introduced metrics for measuring the performance when using game controllers in three-dimensional environments. Wilson et al. [30] presented a technique for entering text using a standard game controller. Based on the previously described technology, a visualization of a model, such as a vasculature, can be presented to a user. In order to navigate through or around a displayed model, the camera location needs to be modified. In general, a camera model describes point of view, orientation, aperture angle, and direction and ratio of motion. A general system for camera movement based on the specification of position and orientation of the camera is presented in [9], whereas Gleicher et al. [10] chose an approach where through-the-lens control was applied by solving for the time derivatives of the camera parameters. The concept of walkthroughs in simulated virtual worlds using a flying metaphor was first explored by Brooks [11]. Other commonly applied metaphors for navigation in virtual environments (VEs) such as ”eyeball in hand”, ”scene in hand” and ”flying vehicle control” were introduced by Ware and Osborne [12]. For camera and viewpoint navigation in virtual endoscopy systems, various aspects have to be considered. While free manual navigation in 3-D generates the problem of potential disorientation, proceeding automatically on a planned path is often too constraining. Planned navigation with automatic path planning by specifying camera parameters at key points was explored for example by Nain et al. [13]. A combination of manual and planned navigation is called guided navigation. While Galyean [14] applied a river analogy for guided navigation in VEs, Hong et al. [15] among others utilized guided navigation paradigms with a combination of distance fields and kinematic rules for collision avoidance. Lorensen et al. [16] described the use of a virtual endoscope for several types of data. Internal views of the data were explored by generating camera paths with key framing and robot path planning algorithms. Kaufman et al. [17] enhanced their endoscopy system (volumetric environment) with automatic fly-through capability based on flight-path planning with the possibility of an interactive walk-through. Application areas for virtual endoscopy [18] are, for example, virtual colonoscopy [19], virtual angioscopy [20], and vessel visualization and exploration of the vasculature of the human liver [21]. The ViVa project [22] presented visualization solutions for virtual angioscopy and provides simple tools for measuring single distances inside the vessels. Sobel et al. [23] described a visionary system featuring novel visualizations and views of bifurcations. In addition, the blood flow was depicted by particles visualized as glyphs. Since the visualization aspect concentrated on non-photorealistic visualization techniques, no textures were used and no complex surface details were visible. This might be a restriction for physicians who are used to traditional (realistic) visualizations and real-life endoscopic images. Bajaj et al. [24] segmented a CT scanned human heart using a seeded contour algorithm. The extracted parts were then aligned with a template to derive a patient-specific heart model. In addition, segmented vessels could then be further refined based on a NURBS interpolation to allow for an accurate simulation
of blood flow in a patient-specific model as shown by Zhang et al. [25]. Similarly, Forsberg et al. [26] introduced a virtual environment for exploring the flow through an artery.
3
A Linux-Based Virtual Environment
The system described in this paper uses a large-screen plasma display as the main screen for the virtual environment combined with shutter glasses. The shutter glasses used for the described system are TriDef’s 3-D Wireless Glasses. These glasses come with an infrared emitter to signal the glasses to flip between the left and right eye. Figure 2 depicts the model used for the described system. The package also contains a software CD which only runs on Windows operating systems. Even though the emitter for the shutter glasses uses a regular DIN connector, it cannot be plugged into the port at the graphics card since a different protocol is used. Instead, the emitter needs to be plugged into the port at the back of the display screen. Accordingly, only screens that provide such a port can be used in this configuration. Various such screens are available from manufactures such as Mitsubishi and Samsung that are equipped with 3-D technology. The described setup uses the 50 inch plasma display Samsung P50A450. This plasma display has a resolution of 1360 by 768 pixels. Higher resolution rear-projection displays with full HD resolution of 1920 by 1080 are also available from Samsung that also provide the necessary 3-D capabilities. The advantage of plasma displays over projectors is the higher durability, while still providing a large display surface. Typically, the life expectancy of a plasma display is at least ten times as long compared to projectors. In Linux, a stereo capable graphics card, such as NVidia Quadro or ATI FireGL cards, is required in order to generate quad-buffered stereo images. The test system is equipped with an NVidia Quadro 3700FX graphics card. The graphics card has two dual DVI connectors which are hooked up to the plasma screen and a regular TFT display of the same resolution as can be seen in figure 3. The TwinView mode provided by NVidia’s graphics driver is used to show the exact same content on both screens at the same time. The driver requires both screens to use the exact same configuration in this setting. This not only means
Fig. 2. TriDef 3-D stereo glasses
Fig. 3. Configuration of the Linux-based virtual environment showing a large-scale cardiovascular tree
that the same resolution is used, but also the exact same modeline, i.e. the timing string used by the X-server to define resolution and frequency used for driving the monitor, to configure the X-server for both displays. Since both screens are connected via DVI connectors, the monitor configuration is retrieved automatically by the X-server to identify the modeline for both screens via the EDID response of the monitors. Unfortunately, the screens used for the system do not provide the exact same modeline. The timing and resolution is identical. However, the synchronization mode differs, with the Samsung display requiring a positive polarity for the horizontal sync signal whereas the other screen uses negative polarity. This automatically disables the stereo mode in the X-server so that no quad-buffered rendering mode is available. Fortunately, this problem can be solved relatively easily by using NVidia’s setup tool to download the EDID information of one of the screen. The X-server allows for providing EDID information in a file so that the information just downloaded from one screen can then be used for the other screen. Consequently, the system now thinks that two Samsung plasma displays are connected and it automatically uses the exact same configuration. As a result the stereo mode is no longer disabled when starting the X-server with both screens attached. The Samsung displays support different modes for rendering stereo images, specifically horizontal vertical, and checkerboard. All three modes expect that the images for the left and right eye are sent to the display as a single interleaved
Fig. 4. Nintendo Wii controller and nunchuck
image. In the horizontal mode, the first row is part of the left image, the second row belongs to the right image, and so forth. Similarly, in the vertical mode the images are split up by column whereas the checkerboard mode intertwines the pixels. In Linux, the NVidia driver supports the vertical mode so that the X-server was configured to use this mode. This results in an effective resolution of 680 by 768 which is displayed at a refresh rate of 60 Hz. In order for the image to not appear striped and lessen the effect of the reduction in resolution due to the use of the stereoscopic mode, the Samsung display post-processes the images by averaging neighboring pixels. As a result, the stereo images do not appear to be based on this vertical mode at all, i.e. no striping effect is visible. The display manages to generate smooth images without gaps. Only single pixels, as used in fonts for example, appear slightly distorted. Since the system uses a large-screen display mounted at eye-sight for a typical standing user, traditional input devices, such as keyboard and mouse are not suitable at all. Due to the lack of desk space within reach of a user standing in front of the display, keyboard or mouse simply cannot be used in a reasonable fashion from a usability point of view. Instead, standard game devices are deployed to navigate the system and change settings within the visualization. Game devices such as the Logitech Wingman already proved to be useful for medical visualization [27]. In this paper, Nintendo’s Wii controller is used as the input device for the system described. The Wii controller, as depicted on the right in figure 4, provides four buttons arranged in a two-axis layout as well as six additional buttons. If extended using the nunchuck shown in figure 4 on the left, which is simply connected to the Wii controller by a supplied cable, an additional button and a joystick is available. The Wii controller is particularly suitable since it is equipped with accelerometers that allow the device to determine its rotational position. Two rotational axes are detected based on the accelerometers. Additionally, the system can be equipped with a sensor bar. The sensor bar essentially is just a set of four infrared LEDs mounted in a row.
Fig. 5. Sensor bar used in combination with the Wii controller
Fig. 6. Linux-based virtual environment showing a cardiovascular tree to a user
Hence, any sensor bar can be used instead of the one provided with the Wii entertainment system. One example is shown in figure 5. The Wii controller has a camera built into the front of the device which detects the location of the LEDs of the sensor bar. With this additional information, the Wii controller is capable of detecting rotations in all directions, i.e. yaw, pitch, and roll. In order to communicate with the Wii entertainment system, the Wii controller utilizes the Bluetooth protocol. To use it in combination with a computer, a Bluetooth dongle needs to be used. The described system is based on Mandriva 2007 which already comes with all the necessary packages required for the Bluetooth protocol. The setup uses an ASUS Bluetooth dongle (ASUS WL-BTD201M) which is directly supported by this Linux distribution. In order to drive the Wii controller, The interface library wiiuse is used which is available at http://www.wiiuse.net/. This library allows any C-based program to check for pushed buttons on the Wii controller, determine the rotational position of the device, or identify the position of the joystick on the nunchuck. Figure 6 shows the entire system with all its components showing a vascular structure to a user. The vascular structure was previously extracted from a CT
Fig. 7. Fly-through mode through the vasculature including particle simulation; detailed statistical information is continuously updated on the right during fly-through
scan of a porcine heart, which determined the center lines and radii of all vessel segments that were detected [1]. This then results in a data structure that defines vessel segments as the center line with radii information at both ends. Based on this information, conic cylinders can be generated to represent each vessel segment. At the vessel bifurcation, where a single vessel segment forks into two or more daughter vessels, the intersection between these conic cylinders is computed to remove any obstruction within the interior of the vessels. The user can then examine the vasculature from an external point of view or fly through the vasculature. The fly-through mode can be enhanced with a particle simulation that traces erythrocytes, leukocytes, and platelets through the vasculature as shown in figure 7. For the external view, the rotational position of the Wii controller determines the rotation of the vasculature. Via the library wiiuse, the exact angles for yaw, pitch, and roll the Wii controller is held at are identified. The rotational matrix for displaying the vasculature is then updated according to the change in these angles. This allows a user to rotate the vasculature in a very intuitive fashion since it rotates exactly as the Wii controller is rotated. Two buttons on the Wii controller are used to activate and deactivate the rotational mode so that a user can reposition his or her hand, thereby avoiding unnatural stretching or bending of the wrist. The two-axis aligned buttons at the top of the controller are used to move the vasculature parallel to the screen, whereas two of the remaining buttons are used to zoom in and out. Alternatively, the joystick on the nunchuck could be used for panning. The home button on the center of the controller always allows the user to reset the view to its initial setting. Since all buttons are very accessible at all times, the user has full control over the view
settings and can rotate, pan, and zoom all at the same time. In the fly-through mode, the rotational position of the Wii controller determines the direction of movement, while the vertical axis on the top of the device can be used to slow down or accelerate the movement. Again, this makes for a very intuitive control of the fly-through.
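The rotation update itself reduces to applying the change in the controller's orientation since the rotation mode was engaged, regardless of how the yaw/pitch/roll angles are obtained from wiiuse; a minimal sketch follows, with illustrative names.

    #include <GL/gl.h>

    struct Orientation { float yawDeg, pitchDeg, rollDeg; };

    // Applied on top of the model-view transform stored when the rotation
    // mode was activated, so the model rotates exactly as the controller does.
    void applyControllerRotation(const Orientation& now, const Orientation& atGrab) {
        glRotatef(now.yawDeg   - atGrab.yawDeg,   0.f, 1.f, 0.f);
        glRotatef(now.pitchDeg - atGrab.pitchDeg, 1.f, 0.f, 0.f);
        glRotatef(now.rollDeg  - atGrab.rollDeg,  0.f, 0.f, 1.f);
        // ...render the vasculature here...
    }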
4
Conclusions and Future Work
This paper described a Linux-based virtual environment. The system is capable of running any application that uses quad-buffered OpenGL stereoscopic rendering to display stereoscopic imagery using shutter glasses. Its 3-D rendering capabilities were tested with the FAnToM software package, developed in Gerik Scheuermann's group at the universities of Kaiserslautern and Leipzig, as well as with the visualization software developed for visualizing vascular structures. Both ran flawlessly on the described system. The presented system is very reasonably priced at less than $3,500. The 3-D view helps to better perceive the three-dimensional structure of the vasculature, which cannot be provided as easily by the 2-D projections offered by conventional display technology. The use of the Wii controller enables the system to be used in a very intuitive fashion. In the future, user studies need to be performed to determine which button-layout configurations on the Wii controller are most user-friendly and intuitive. Selection methods will be implemented that allow a user to select individual vessel segments so that the system can display additional information about them, such as vessel volume or cross-sectional area. For example, the direction the Wii controller is currently pointing at could be used in combination with the sensor bar to determine which vessel segment the Wii controller is aimed at in order to make the selection.
Acknowledgments The author would like to thank daytaOhio for providing access to the Appenzeller Visualization Laboratory and Barco’s CADWall as well as the visualization group at the University of Leipzig for continuously supplying updates to the FAnToM software package. This project was funded in part by Wright State University, the Ohio Board of Regents, and the Ohio Department of Development through the Early Lung Disease Detection Alliance.
References 1. Wischgoll, T., Choy, J., Ritman, E., Kassab, G.S.: Validation of image-based extraction method for morphometry of coronary arteries. Annals of Biomedical Engineering (to appear, 2008) 2. Sutherland, I.: A head-mounted three-dimensional display. In: Proc. the Fall Joint Computer Conference, pp. 757–764 (1968)
3. Fisher, S., McGreevy, M., Humphries, J., Robinett, W.: Virtual environment display system. In: Workshop on Interactive 3D Graphics, pp. 77–87 (1986) 4. Chung, J.C., Harris, M.R., Brooks Jr., F.P., Fuchs, H., Kelley, M.T., Hughes, J.W., Ouh-Young, M., Cheung, C., Holloway, R.L., Pique, M.: Exploring virtual worlds with headmounted displays. In: Proceedings SPIE Conference, Non-holographic True Three-Dimensional Display Technologies, pp. 42–52 (1989) 5. Rolland, J.P., Fuchs, H.: Optical versus video see-through head-mounted displays in medical visualization. Presence 9, 287–309 (2000) 6. Cliburn, D.C.: Virtual reality for small colleges. J. Comput. Small Coll. 19, 28–38 (2004) 7. Pape, D., Anstey, J.: Building an affordable projective, immersive display. In: SIGGRAPH 2002: ACM SIGGRAPH 2002 conference abstracts and applications, p. 55. ACM, New York (2002) 8. Czernuszenko, M., Pape, D., Sandin, D., DeFanti, T., Dawe, G.L., Brown, M.D.: The immersadesk and infinity wall projection-based virtual reality displays. SIGGRAPH Comput. Graph. 31, 46–49 (1997) 9. Drucker, S.M., Galyean, T.A., Zeltzer, D.: Cinema: a system for procedural camera movements. In: Proceedings of the 1992 symposium on Interactive 3D graphics, pp. 67–70. ACM Press, New York (1992) 10. Gleicher, M., Witkin, A.: Through-the-lens camera control. In: Computer Graphics (SIGGRAPH 1992 Proceedings), vol. 26, pp. 331–340 (1992) 11. Brooks Jr., F.: Walkthrough - A dynamic graphics system for simulating virtual buildings. In: Proceedings SIGGRAPH Workshop on Interactive 3D Graphics, pp. 9–21 (1986) 12. Ware, C., Osborne, S.: Exploration and virtual camera control in virtual three dimensional environments. Computer Graphics 24, 175–183 (1990) 13. Nain, D., Haker, S., Kikinis, R., Grimson, W.E.L.: An interactive virtual endoscopy tool. In: Proceedings of the IMIVA 2001 workshop of MICCAI, Utrecht(NL) (2001) 14. Galyean, T.A.: Guided navigation of virtual environments. In: Hanrahan, P., Winget, J. (eds.) Proceedings of the 1995 symposium on Interactive 3D graphics, pp. 103–104. ACM Press, New York (1995) 15. Hong, L., Muraki, S., Kaufman, A., Bartz, D., He, T.: Virtual voyage: Interactive navigation in the human colon. In: Proceedings of SIGGRAPH 1997, pp. 27–34 (1997) 16. Lorensen, W.E., Jolesz, F.A., Kikinis, R.: The exploration of cross-sectional data with a virtual endoscop. Interactive Technology and the New Health Paradigm, 221–230 (1995) 17. Wan, M., Kaufman, F.D.A.: Distance-field based skeletons for virtual navigation. In: Ertl, T., Joy, K., Varshney, A. (eds.) IEEE Visualization 2001, San Diego, CA, pp. 239–245. IEEE Computer Society Press, Los Alamitos (2001) 18. Vilanova Bartrol´ı, A., K¨ onig, A., Gr¨ oller, E.: VirEn: A virtual endoscopy system. Machine GRAPHICS & VISION 8, 469–487 (1999) 19. You, S., Hong, L., Wan, M., Junyaprasert, K., Kaufman, A., Muraki, S., Zhou, Y., Wax, M., Liang, Z.: Interactive volume rendering for virtual colonoscopy. In: Yagel, R., Hagen, H. (eds.) IEEE Visualization 1997, pp. 433–436. IEEE Computer Society Press, Los Alamitos (1997) 20. Bartz, D., Straßer, W., Skalej, M., Welte, D.: Interactive exploration of extraand interacranial blood vessels. In: Ebert, D., Gross, M., Hamann, B. (eds.) IEEE Visualization 1999, San Francisco, pp. 389–392. IEEE Computer Society Press, Los Alamitos (1999)
21. Hahn, H.K., Preim, B., Selle, D., Peitgen, H.O.: Visualization and interaction techniques for the exploration of vascular structures. In: IEEE Visualization 2001, pp. 395–402. IEEE Computer Society Press, Los Alamitos (2001) 22. Abdoulaev, G., Cadeddu, S., Delussu, G., Donizelli, M., Formaggia, L., Giachetti, A., Gobbetti, E., Leone, A., Manzi, C., Pili, P., Scheinine, A., Tuveri, M., Varone, A., Veneziani, A., Zanetti, G., Zorcolo, A.: ViVa: The virtual vascular project. IEEE Transactions on Information Technology in Biomedicine 22, 268–274 (1998) 23. Sobel, J.S., Forsberg, A.S., Laidlaw, D.H., Zeleznik, R.C., Keefe, D.F., Pivkin, I., Karniadakis, G.E., Richardson, P.: Particle flurries: a case study of synoptic 3d pulsatile flow visualization. IEEE Computer Graphics and Applications 24, 76–85 (2004) 24. Bajaj, C., Goswami, S., Yu, Z., Zhang, Y., Bazilevs, Y., Hughes, T.: Patient specific heart models from high resolution ct. In: Proceedings of Computational Modelling of Objects Represented in Images, pp. 157–165 (2006) 25. Zhang, Y., Bazilev, Y., Goswami, S., Bajaj, C., Hughes, T.J.R.: Patient-specific vascular nurbs modeling for isogeometric analysis of blood flow. Computer Methods in Applied Mechanics and Engineering 196, 2943–2959 (2007) 26. Forsberg, A.S., Laidlaw, D.H., van Dam, A., Kirby, R.M., Karniadakis, G.E., Elion, J.L.: Immersive virtual reality for visualizing flow through an artery. In: Ertl, T., Hamann, B., Varshney, A. (eds.) IEEE Visualization 2000, Piscataway, NJ, pp. 457–460. IEEE Computer Society Press, Los Alamitos (2000) 27. Wischgoll, T., Moritz, E., Meyer, J.: Navigational aspects of an interactive 3d exploration system for cardiovascular structures. In: IASTED International Conference on Visualization, Imaging, and Image Processing (VIIP 2005), pp. 721–726 (2005) 28. Dang, N.T., Tavanti, M., Rankin, I., Cooper, M.: A comparison of different input devices for a 3d environment. In: ECCE 2007: Proceedings of the 14th European conference on Cognitive ergonomics, pp. 153–160. ACM, New York (2007) 29. Klochek, C., MacKenzie, I.S.: Performance measures of game controllers in a threedimensional environment. In: GI 2006: Proceedings of Graphics Interface 2006, pp. 73–79. Canadian Information Processing Society, Canada (2006) 30. Wilson, A.D., Agrawala, M.: Text entry using a dual joystick game controller. In: CHI 2006: Proceedings of the SIGCHI conference on Human Factors in computing systems, pp. 475–478. ACM, New York (2006)
Visualization of Dynamic Connectivity in High Electrode-Density EEG
Alfonso Alba and Edgar Arce-Santana
Facultad de Ciencias, Universidad Autónoma de San Luis Potosí, Diagonal Sur S/N, Zona Universitaria, 78240, San Luis Potosí, SLP, México
Tel.: +52 444 8262486 x 2906
[email protected],
[email protected]
Abstract. A visualization methodology for the analysis of dynamic synchronization in electroencephalographic signals is presented here. The proposed method is based on a seeded region-growing segmentation of the time-frequency space in terms of spatial connectivity patterns, a process that can be fully automated by cleverly choosing the seeds. A Bayesian regularization technique is applied to further improve the results. Finally, preliminary results from the analysis of a high electrode-density dataset with 120 channels are shown.
1 Introduction
Brain electroencephalography (EEG) consists of voltage measurements obtained from one or more electrodes placed on the scalp, or directly inside the cortex. These measurements represent the average electrical activity of the underlying neural networks. During the execution of relatively complex tasks, such as adding two numbers or making a decision, specialized and possibly distant areas of the brain interact together forming neural assemblies [1]. One of the most plausible mechanisms for this integration is the dynamical formation of reciprocal links between networks of neurons, which may be observed as phase synchronization between the EEG signals of the corresponding electrodes over different frequency bands [2]. In particular, it has been observed that the phase difference between the corresponding signals approaches zero during episodes of high synchronization [3] [4]. These observations are further supported by analytic models of the EEG signals [5] [6], where the phase difference between two reciprocally coupled areas is always zero (in-phase) or π (anti-phase), regardless of the inter-areal distance. Phase synchronization is typically measured between pairs of narrow-band signals, which can be obtained from a time-frequency decomposition of the raw EEG signals. The data obtained from this type of analysis is multi-dimensional and usually large in size, particularly in the case of high electrode-density EEG (64 or more electrodes); for example, synchrony data from a typical high density EEG recording with 128 channels (128² electrode-pairs) and 1024 samples (per channel), analyzed over 10 frequency bands, would yield around 167 million
synchrony values. This represents a serious visualization problem which most authors avoid by averaging across a large time window [7] [8], and/or by limiting the analysis to specific frequency bands or electrode pairs [9] [4]; however, our experience is that psychophysiologically relevant synchrony patterns which are spatially complex may arise at different frequency bands from 1 to 50 Hz, and have durations ranging from 100 ms to over 1 second. Recent efforts have been made to obtain better visual representations of EEG synchrony dynamics, including time-frequency-topography (TFT) displays [6], and electrode-grouping techniques [10]. In a TFT display, one divides the time-frequency (TF) space into cells and plots a head diagram at each cell showing the spatial distribution of a given EEG measure for the corresponding TF window. Since only one spatial dimension can be represented in a TFT map, one can only display, for example, the average degree of synchronization between one electrode and the others. Under the hypothesis that a particular spatial pattern of synchronization-desynchronization that occurs in a particular TF window corresponds to a specific neural process, we have recently developed a novel visualization method for EEG synchrony dynamics based on the segmentation of the time-frequency plane into regions where the spatial synchronization pattern is relatively constant [11]. In this paper, we present a fully-automated version of our methodology, and evaluate its performance with high electrode-density data from a real EEG experiment. This paper is organized as follows: in Section 2, we explain the methodology used for the estimation of significant changes in EEG phase-synchronization. Section 3 presents the segmentation technique used to find and classify synchronization patterns. Finally, our results and conclusions are presented, respectively, in Sections 4 and 5.
2 Estimation of Significant Changes in EEG Synchrony
In many psychophysiological EEG experiments, a stimulus is presented to one or more subjects, who are instructed to respond by performing a specific task. Neuroscientists are often interested in how some EEG measure, such as power or synchronization, changes with respect to a certain value, called the baseline, which is obtained during a neutral state (e.g., before a stimulus is presented, or before treatment). It is also common to have the subjects perform several trials of the experiment, in order to increase the statistical robustness of the results. The raw EEG signals can be seen as a set of time-series Ve,j (t), where t denotes (discrete) time, e indicates the electrode position, and j is the trial number. Throughout this paper we will use Nt to denote the number of samples in each EEG signal, Ne for the number of electrodes, and Nr for the number of trials. Also, Ts will represent the time (in samples) at which the stimulus is presented, and fs will be the EEG sample rate. For each electrode e we also know their projection (xe , ye ) to a 2D unit sphere representing the head surface.
2.1 Time-Frequency Decomposition
One common way to extract phase information from the EEG signals is by running them through a bank of narrow bandpass filters tuned at the frequencies of interest. In particular, we have obtained good results with a bank of sinusoidal quadrature filters (SQF's) [12], whose frequency response is given by:

G_{f_k,h}(f) = \begin{cases} \frac{1}{2}\left[1 + \cos\left(\frac{f - f_k}{h_k}\,\pi\right)\right] & \text{if } f \in [f_k - h_k, f_k], \\ \frac{1}{2}\left[1 + \cos\left(\frac{f - f_k}{h}\,\pi\right)\right] & \text{if } f \in [f_k, f_k + h], \\ 0 & \text{otherwise,} \end{cases}   (1)

where f_k is the tuning frequency for the k-th filter, h is the bandwidth, and h_k = min{h, f_k}. The filter kernels g_{f_k,h} are computed as the fast inverse Fourier transform of G_{f_k,h}; therefore, one can compute the complex filtered signal F_{f,e,j} as the convolution of V_{e,j} and g_{f,h}, from which the phase φ_{f,e,j} can be directly obtained. One may assume the phase to be between −π and π, as obtained with the atan2 function in C or Matlab. For our tests, we have chosen a bank of filters tuned at each Hz from 1 to 20 Hz, each filter with a bandwidth of approximately 1.76 Hz within 3 dB of attenuation.
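As a concrete illustration of this filter bank, the following Python/NumPy sketch builds the SQF response of Eq. (1) in the frequency domain and applies it to a single raw EEG trace to obtain its per-band phase. The function names, the sampling-rate argument, and the FFT-based implementation are our own assumptions; only the 1.76 Hz bandwidth and the 1-20 Hz tuning grid come from the text.

import numpy as np

def sqf_response(freqs, fk, h=1.76):
    # Frequency response G_{fk,h}(f) of Eq. (1); zero outside [fk - hk, fk + h].
    hk = min(h, fk)
    G = np.zeros_like(freqs, dtype=float)
    left = (freqs >= fk - hk) & (freqs <= fk)
    right = (freqs > fk) & (freqs <= fk + h)
    G[left] = 0.5 * (1.0 + np.cos((freqs[left] - fk) / hk * np.pi))
    G[right] = 0.5 * (1.0 + np.cos((freqs[right] - fk) / h * np.pi))
    return G

def band_phase(v, fs, fk, h=1.76):
    # Filter the raw trace v with the SQF tuned at fk; since G only covers positive
    # frequencies, the filtered signal is complex and its angle gives the phase.
    freqs = np.fft.fftfreq(len(v), d=1.0 / fs)
    F = np.fft.ifft(np.fft.fft(v) * sqf_response(freqs, fk, h))
    return np.angle(F)   # phase in (-pi, pi], as with atan2

# Hypothetical usage: phases of one channel for a bank tuned at 1..20 Hz.
# phase = {fk: band_phase(raw_channel, fs=200.0, fk=float(fk)) for fk in range(1, 21)}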
2.2 Estimation and Classification of Phase Synchrony
EEG synchronization is typically measured as some form of correlation between pairs of electrode signals. Among the most popular measures used in the literature are the statistical coherence [13], and measures based on the circular variance of the phase difference [9] [14]. In particular, we use a measure based on the average phase difference, which favors in-phase synchronization (see [6] for a thorough comparison of these measures). Specifically, we compute the inter-electrode synchronization μ_{t,f,e_1,e_2} for each time t, frequency f, and electrode pair (e_1, e_2) as

\mu_{t,f,e_1,e_2} = 1 - \frac{1}{\pi N_r} \sum_{j=1}^{N_r} \left| \mathrm{wrap}\left[ \phi_{f,e_1,j}(t) - \phi_{f,e_2,j}(t) \right] \right|,   (2)
where wrap[x] wraps its argument to the interval [−π, π). This measure is called the mean phase difference (MPD).
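A minimal sketch of the MPD computation (Eq. 2) for one electrode pair and one frequency band, assuming the per-trial phases have already been obtained with the filter bank above; the array shapes and names are illustrative.

import numpy as np

def wrap(x):
    # Wrap an angle (or array of angles) to the interval [-pi, pi).
    return (x + np.pi) % (2.0 * np.pi) - np.pi

def mpd(phase_a, phase_b):
    # phase_a, phase_b: arrays of shape (Nr, Nt) with the per-trial phase of two electrodes.
    # Returns mu[t] in [0, 1] for every sample t (Eq. 2).
    Nr = phase_a.shape[0]
    return 1.0 - np.abs(wrap(phase_a - phase_b)).sum(axis=0) / (np.pi * Nr)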
2.3 Classification of Significant Activity
The MPD measure yields values between zero (no synchronization) and 1 (perfect synchronization); however, the actual differences between values at episodes of high synchrony and episodes of low synchrony are often very subtle, and thus require a statistical analysis to determine the true significance of the observations. One way to do this, is to compute the distribution of the measure under the neutral condition, which we call the null distribution, and use it to obtain a significance index for each value the measure takes. We are interested in detecting
significant changes in phase-synchronization with respect to the baseline, which in our case is obtained as the average synchronization during the pre-stimulus segment; therefore, we first subtract the baseline from the data as follows:

Y_{t,f,e_1,e_2} = \mu_{t,f,e_1,e_2} - \frac{1}{T_s} \sum_{s=1}^{T_s} \mu_{s,f,e_1,e_2}.   (3)
Positive Y-values indicate an increase in synchronization, whereas negative values represent a decrease, with respect to the pre-stimulus average. To test for significance, one can estimate, for each frequency and electrode pair, the null distribution p_Y of Y-values in the pre-stimulus. To do this, we approximate p_Y directly from the data using kernel density estimation [15], where the distribution can be estimated as the normalized sum of kernel functions k_h centered at each data point:

p_Y(y) = \frac{1}{Z} \sum_{t=1}^{T_s} k_h\left(y - Y_{t,f,e_1,e_2}\right),   (4)

where Z is a normalization constant chosen so that p_Y integrates to 1, and h is a parameter which specifies the width of the kernel functions, and determines the smoothness of p_Y. In particular, we use a Gaussian kernel with a width given by Silverman's rule of thumb: h = 1.06 \sigma T_s^{-1/5}, where σ is the standard deviation of the sample data. The significance index S_{t,f,e_1,e_2} is then estimated as

S_{t,f,e_1,e_2} = \begin{cases} \dfrac{P_Y(Y_{t,f,e_1,e_2}) - P_Y(0)}{P_Y(0)} & \text{if } Y_{t,f,e_1,e_2} > 0, \\ -\dfrac{P_Y(0) - P_Y(Y_{t,f,e_1,e_2})}{1 - P_Y(0)} & \text{if } Y_{t,f,e_1,e_2} < 0, \end{cases}   (5)

where P_Y is the cumulative null distribution of the Y-values, i.e., P_Y(y) = \int_{-\infty}^{y} p_Y(z)\,dz. Deviations from the baseline are considered significant when the magnitude of the significance value exceeds a given threshold α (we use α = 0.99 in all our tests). One can thus compute a discrete class label field c, given by

c_{t,f,e_1,e_2} = \begin{cases} 1 & \text{if } S_{t,f,e_1,e_2} > \alpha, \\ -1 & \text{if } S_{t,f,e_1,e_2} < -\alpha, \\ 0 & \text{otherwise,} \end{cases}   (6)

which indicates if phase synchronization is significantly higher (c = 1), lower (c = −1), or equal (c = 0) than the pre-stimulus average.
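The baseline correction and significance test of Eqs. (3)-(6) could be sketched as follows; the kernel-CDF evaluation via scipy.stats.norm and the function names are our additions, and the branch denominators simply mirror Eq. (5) as reconstructed above.

import numpy as np
from scipy.stats import norm

def significance(mu, Ts, alpha=0.99):
    # mu: synchrony time series mu[t] for one (f, e1, e2); samples 0..Ts-1 are pre-stimulus.
    Y = mu - mu[:Ts].mean()                          # Eq. (3): subtract the baseline
    h = 1.06 * Y[:Ts].std() * Ts ** (-1.0 / 5.0)     # Silverman's rule of thumb
    P = lambda y: norm.cdf((y - Y[:Ts]) / h).mean()  # CDF of the Gaussian-kernel estimate (Eq. 4)
    P0 = P(0.0)
    S = np.zeros_like(Y)
    for t, y in enumerate(Y):                        # Eq. (5)
        if y > 0:
            S[t] = (P(y) - P0) / P0
        elif y < 0:
            S[t] = -(P0 - P(y)) / (1.0 - P0)
    c = np.where(S > alpha, 1, np.where(S < -alpha, -1, 0))   # Eq. (6)
    return S, c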
3 Classification and Segmentation of Synchrony Patterns
We define a synchronization pattern (SP) as an Ne × Ne matrix with values in {−1, 0, 1} which indicate, for each electrode pair, if synchronization between both electrodes deviates significantly from the baseline. The label field c can thus be seen as a 2D image in time-frequency (TF) space where each pixel ct,f
specifies the SP observed at time t and frequency f. To perform a segmentation of this image in regions where the SP is relatively constant, we propose a fast seeded region-growing algorithm coupled with a slower Bayesian regularization technique. We also include an automated seed selection algorithm that favors regions with high degree of homogeneity. For the rest of this section, we conveniently reorganize the label field c as a vector-valued image c_{t,f} ∈ {−1, 0, 1}^{N_s}, where N_s is the number of electrode pairs. For a symmetric synchrony measure (such as the MPD), where μ_{t,f,e_1,e_2} = μ_{t,f,e_2,e_1}, one can reduce the number of non-redundant electrode pairs to N_s = N_e(N_e − 1)/2 by considering only those pairs where e_1 < e_2.
3.1 Seeded Region-Growing Algorithm
For the region-growing algorithm, a representative synchrony pattern (RSP) is computed for each region. The algorithm takes a pixel from the border of some region, and compares its neighbors to the region's RSP (according to a suitable distance function). Any neighbors that are sufficiently similar to the RSP are included in the region, and the RSP is recomputed. This process is iterated until no region can be further expanded. An adequate distance function for our data is given by

d(p, q) = \frac{1}{N_s} \sum_{s=1}^{N_s} \left(1 - \delta(p_s - q_s)\right) w_s,   (7)
where p, q ∈ {−1, 0, 1}^{N_s} are two SP's, δ is the Kronecker delta function, and w_s is the weight assigned to the s-th electrode pair. In particular, we use the reciprocal of the inter-electrode distance as weight. We also define the average neighbor distance (AVD) \hat{d}(t, f) as

\hat{d}(t, f) = \frac{1}{|N(t, f)|} \sum_{(t', f') \in N(t, f)} d\left(C_{t,f}, C_{t',f'}\right),   (8)
where N(t, f) is the set of points neighboring (t, f), which, in our case, contains only the nearest neighbors of (t, f). The AVD roughly measures the local homogeneity of the SP's observed around some point (t, f). The region-growing algorithm computes a region label field l_{t,f}, which indicates the region to which each point (t, f) belongs. The value −1 represents an unlabeled pixel; therefore, we first initialize l_{t,f} = −1 for all t, f. The algorithm then finds a suitable seed (i.e., a pixel within a homogeneous region) and grows it as described above. Seed selection is performed by choosing the pixel with the lowest AVD (highest local homogeneity) among a set of candidates, which consist of all unlabeled pixels whose neighbors are also unlabeled. Specifically, to find and grow a new seed into a region, the algorithm performs the following steps:
1. Let S be the set of all unlabeled points (t, f) ∈ L whose neighbors are also unlabeled, and let (t*, f*) = arg min_{(t,f) ∈ S} \hat{d}(t, f) be the new seed.
2. Assign a new label k (i.e., such that, at this point, l_{t,f} ≠ k for all t, f) to the new seed. Let l_{t*,f*} = k.
3. Initialize the RSP r_k of region k with the SP corresponding to the seed point. In other words, let r_k = C_{t*,f*}.
4. Initialize a priority queue Q and insert the seed (t*, f*) in Q with a priority given by −\hat{d}(t*, f*).
5. While Q is not empty, do the following:
(a) Pull the highest-priority point (t, f) from Q.
(b) For each (t', f') ∈ N(t, f) such that l_{t',f'} = −1 and d(r_k, C_{t',f'}) < ε, for a given threshold ε, let l_{t',f'} = k, and insert (t', f') in Q with a priority given by −\hat{d}(t', f').
(c) If the region label field l has changed, re-compute the RSP r_k as the item-by-item mode of all SP's observed within the region; in other words,

r_{k,s} = \mathrm{mode}_{(t,f)\,:\,l_{t,f}=k} \left\{ C_{t,f,s} \right\},   (9)
for s = 1, . . . , N_s.
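The seed-growing loop (steps 2-5) could look roughly like the following sketch, which operates on the vectorized label field C of shape (T, F, Ns); the AVD values are assumed to be precomputed into an array `avd`, and the helper names, the 4-neighborhood, and the queue handling are our own choices.

import heapq
import numpy as np

def sp_mode(patterns):
    # Item-by-item mode of a list of SP vectors with entries in {-1, 0, 1} (Eq. 9).
    P = np.stack(patterns)
    counts = np.stack([(P == v).sum(axis=0) for v in (-1, 0, 1)])
    return np.array([-1, 0, 1])[counts.argmax(axis=0)]

def grow_region(C, l, avd, seed, k, w, eps):
    # C: (T, F, Ns) class labels, l: (T, F) region labels (-1 = unlabeled),
    # avd: (T, F) average neighbor distances, w: (Ns,) pair weights, eps: threshold.
    def d(p, q):                                   # Eq. (7)
        return np.dot((p != q).astype(float), w) / len(w)
    T, F = l.shape
    l[seed] = k
    members = [seed]
    r = C[seed].copy()                             # representative SP of region k
    heap = [(avd[seed], seed)]                     # min-heap on AVD = max priority on -AVD
    while heap:
        _, (t, f) = heapq.heappop(heap)
        changed = False
        for tn, fn in ((t - 1, f), (t + 1, f), (t, f - 1), (t, f + 1)):
            if 0 <= tn < T and 0 <= fn < F and l[tn, fn] == -1 and d(r, C[tn, fn]) < eps:
                l[tn, fn] = k
                members.append((tn, fn))
                heapq.heappush(heap, (avd[tn, fn], (tn, fn)))
                changed = True
        if changed:
            r = sp_mode([C[m] for m in members])   # step 5(c)
    return r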
3.2 Bayesian Regularization
Some of the regions obtained by region-growing may show holes (i.e., unlabeled points inside the region) or rough edges. This can be corrected by performing a regularization stage after all seeds have been grown. A technique that produces very good results is the Bayesian classification with a prior Markov random field (MRF) model [16] for the label field l. Under this model, the probability of observing l given the data c is

p(l \mid c) = \frac{1}{Z} \exp\left[-U(l)\right],   (10)

where Z is a constant that does not depend on l, and U(l) is the energy function, which in our case is given by

U(l) = -\frac{1}{N_s} \sum_{t,f} \Lambda_{t,f}(l_{t,f}) + \lambda_t \sum_{t,f} V(l_{t,f}, l_{t+1,f}) + \lambda_f \sum_{t,f} V(l_{t,f}, l_{t,f+1}),   (11)

where λ_t and λ_f are the time and frequency granularity parameters, respectively, V is the Ising potential function, which is given by V(l_{t,f}, l_{t',f'}) = 1 − 2δ(l_{t,f} − l_{t',f'}) where (t, f) and (t', f') are two adjacent sites, and Λ is a pseudo-likelihood function defined as

\Lambda_{t,f}(k) = \begin{cases} \log p_L(C_{t,f} \mid l_{t,f} = k), & \text{for } k \neq -1, \\ \frac{1}{N_k} \sum_{k' \neq -1} L_{t,f}(k') - \max_{k' \neq -1} \left\{ L_{t,f}(k') \right\}, & \text{for } k = -1, \end{cases}   (12)

with \log p_L(C_{t,f} \mid l_{t,f} = k) = \sum_{s=1}^{N_s} \log p_{k,s}(C_{t,f,s}). Here, p_{k,s}(q) is the probability of observing class q for the electrode pair s over region k. Given an approximate initial segmentation (which can be obtained from the region-growing
algorithm), one can estimate these probabilities simply by counting, for each region k and electrode pair s, the number of occurrences of class q. Regularization of the label field l is achieved by minimizing U(l). We do this by computing the Maximizer of Posterior Marginals (MPM) estimator using the Gibbs sampling algorithm [16] with the segmentation obtained from the region-growing stage as starting point.
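A compact sketch of the MPM estimation by Gibbs sampling, assuming the per-site pseudo-log-likelihoods Λ_{t,f}(k)/N_s have already been tabulated into an array `Lam` of shape (T, F, K) and the labels have been remapped to 0..K-1; the random-number handling and loop structure are ours.

import numpy as np

def mpm_regularize(Lam, l0, lam_t, lam_f, n_iter=500, rng=None):
    # Gibbs sampler for p(l | c) of Eq. (10); returns the maximizer of the sampled marginals.
    rng = np.random.default_rng() if rng is None else rng
    T, F, K = Lam.shape
    l = l0.copy()
    counts = np.zeros((T, F, K))
    for _ in range(n_iter):
        for t in range(T):
            for f in range(F):
                e = Lam[t, f].copy()               # +Lambda/Ns contribution to -U at site (t, f)
                for tn, fn, lam in ((t - 1, f, lam_t), (t + 1, f, lam_t),
                                    (t, f - 1, lam_f), (t, f + 1, lam_f)):
                    if 0 <= tn < T and 0 <= fn < F:
                        V = np.ones(K)
                        V[l[tn, fn]] = -1.0        # Ising potential 1 - 2*delta
                        e -= lam * V
                p = np.exp(e - e.max())
                p /= p.sum()
                l[t, f] = rng.choice(K, p=p)
                counts[t, f, l[t, f]] += 1
    return counts.argmax(axis=2)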
3.3 Visualization
The results of our methodology are displayed using two graphics: one showing the regions in the TF plane obtained from the segmentation, and the other showing the RSP's corresponding to those regions. The RSP's are plotted using multitopographic displays [6], which consist of a large head diagram where, at each electrode site, a smaller head diagram is displayed with the spatial distribution of the synchronization between the corresponding electrode and every other site (i.e., the columns of the RSP in its matrix form). However, in the case of high density EEG, plotting a head diagram for each electrode would lead to illegible results. One solution consists in grouping the electrodes into fewer cortical areas, which can be obtained, for example, from a Voronoi partition whose centers are the 19 standard sites of the 10-20 placement system (see below). A representative pattern is thus computed for each area as the average or item-by-item mode of the data vectors corresponding to the electrodes within the area.
4 Results
For illustrative purposes, we have applied the methodology described above to EEG data from a Go/NoGo task [17] designed to study the inhibition of the motor response. During this task, uppercase letters are shown on a screen, one at a time. The subject is instructed to respond by pressing a button only if an X that has been preceded by an O appears. This is the Go condition. Any letter different than X which has been preceded by an O accounts for the NoGo condition, as it originates the inhibition of the motor response. EEG was sampled with 120 electrodes every 5 ms. Ten subjects participated in the experiment, resulting in N_r = 400 trials for the Go condition and N_r = 356 for NoGo. The duration of each trial is 2560 ms with the stimulus onset at 1000 ms. Therefore, the parameters for this experiment are N_t = 512, T_s = 200, f_s = 200 Hz, and N_e = 120. All our tests were made on a 2.4 GHz Intel dual core workstation. The class label field c corresponding to each condition was pre-computed and stored on disk, a process which took several hours due to the amount of data being processed. Figure 2 presents a TFT synchrony increase histogram H_{t,f,e} = \sum_{e'=1}^{N_e} \delta(c_{t,f,e,e'} - 1). The segmentation and regularization algorithms were applied to the c fields using different homogeneously-spaced subsets of the 120 electrodes in order to evaluate the performance of our method with respect to the electrode density. The subsets were obtained using the following procedure:
1. Let M = {1, . . . , N_e} be the set of indices corresponding to all the electrodes used in the EEG recording.
2. Start with the subset M_1 containing the indices of those electrodes corresponding to the standard 10-20 placement system (19 electrodes).
3. Start with i = 1 and iterate while M − M_i ≠ ∅:
(a) For each e' ∈ M find its nearest neighbor in M_i; that is, g(e') = \arg\min_{e \in M_i} (x_e − x_{e'})^2 + (y_e − y_{e'})^2.
(b) For each e ∈ M_i, find the farthest electrode m(e) such that g(m(e)) = e. In other words, let m(e) = \arg\max_{e' \in M : g(e') = e} (x_e − x_{e'})^2 + (y_e − y_{e'})^2.
(c) Let M_{i+1} = M_i ∪ {m(e) : e ∈ M_i} be the next (larger) subset, and increase i by 1.
We obtained five different electrode montages of sizes 19, 38, 67, 96, and 120, all of which are shown in Figure 1. We then proceeded to perform the segmentation and regularization for both conditions, using only the electrode data from each subset. The segmentation parameters are ε = 0.3, λ_t = 2, and λ_f = 0.7. The number of seeds/regions was 12, one of them used for labeling the pre-stimulus segment. For the regularization stage, good results were obtained with 500 Gibbs sampler iterations (every pixel (t, f) is visited once in each iteration). Results for the Go condition (post-stimulus only), using 19 and 120 electrodes, are shown in Figures 2b and 2c, respectively. All RSP's were computed using the 120 electrodes. Note that the regions obtained from the 120-electrode segmentation coincide, to a great extent, with homogeneous regions in the TFT map. One can observe that relatively similar results are obtained with different electrode subsets, particularly in the case of larger regions. Some smaller regions, however, differ considerably between the 19-electrode and the 120-electrode segmentation: in some cases, for example, two regions with similar RSP's in the 19-electrode segmentation are combined into a single region when using 120 electrodes. Similar observations can be made from the results corresponding to the NoGo condition. Segmentation times for all cases are presented in Table 1. Note that the segmentation time increases more or less quadratically with the number of electrodes, whereas the regularization time barely increases. In fact, the regularization time does not depend on the number of electrodes, except during the initialization, when the log-likelihoods are pre-computed. One could therefore
Fig. 1. Subsets of a 120-electrode montage: M1 (19 electrodes), M2 (38 electrodes), M3 (67 electrodes), M4 (96 electrodes), and M5 (120 electrodes)
Fig. 2. (a) TFT synchrony increase histogram for the Go condition; the color scale indicates, for each electrode, the number of sites whose synchronization with the electrode increases significantly, (b) 19-electrode segmentation, (c) 120-electrode segmentation. The segmented TF maps are shown on the left side, while the RSP's are shown on the right side using a multi-toposcopic display where red (dark gray) areas represent synchrony increases, whereas green (light-gray) areas indicate synchrony decreases.
Table 1. Segmentation and regularization times for both conditions and different electrode montages. All times in seconds.

Montage   Ne    Seg (Go)   Reg (Go)   Seg (NoGo)   Reg (NoGo)
M1        19      2.418     14.525       2.563       15.474
M2        38      7.905     15.426      16.212       16.371
M3        67     30.813     17.872      58.071       19.353
M4        96     75.355     21.686     111.675       23.019
M5       120    117.115     25.893     173.730       27.793
obtain a very quick, rough segmentation for preview purposes and parameter adjusting by using fewer electrodes and possibly no regularization; once the optimal parameters are found, one can use the full electrode set and Bayesian regularization to obtain a high-quality segmentation.
5 Conclusions
In this paper, we discuss an efficient visualization system for the exploratory analysis of EEG synchronization patterns, and evaluate its performance with high electrode-density EEG data. This visualization is performed by segmenting the time-frequency space in regions with relatively homogeneous synchrony patterns, using a seeded region-growing algorithm with automatic seed selection. Bayesian regularization is also applied to improve the quality of the results. The rationale of this methodology is that a constant SP, which is observed over a relatively large time-frequency window, may be associated to a task-related neural process. Preliminary results with high electrode-density data show that segmentation time increases linearly with the number of lead pairs under analysis, whereas regularization time is nearly independent of the electrode density (it is mostly dependent on the number of regions). Very good results with 120 channels can be obtained in only a few minutes with a current off-the-shelf computer, whereas rough results for preview purposes are obtained in only a few seconds by limiting the analysis to fewer electrodes and reducing the number of iterations in the regularization stage. Acknowledgements. For this work, A. Alba was supported by grants PROMEP/ 103.5/07/2416 and C07-FAI-04-19.21. E. Arce was supported by grants PROMEP/103.5/04/1387 and C06-FAI-11-31.68. The authors would like to thank Dr. Thalia Harmony for providing the EEG data sets used in this work, and Dr. Jose L. Marroquin for his insight.
References 1. David, O., Cosmelli, D., Lachaux, J.P., Baillet, S., Garnero, L., Martinerie, J.: A theoretical and experimental introduction to the non-invasive study of large-scale neural phase synchronization in human beings. International Journal of Computational Cognition 1, 53–77 (2003)
2. Varela, F.J., Lachaux, J.P., Rodriguez, E., Martinerie, J.: The brainweb: Phase synchronization and large-scale integration. Nature Reviews Neuroscience 2, 229–239 (2001) 3. Friston, K.J., Stephan, K.M., Frackowiak, R.S.J.: Transient phase-locking and dynamic correlations: are they the same thing? Human Brain Mapping 5, 48–57 (1997) 4. Rodriguez, E., George, N., Lachaux, J.P., Martinerie, J., Renault, B., Varela, F.J.: Perception's shadow: long-distance synchronization of human brain activity. Nature 397, 430–433 (1999) 5. David, O., Friston, K.J.: A neural mass model for MEG/EEG coupling and neuronal dynamics. Neuroimage 20, 1743–1755 (2003) 6. Alba, A., Marroquin, J.L., Peña, J., Harmony, T., Gonzalez-Frankenberger, B.: Exploration of event-induced EEG phase synchronization patterns in cognitive tasks using a time-frequency-topography visualization system. Journal of Neuroscience Methods 161, 166–182 (2007) 7. David, O., Cosmelli, D., Friston, K.J.: Evaluation of different measures of functional connectivity using a neural mass model. Neuroimage 21, 659–673 (2004) 8. Mizuhara, H., Wang, L., Kobayashi, K., Yamaguchi, Y.: Long-range EEG phase synchronization during an arithmetic task indexes a coherent cortical network simultaneously measured by fMRI. Neuroimage 27, 553–563 (2005) 9. Lachaux, J.P., Rodriguez, E., Martinerie, J., Varela, F.J.: Measuring phase synchrony in brain signals. Human Brain Mapping 8, 194–208 (1999) 10. ten Caat, M., Maurits, N.M., Roerdink, J.B.T.M.: Data-Driven Visualization and Group Analysis of Multichannel EEG Coherence with Functional Units. IEEE Transactions on Visualization and Computer Graphics 14, 756–771 (2008) 11. Alba, A., Arce, E.: Interactive segmentation of EEG synchrony data in time-frequency space by means of region-growing and bayesian regularization. In: Electronics, Robotics and Automotive Mechanics Conference (CERMA 2007), pp. 235–240. IEEE Computer Society, Los Alamitos (2007) 12. Guerrero, J.A., Marroquin, J.L., Rivera, M., Quiroga, J.A.: Adaptive monogenic filtering and normalization of ESPI fringe patterns. Opt. Lett. 30, 3018–3020 (2005) 13. Nunez, P.L., Srinivasan, R., Westdorp, A.F., Wijesinghe, R.S., Tucker, D.M., Silberstein, R.B., Cadusch, P.J.: EEG coherency I: statistics, reference electrode, volume conduction, Laplacians, cortical imaging, and interpretation at multiple scales. Electroencephalography and Clinical Neurophysiology 103, 499–515 (1997) 14. Lachaux, J.P., Rodriguez, E., Le Van Quyen, M., Martinerie, J., Varela, F.J.: Studying single-trials of phase-synchronous activity in the brain. Int. J. Bifur. Chaos 10, 2429–2439 (2000) 15. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, London (1986) 16. Marroquin, J.L., Mitter, S., Poggio, T.: Probabilistic solution of ill-posed problems in computational vision. J. Am. Stat. Assoc. 82, 76–89 (1987) 17. Harmony, T., Alba, A., Marroquin, J.L., Gonzalez-Frankenberger, B.: Time-Frequency-Topographic Analysis of Induced Power and Synchrony of EEG Signals During a GO/NO-GO Task. Int. Journal of Psychophysiology (in press, 2008)
Generation of Unit-Width Curve Skeletons Based on Valence Driven Spatial Median (VDSM) Tao Wang and Irene Cheng Computing Science Department, University of Alberta, Alberta, Canada {taowang,lin}@cs.ualberta.ca
Abstract. 3D medial axis (skeleton) extracted by a skeletonization algorithm is a compact representation of a 3D model. Among all connectivity-preservation skeletonization methods, 3D thinning algorithms are generally faster than the others. However, most 3D thinning algorithms cannot guarantee generating a unit-width curve skeleton, which is desirable in many applications, e.g. 3D object similarity match and retrieval. This paper presents a novel valence driven spatial median (VDSM) algorithm, which eliminates crowded regions and ensures that the output skeleton is unit-width. The proposed technique can be used to refine skeletons generated from 3D skeletonization algorithms to achieve unit-width. We tested the VDSM algorithm on 3D models with very different topologies. Experimental results demonstrate the feasibility of our approach.
1 Introduction
Skeletonization has been studied in many areas of research, including medical image processing and visualization, computer-aided design and engineering, and virtual reality [2-3]. The goal is to extract a compact representation (skeleton), which can be further processed and analyzed more efficiently than the original 3D model. Skeletonization is also known as skeletonizing or topological skeleton generation. A 2D skeleton can be extracted using medial axis transformation (MAT) [1]. MAT is achieved by "prairie fire" propagation, which considers a 2D shape as a prairie of uniform and dry grass. When a fire is lit along its border, all fire fronts advance into the prairie at the same speed. The skeleton (medial axis) of the shape is the set of points reached by more than one fire front simultaneously. In 3D space, a skeleton can be traced along the loci of the centers of all inscribed maximal spheres in a 3D model. A 3D skeleton can be composed of a set of primitive shapes, such as cones and sticks, a set of 2D surfaces (surface skeleton) or 1D line segments (curve skeleton) (Fig. 1). While the first category is useful for animations, surface and curve skeletons are more commonly adopted in applications, including animation, virtual navigation, virtual endoscope, surgical planning, estimation of volume change and mesh segmentation. In this paper, we are interested in applications which require 3D object similarity match and retrieval, as well as registration. Therefore we focus on the extraction of 3D curve skeletons.
Fig. 1. (Left) A 3D skeleton composed of primitive shapes. (Middle) A 3D box with its skeleton composed of 3D surfaces and 1D segments and (Right) the same 3D box with its curve skeleton. (Middle & right images courtesy of Cornea [2]).
There are four types of 3D skeletonization techniques discussed in the literature: thinning based algorithms [3-5, 15], general field based algorithms [6-8], Voronoi diagram based algorithms [9-11], and shock graph based algorithms [12-14]. In 3D skeleton based similarity matching type of operations, connectivity-preservation and time efficiency, especially for real time applications, are important criteria for the underlying skeletonization algorithm. In general, thinning based algorithms and Voronoi diagram based algorithms (with very dense boundary points) can preserve connectivity of the 3D models. Among the skeletonization methods, thinning based algorithms and general field based algorithms are faster than the others. For this reason, we focus on thinning based algorithm in this paper because it can preserve connectivity and it is time efficient. 1.1 Thinning Based Algorithms A 3D thinning algorithm is applied in a local neighborhood of an object point. The operation iteratively removes object points that satisfy some pre-defined deletion criteria to generate a skeleton from a 3D binary image. A 3D binary image is created by assigning a 0 or 1 value to each point in a 3D object space. Points having the value 1 are called object (black) points, while points having the 0 value are called background (white) points. Generating a 3D binary image from voxel volume data is straightforward. When generating from a 3D mesh, voxelization has to be performed as a pre-processing step [22-23]. The purpose of voxelization is to fill the void inside a 3D mesh. A thinning operation iteratively deletes object points (i.e., changes them from black to white) using some deleting masks until certain criteria are satisfied. Fig. 2 illustrates the basic idea of a 3D thinning algorithm [15]. Note that in Fig. 2 (Middle), “ • ”denotes an object point and “ o ”denotes a background point. Although 3D thinning algorithms can preserve connectivity and are time efficient compared with other skeletonization techniques, many of them [3-5, 15] fail to generate unit-width curve skeletons (i.e., the skeleton is not one-voxel thick). For instance, in Fig. 2 (Right), there are some regions in the skeleton crowded with dense object points. This is not desirable in applications [2], which require a unique joint between segments, or require a unique joint node to build a skeleton graph or topology tree.
Fig. 2. (Left) A 3D model, (Middle) an example of a deleting mask and (Right) the 3D model with its curve skeleton (thick line)
In this paper, we present a technique to generate unit-width curve skeletons. This technique follows four steps. First, the degree of each object point is computed. Second, the crowded regions are identified and the “exits” of each region are located. Third, the center of each crowded region is computed based on our proposed valence driven spatial median (VDSM) algorithm. In the last step, we apply Dijkstra’s shortest path algorithm [17-18] to connect the exits with the center in each crowded region, and remove other crowded object points that are not on any shortest path. The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 proposes our algorithm for extracting unit-width curve skeletons. Experimental results are presented in Section 4, before the work is concluded in Section 5.
2 Related Work Cornea proposed a potential field based algorithm to generate curve skeletons [2]. The idea is to extract some critical points in a force field to generate the skeleton. This algorithm has three steps. First is to compute the vector field on a 3D model. Second is to locate the critical points in the vector field, and finally the algorithm extracts the curve skeleton following a force directed approach. However, connectivity of the critical points is not guaranteed (Fig. 3).
Fig. 3. 3D models with disconnected skeletons (images courtesy of Cornea [2])
Voronoi diagram based algorithms assign the surface points on a 3D object as generating points and apply an incremental construction method to approximate the skeleton using Voronoi diagram computation [9-10, 25]. When the surface points are sufficiently dense, this technique can generate a satisfactory skeleton. However, it
may fail when the surface points are sparse. Moreover, computing the Voronoi diagram is time-consuming. Brunner et al. [16] use an iterative algorithm to merge junction knots, which generate the minimal cost, to create unit-width curve skeletons (Fig. 4). However, this algorithm only works on a non-equilateral 3D grid, but not on 3D binary images (equilateral 3D grid) or 3D meshes (non-grid). Also, this algorithm is expensive because it needs to compute the different merging options. For example, if there are E junction knots and V edges between them, the complexity of this algorithm is O(E × V³).
Fig. 4. The left graph shows the junction knots in the curve skeleton. In the right graph, these junction knots are merged to a single junction knot to create a unit-width curve skeleton [16].
Sundar et al. [19] use a clustering method to reduce the number of points on a curve skeleton. A representative point is selected to replace a cluster of points that are within a distance of Dthreshold. This algorithm has two drawbacks. First, different clusters need different Dthreshold values. How to determine different Dthreshold values for different clusters remains unresolved. Second, a cluster is not fully connected in most cases. Therefore, this algorithm may disconnect the skeleton by choosing a point that is not connected with other points. Wang and Lee presented a curve skeleton extraction algorithm [28]. Their technique consists of three steps. First, it uses iterative least square optimization to shrink and simplify a 3D model. Then, it extracts the curve skeleton through a thinning algorithm. In the last step, a pruning approach is used to remove unnecessary branches based on shrinking ratios. However, this method requires many free parameters. In addition, it often generates skeletons which deviate from the center of the model. Svenssona et al. [26] propose an algorithm to extract unit-width curve skeletons from skeletons generated with 3D thinning. However, their method requires a really thin skeleton (i.e., at most two-voxel thick) as input. Many 3D thinning techniques [3-5] can be used to generate curve skeletons. However, the skeletons generated are not guaranteed to be unit-width. In the next section, we will introduce our VDSM algorithm, which can generate a unit-width curve skeleton without requiring the input of a close to unit width skeleton and without the need to compute a distance threshold for each crowded region.
3 The Proposed Algorithm In this section, we describe our algorithm for extracting unit-width curve skeletons. This algorithm has five main characteristics. 1) It works on 3D binary images, and 3D meshes by performing voxelization [22-23] as a pre-processing step. 2) It preserves connectivity. 3) It does not need control parameters, e.g. thresholds. 4) The input skeleton can be more than two voxels thick. 5) It is time efficient.
3.1 Definitions Given a non unit-width curve skeleton extracted from a 3D binary image
B(x, y, z) = {δ(x, y, z)}, where

\delta(x, y, z) = \begin{cases} 1 & \text{if } (x, y, z) \text{ is an object point,} \\ 0 & \text{otherwise,} \end{cases}

we define a unit-width curve skeleton using the following definitions.
Definition 1: The 26-neighborhood of an object point p defines a 3x3x3 grid centered at p (Fig. 2 Middle).
Definition 2: Let q be another object point. The Euclidean distance between p and q is defined as dis = |p − q|. Consider each edge of the cube to be 1 unit. p and q are 26-connected if dis ≤ √3.
Definition 3: The degree of an object point p is denoted by D(p), and is defined as the number of object points in p's 26-neighborhood, excluding p itself.
Definition 4: An object point p is called an end point if D(p) = 1.
Definition 5: An object point p is called a middle point of p1 and p2 if D(p) = 2 (p1 and p2 are not 26-connected).
Definition 6: An object point p is called a joint point of p1, p2, …, pn if
1) D(p) = n (n > 2, and p1, p2, ..., pn are 26-connected with p), and
2) p1, p2, …, pn are either end points or middle points.
Definition 7: An object point p is called a crowded point if it is neither an end point, a middle point, nor a joint point.
Definition 8: A crowded region is a set of 26-connected crowded points.
Definition 9: An object point is called an exit if it is an end point or a middle point, and is 26-connected to one object point of a crowded region.
Definition 10: A skeleton is called a unit-width curve skeleton if it has no crowded regions.
3.2 Valence Computation
The VDSM algorithm first computes the degree (valence) of each object point in the curve skeleton by counting the object points in the 26-connected neighborhood. 3.3 Crowded Regions and Exits
The next step is to mark the end points, middle points, joint points, and crowded points. Adjacent crowded points are consolidated into crowded regions. In each crowded region, exits are located and marked.
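A sketch of the valence computation and point classification of Definitions 3-7, assuming the skeleton is stored as a 3D 0/1 NumPy array with a one-voxel border of background; the helper names are illustrative.

import numpy as np
from itertools import product

def neighbors26(p):
    # The 26-neighborhood of a voxel p (Definition 1).
    x, y, z = p
    return [(x + dx, y + dy, z + dz)
            for dx, dy, dz in product((-1, 0, 1), repeat=3) if (dx, dy, dz) != (0, 0, 0)]

def classify_points(skel):
    pts = set(map(tuple, np.argwhere(skel)))
    deg = {p: sum(q in pts for q in neighbors26(p)) for p in pts}   # Definition 3
    label = {}
    for p in pts:                                   # end and middle points (Definitions 4-5)
        nbs = [q for q in neighbors26(p) if q in pts]
        if deg[p] == 1:
            label[p] = 'end'
        elif deg[p] == 2 and nbs[1] not in neighbors26(nbs[0]):
            label[p] = 'middle'
    for p in pts:                                   # joint points need the neighbors' labels (Definition 6)
        if p not in label and deg[p] > 2:
            if all(label.get(q) in ('end', 'middle') for q in neighbors26(p) if q in pts):
                label[p] = 'joint'
    for p in pts:                                   # everything else is a crowded point (Definition 7)
        label.setdefault(p, 'crowded')
    return label, deg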
3.4 Valence Driven Spatial Median (VDSM) Algorithm
In this step, the center of each crowded region is computed. Once the center is obtained, we can connect it with all the exits in that region, and remove the object points that are not on the path between an exit and the center. For a given crowded region R with n object points {P1, P2, …, Pn}, the center can be computed as the arithmetic mean of the n points. However, the center defined by arithmetic mean can be outside a given region if the n object points form concave boundaries (Fig. 5 (Left)). Masse et al. discussed other location estimators, e.g. Tukey median, Liu median, Oja median, depth-based trimmed mean, coordinate median, and spatial median [27], and found that spatial median gave the best estimation. The spatial median is given by:

\arg\min_{p} \left( \sum_{i=1}^{n} |p - P_i| / n \right)   (1)
where |p − P_i| is the Euclidean distance. In this case, only the n object points inside a crowded region are candidates. However, spatial median may not be able to locate the center for some shapes (Fig. 5 (Middle)). We propose a Valence Driven Spatial Median (VDSM) algorithm to compute the center of a crowded region. The center is given by:

\arg\min_{p} \left( \sum_{i=1}^{n} \frac{1}{D(p)} |p - P_i| / n \right)   (2)

where D(p) is the degree of the point p. 1/D(p) assigns penalties to boundary points
which have smaller degrees than inside points. As a result, the derived center is attracted towards the middle of the region (Fig. 5 (Right)). In our implementation, we modify Equation (2) to Equation (3) to eliminate the computational cost in performing the square root and division operations:

\arg\min_{p} \left( \sum_{i=1}^{n} \frac{1}{D(p)} |p - P_i|^2 \right)   (3)
Fig. 5. The red (gray in B&W) point denotes the “center” of a crowded region. From left to right, the locations of center defined by arithmetic mean, spatial median and VDSM are shown respectively.
3.5 Unit-Width Curve Skeleton
In the last step, we apply Dijkstra's shortest path algorithm [17-18] to connect the exits with the center computed in the previous step. We remove other object points in the crowded region that are not in the paths connecting the exits and the center. The outcome is a unit-width curve skeleton. The pseudo code of our algorithm is as follows:

Input: non-unit-width curve skeleton I
Output: unit-width curve skeleton O
Algorithm Generating_Unit_Width_Curve_Skeleton (I)
  Initialization: Initialize output O and copy I to O.
  Valence computation: Calculate the degree of each object point on the skeleton O.
  Points classification: Mark end points, middle points, joint points, and crowded points. If there is no crowded point, output O and finish.
  Crowded region location: Organize crowded points into crowded regions.
  Exit location: Find all exits in each crowded region.
  Center determination: Determine the center point in each crowded region using the VDSM algorithm.
  Shortest path computation: Apply Dijkstra's shortest path algorithm to find the shortest path between the center and each exit. Remove the object points that are not on the shortest paths.
  Output: Output the unit-width curve skeleton O.
End algorithm
If there are E crowded points and V edges between them, the complexity of our method is O(E + V²) [17-18] based on Dijkstra's shortest path algorithm. Our VDSM algorithm inherits the characteristics of Dijkstra's algorithm: the result is unique, connected, and contains no cycles. Our technique also guarantees the generated skeleton to be unit-width. Fig. 6 shows an example of unit-width curve skeleton generation.
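A sketch of the final pruning step: Dijkstra's algorithm is run over the voxels of a crowded region (plus its exits) from the VDSM center, and only voxels lying on a shortest path to some exit are kept. The 26-connected voxel graph with Euclidean edge weights and the helper names are assumptions consistent with the description above.

import heapq
import math
from itertools import product

def prune_crowded_region(region, exits, center):
    # region: set of crowded voxels; exits: voxels adjacent to the region; center: VDSM center.
    def nbrs(p):
        x, y, z = p
        return [(x + dx, y + dy, z + dz)
                for dx, dy, dz in product((-1, 0, 1), repeat=3) if (dx, dy, dz) != (0, 0, 0)]
    nodes = set(region) | set(exits)
    dist, prev = {center: 0.0}, {}
    heap = [(0.0, center)]
    while heap:                                     # Dijkstra from the center
        du, u = heapq.heappop(heap)
        if du > dist.get(u, math.inf):
            continue
        for v in nbrs(u):
            if v in nodes:
                dv = du + math.dist(u, v)
                if dv < dist.get(v, math.inf):
                    dist[v], prev[v] = dv, u
                    heapq.heappush(heap, (dv, v))
    keep = {center}
    for e in exits:                                 # walk each shortest path back to the center
        p = e
        while p in prev:
            keep.add(p)
            p = prev[p]
    return keep & set(region)                       # crowded voxels to retain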
Fig. 6. (a) Non-unit-width curve skeleton (b) a crowded region (c) two exits of the crowded region and (d) the constructed shortest path
4 Experimental Results We downloaded 1800 3D models from the Mesh Compendium [20] and the Princeton Shape Benchmark [21]; and used a selection to validate the effectiveness of our
Fig. 7. Examples of crowded region
Fig. 8. Examples of unit-width curve skeletons generated with our VDSM algorithm
algorithm. Surface meshes were converted to 3D binary images with binvox [22-23]. The input skeletons (non-unit-width) were extracted by a 3D thinning algorithm [15]. More examples of crowded regions are shown in Fig. 7, and some examples of unit-width curve skeletons generated with our algorithm are shown in Fig. 8.
5 Conclusion and Future Research In this paper, we present a Valence Driven Spatial Median (VDSM) algorithm to generate unit-width curve skeletons from non-unit-width skeletons. Locating the center of a crowded region accurately is essential for generating a compact skeleton representation of a 3D model. Although the spatial median location estimator out-performs other frequently used locators, it may not fall at the centre of a region. We propose the Valence Driven Spatial Median (VDSM) algorithm to compute the center of a crowded region, and apply Dijkstra’s shortest path algorithm to generate a unit-width curve to replace the crowded region. This algorithm can be used in conjunction with another skeletonization algorithm, to ensure that the output skeleton is unit-width. We have already used this algorithm to generate unit-width curve skeletons, for constructing topology graphs and chain codes. Since skeletonization algorithms are sensitive to noise, in future work, we will incorporate a pre-processing step for minimizing the effect of noise on the generated unit-width skeleton.
References 1. Blum, H.: A transformation for extracting new descriptors of shape. In: Models for the Perception of Speech and Visual Form, pp. 362–380. MIT Press, Cambridge (1967) 2. Cornea, N.D.: Curve-Skeletons: Properties, Computation And Applications, Ph.D. Thesis, The State University of New Jersey (May 2007) 3. Ma, C.M., Sonka, M.: A fully parallel 3D thinning algorithm and its applications. Computer Vision and Image Understanding 64(3), 420–433 (1996) 4. Palagyi, K., Kuba, A.: A 3D 6-subiteration thinning algorithm for extracting medial lines. Pattern Recognition Letters 19(7), 613–627 (1998) 5. Lohoua, C., Bertrand, G.: A 3D 6-subiteration curve thinning algorithm based on P-simple points. Discrete Applied Mathematics 151, 198–228 (2005) 6. Pudney, C.: Distance-Ordered Homotopic Thinning: A Skeletonization Algorithm for 3D Digital Images. Computer Vision and Image Understanding 72(3), 404–413 (1998) 7. Rosenfeld, A., Kak, A.C.: Digital Picture Processing. Academic Press, New York (1982) 8. Arcelli, C., di Baja, G.S.: A width independent fast thinning algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 7, 463–474 (1985) 9. Ogniewicz, R.L., Ilg, M.: Voronoi Skeletons Theory and Applications. CVPR 1992 (1992) 10. Ogniewicz, R.L., Kubler, O.: Hierarchic Voronoi Skeletons. Pat. Rec., 343–359 (1995) 11. Sherbrooke, E.C., Patrikalakis, N.M., Brisson, E.: An algorithm for the medial axis transform of 3d polyhedral solids. IEEE T. VCG 2(1), 44–61 (1996) 12. Giblin, P., Kimia, B.B.: A formal classification of 3D medial axis points and their local geometry. In: CVPR 2000 (2000) 13. Leymarie, F.F., Kimia, B.B.: The Shock Scaffold for Representing 3D Shape. In: Arcelli, C., Cordella, L.P., Sanniti di Baja, G. (eds.) IWVF 2001. LNCS, vol. 2059, pp. 216–229. Springer, Heidelberg (2001) 14. Leymarie, F.F., Kimia, B.B.: Computation of the Shock Scaffold for Unorganized Point Clouds in 3D. CVPR 2003 (2003) 15. Wang, T., Basu, A.: A note on A fully parallel 3D thinning algorithm and its applications. Pattern Recognition Letters 28(4), 501–506 (2007)
16. Brunner, D., Brunnett, G.: An extended concept of voxel neighborhoods for correct thinning in mesh segmentation. In: Spring Conference on Computer Graphics, pp. 119–125 (2005) 17. Dijkstra, E.W.: A note on two problems in connexion with graphs. In: Numerische Mathematik, pp. 269–271 (1959) 18. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn., pp. 595–601. MIT Press and McGraw-Hill (2001) 19. Sundar, H., Silver, D., Gagvani, N., Dickinson, S.: Skeleton Based Shape Matching and Retrieval. In: Shape Modeling International 2003, pp. 130–142 (2003) 20. http://www.cs.caltech.edu/~njlitke/meshes/toc.html (retrieved in April 2008) 21. Shilane, P., Min, P., Kazhdan, M., Funkhouser, T.: The Princeton Shape Benchmark, Shape Modeling International, Genova, Italy (June 2004) 22. http://www.cs.princeton.edu/~min/binvox/ (retrieved in April 2008) 23. Nooruddin, F., Turk, G.: Simplification and Repair of Polygonal Models Using Volumetric Techniques. IEEE Trans. on VCG 9(2), 191–205 (2003) 24. Attali, D., Montanvert, A.: Computing and Simplifying 2D and 3D Continuous Skeletons. Computer Vision And Image Understanding 67(3), 261–273 (1997) 25. Svenssona, S., di Bajab, G.S.: Simplifying curve skeletons in volume images. Computer Vision and Image Understanding 90, 242–257 (2003) 26. Masse, J.C., Plante, J.F.: A Monte Carlo study of the accuracy and robustness of ten bivariate location estimators. Comput. Statistics & Data Analysis 42, 1–26 (2003) 27. Wang, Y.S., Lee, T.Y.: Curve-Skeleton Extraction Using Iterative Least Squares Optimization. IEEE T. VCG 14(4), 926–936 (2008)
Intuitive Visualization and Querying of Cell Motion Richard Souvenir, Jerrod P. Kraftchick, and Min C. Shin Department of Computer Science, The University of North Carolina at Charlotte
Abstract. Current approaches to cell motion analysis rely on cell tracking. In certain cases, the trajectories of each cell is not as informative as a representation of the overall motion in the scene. In this paper, we extend a cell motion descriptor and provide methods for the intuitive visualization and querying of cell motion. Our approach allows for searches of scale- and rotation-invariant motion signatures, and we develop a desktop application that researchers can use to query biomedical video quickly and efficiently. We demonstrate this application on synthetic video sets and in vivo microscopy video of cells in a mouse liver.
1 Introduction
Characterizing the motion of cells in tissue is a significant problem in biomedical research. There is increasing evidence that in some cases these motion patterns can serve as discriminative features used to indicate, for example, insufficient blood flow, blockages, or even the presence of a tumor. Vision-based solutions typically rely on detecting and tracking the position of individual cells over time. The motion tracks of all of the cells could provide the raw data necessary to answer questions about the patterns of cell motion. For most problems, manual tracking is impractical due to the high number of cells in a single scene. This has led to a number of automated approaches to cell tracking that are successful in tracking a large number of cells in a variety of imaging conditions. The literature on object tracking is vast [1], and many approaches have been explored specifically for cell tracking [2,3,4]. Figure 1 shows an example of the results of cell tracking overlaid on an intravital microscopy image. While these types of results provide some information about the motion of the cells, the representation is cluttered and querying for specific motion patterns is not straightforward. Moreover, for an important class of problems in biomedical research, the set of individual tracks is not of primary interest. For these problems, characterizing the motion patterns of cells in a given area or searching for specific cell motion patterns is key.

Fig. 1. Tracking results from cells in liver vessels. In this paper, we provide methods to intuitively visualize and query this type of data.
In this paper, we describe an application we develop for searching for rotationand scale-invariant cell motion patterns in biomedical video. We also develop a desktop application that researchers can use to query biomedical video quickly and efficiently. The remainder of the paper is structured as follows. In Section 2, we describe the underlying descriptor for motion analysis. In Section 3, we present our method for querying video and discuss the design of an intuitive desktop application, which we present in Section 4. Section 5 shows the results of searching for specific motion patterns on synthetic video and in vivo microscopy video of cells in a mouse liver. Conclusions are presented in Section 6.
2 Motion Representation
Our motion model is known as the Radial Flow Transform (RFT) [5]. The RFT provides estimates of motion at each pixel location, without explicitly tracking individual cells. Unlike related methods, the RFT does not learn a single direction of motion at each location, but provides a construct capable of being queried for desired motions in a region, including complex motion patterns where the objects may not only travel in a single direction at a given location. The RFT can be used for two problems in motion analysis: (1) detection of specific motion patterns in a scene and (2) visualization of motion in specific regions of a scene. The RFT can be used in two general cases: (1) as a post-processing step to automated cell tracking (e.g., Figure 1) or (2) to generalize the sequential (possibly noisy) detection of objects of interest into flow estimates. Here, we describe the RFT given the results of a detection process. In this case, the input is a classification map, C, where C_t(x) ∈ [0, 1] is the detection confidence of some object at pixel location x at time t. We model the distribution of the radial flow at a pixel as a function of the direction. So, for each pixel location, x, we estimate the radial flow R(x, θ_i) in direction θ_i ∈ Θ, where Θ is a set of (generally evenly-spaced) angles. The RFT is designed for situations where the input is noisy and sparse; this binning step, which decreases the granularity of output, allows us to accumulate motion estimates in a more robust way. In our experiments, we determined empirically that more than 8 evenly-spaced representative directions (0°, 45°, etc.) tends to result in many empty and low-count bins in the radial histogram. Figure 2 shows a close-up visualization of the radial flow for two image patches: (a) mostly left and right motion, and (b) a "converging from all directions" motion.

Fig. 2. Examples of radial flow. (a) left and right motion and (b) "converging from all directions" motion. Each cell represents a pixel location and the vectors describe the dominant direction(s) of motion.
2.1 Flow Estimation
In this section, we describe how to estimate the flow Ft (x) at each time t and how these values are accumulated to estimate R(x, θi ). Ft can be calculated using both sequential classification results and tracking results. Here, we describe the procedure using iterative classification and describe the necessary changes that should be made when full tracking results are available.
Fig. 3. Estimating flow for objects in sequential frames. The intensity of the image in (c) represents the Euclidean distance transform for objects in the current frame (b). The tail of the arrow represents a detected object in the previous frame (a) and the direction of flow is calculated using the nearest-neighbor assumption.
The detection results represented by a confidence map, C, are used to estimate visual flow. Figure 3 depicts the estimate of visual flow for two consecutive frames from a synthetic data set. To estimate the flow without explicitly tracking each cell, the nearest-neighbor assumption is applied and the distance from each pixel in frame t − 1 to the nearest positively classified pixel in frame t is calculated:

D_t(x) = \min_{x' \in C_t^+} \|x - x'\|_2,   (1)
where Ct+ is the set of locations positively classified at time t. Figure 3 shows the Euclidean distance transform for the positively classified regions in frame t and the locations of the positively classified regions from frame t − 1 marked at the tail of the arrows, which point in the direction of flow. The flow is estimated at regions of the image containing positively classified pixels at time t − 1 and all of the pixels along the path from a classified pixel in frame t − 1 to the nearest pixel in frame t. For each of these pixels, x, the estimate of the flow at time t is: Ft (x) = ∇Dt (x) · Dt (x).
(2)
Similarly, when Ft is calculated using tracking results (e.g., Figure 1), the direction and magnitude of motion can be obtained for those pixel locations with non-zero motion from frame t − 1 to t.
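A rough transcription of this step in Python, using SciPy's Euclidean distance transform; for brevity the flow is only filled in at the previous-frame detections (not along the whole nearest-neighbor path), and the gradient term follows Eq. (2) literally, so its sign may need flipping depending on the chosen convention for D.

import numpy as np
from scipy.ndimage import distance_transform_edt

def estimate_flow(prev_mask, curr_mask):
    # prev_mask, curr_mask: boolean detection maps for frames t-1 and t.
    D = distance_transform_edt(~curr_mask)          # Eq. (1): distance to nearest detection in frame t
    gy, gx = np.gradient(D)
    Fy, Fx = gy * D, gx * D                         # Eq. (2): F_t(x) = grad(D_t(x)) * D_t(x)
    Fy = np.where(prev_mask, Fy, 0.0)
    Fx = np.where(prev_mask, Fx, 0.0)
    return Fy, Fx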
2.2 Accumulating Flow Estimates
The flow estimates, F, are accumulated to provide a compact representation of the flow at each location in the image:

R(x, \theta_i) = \sum_{t} \Psi(F_t(x), \Theta, \theta_i)\, \Phi(C_t(x)),   (3)
where Ψ(v, Θ, θ) is an indicator function which returns 1 if the dot product between v and the unit vector ⟨cos θ, sin θ⟩ is the highest over all the angles in Θ. This function finds the correct binning of the motion estimate, F. Φ represents an application-dependent weighting function, which can be used to incorporate prior information on the reliability of the flow estimates. In our experiments, we have empirically determined that Φ(C_t(x)) = 1 produces reasonable results and is robust to noisy flow measurements. Figure 4 shows (a) a keyframe from a biomedical video, (b) the cell tracks from the entire video calculated using [6], and zoomed-in portions of (c) the tracking results and (d) the corresponding RFT.
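As a concrete illustration of Equation (3) with Φ = 1, the sketch below bins each per-frame flow vector into the nearest of a small set of evenly spaced directions and accumulates counts per pixel. The array layout and function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def accumulate_rft(flows, n_dirs=8):
    """Sketch of Eq. (3) with Phi(C_t(x)) = 1. `flows` is a list of HxWx2
    arrays F_t, equal to zero wherever no flow estimate exists."""
    H, W = flows[0].shape[:2]
    thetas = np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False)
    units = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)   # (n_dirs, 2) unit vectors
    R = np.zeros((H, W, n_dirs))
    for F in flows:
        valid = np.linalg.norm(F, axis=2) > 0
        # Psi: index of the direction whose unit vector has the largest dot product with F_t(x)
        dots = np.tensordot(F, units.T, axes=([2], [0]))         # (H, W, n_dirs)
        best = np.argmax(dots, axis=2)
        R[valid, best[valid]] += 1.0
    return R
```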
Fig. 4. These images depict (a) a keyframe from a biomedical video, (b) the cell tracks from the entire video, and zoomed-in portions of (c) the tracking results and (d) corresponding RFT
The RFT can be used by microcirculation researchers to quickly understand the flow patterns at a given location. Here, we compare the output of the RFT to traditional optical flow. Figure 5 shows visualization results on red blood cells (RBC), which generally move in the direction implied by the vessels, and NKT cells, which move more erratically. For each cell type, we show a section of the entire field, an overlay of the ground-truth motion of the cells, optical flow results, and the RFT visualization. To calculate optical flow, we use the Lucas-Kanade method [7] between sequential frames. We report the mean optical flow, counting only pixel locations where the optical flow magnitude was greater than 0. These results demonstrate the power of the RFT to describe multiple dominant directions of motion in a visually intuitive way. The RFT is related to methods outside of biomedical image analysis, such as road extraction from aerial video, where the goal is to find coherent paths
Fig. 5. Visualization of radial flow for biomedical image analysis. For both red blood cells (RBC) and NKT cells, the figure shows an example frame, a motion overlay, traditional optical flow, and the radial flow visualization.
of motion. In [8], the authors use contextual information, such as the detection of vehicles, to detect the position and dominant direction of roads. In [9], the authors model the spatio-temporal image derivatives and define roads as connected paths of consistent motion. Also, the idea of using histograms of local patches of motion to find motion patterns has been used for gesture recognition [10]. These problem domains, however, differ significantly from cell motion, since we can assume neither consistent motion nor a global underlying pattern.
3 Queries Using the RFT
In addition to its use for visualizing motion patterns, the RFT can also be used for queries of motion patterns. In [5], RFT filters are used as queries; the filters are analogous to image filters and are applied similarly to find correlated patterns of motion. Two limitations of this approach are that the filter must closely match the desired motion pattern and that the query is neither scale- nor rotation-invariant. Here, we introduce our novel approach for querying using the RFT, which takes advantage of the Generalized Hough Transform [11]. This method is more robust to inexact matches and is scale- and rotation-invariant. In addition, we use a multi-resolution approach for matching to find patterns at varying scales in the input. We employ an l_w × l_h image window to describe the motion pattern of interest. For each location x_s in this window, we define S(x_s, θ_i) = 1 if searching for a motion pattern with flow in direction θ_i at x_s and S(x_s, θ_i) = 0 otherwise. Figure 6 shows a visualization of a 5 × 5 window for the sentry motion pattern – a convergence followed by a divergence from some central location.

Fig. 6. Sentry motion pattern
Fig. 7. Illustration of the query process using the Generalized Hough Transform. The query motion pattern in (a) is converted to a "ballot" (shown in (b)) where a flow measurement casts votes for the center and orientation of the pattern (star) relative to the position and direction of the flow measurement. (c) shows a sample artificial RFT and (d) shows the maximum vote (over possible orientations) in the accumulator at each pixel location. In this toy example, the highest score is at the center of a matching (rotated 90°) motion pattern.
For the matching process, we use a method similar to the Generalized Hough Transform [11]. Consider the query image window as a bounded shape (an l_w × l_h rectangle, in this case) and each desired flow measurement as a "feature" of this shape. For each measurement in the window, we store the position d_C and orientation φ_C relative to some reference point and orientation, θ_ref, of the image window. For convenience, we choose the center of the image window and θ_ref = 0°. We maintain the position and rotation offset for each flow measurement in the query. In order to find the query pattern in the RFT, R, we define an accumulator, A, which is the same size as R, to keep track of the possible location and orientation of putative matches to the query. Each non-zero measurement in R casts votes into the accumulator for the possible locations and orientations of the query pattern using the list of offsets. After each measurement in the RFT is examined, we can visualize the likelihood of a matching pattern at each location and orientation in the accumulator, or, as is typically done, select the bin with the highest total as the best match to the query. Figure 7 illustrates the procedure for a very simple example. The result is a map of the accumulated votes where the highest scores indicate the presence of the query pattern. To search for patterns of varying scale, we employ a multi-resolution approach where the RFT is successively down-sampled by a scale factor, α, and this matching method is applied. At lower scales, the votes in the accumulator are scaled by 1/α.
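To make the voting step concrete, the sketch below implements a simplified, translation-only version of the Generalized Hough voting over an RFT array: every non-zero measurement votes for candidate pattern centers using the query's offset table. The rotation and scale hypotheses described above would add additional accumulator dimensions and loops; all names and the array layout are our assumptions.

```python
import numpy as np

def ght_match_translation(R, S):
    """Translation-only sketch of the GHT-style query of Sec. 3.
    R: (H, W, n_dirs) radial flow transform; S: (h, w, n_dirs) binary query pattern."""
    H, W, n_dirs = R.shape
    h, w, _ = S.shape
    cy, cx = h // 2, w // 2                                  # reference point: window center
    # Offset table: for every desired flow in S, the displacement from it to the center
    offsets = [(cy - y, cx - x, d) for y in range(h) for x in range(w)
               for d in range(n_dirs) if S[y, x, d]]
    A = np.zeros((H, W))                                     # accumulator over candidate centers
    for y, x, d in zip(*np.nonzero(R)):                      # every non-zero RFT measurement
        for oy, ox, od in offsets:
            if od != d:                                      # vote only where the direction matches
                continue
            vy, vx = y + oy, x + ox
            if 0 <= vy < H and 0 <= vx < W:
                A[vy, vx] += R[y, x, d]
    return A                                                 # peaks mark likely pattern centers
```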
4 RFT Application
We provide an intuitive interface for the design of RFT queries. The program allows the user to specify the size of the query motion pattern and the input video to search. To describe the motion pattern, the user “draws” motion arrows from the desired location using the mouse interface. The application then converts between the input vector field and the RFT-based representation.
The RFT provides a straightforward conversion to and from a vector flow field descriptor. Let V(x) represent a set of vectors for each location x. Then,

S(x_s, θ_i) = 1 if ∃ v ∈ V(x) : Ψ(v, Θ, θ_i), and 0 otherwise    (4)

where Ψ(v, Θ, θ) is the "angle binning" indicator function previously described. To visualize the RFT (as in Figure 5), we need the inverse computation. Fortunately, the conversion from an RFT to a vector flow representation is also straightforward:

V(x) = {⟨cos θ_i, sin θ_i⟩ | S(x, θ_i) > τ}    (5)

where τ is a threshold parameter that can be adjusted to deal with noisy measurements. In our results, τ = 0. The RFT can handle multiple directions of flow at a single location, so each location maintains a set of direction vectors, V, which can be visualized (Figure 5).
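The two conversions in Equations (4)–(5) amount to binning drawn vectors into direction bins and emitting one unit vector per above-threshold bin. The sketch below is one possible reading of these equations; the data structure (a dictionary of arrows per pixel) and the helper names are illustrative assumptions, not the authors' interface.

```python
import numpy as np

def nearest_bin(v, units):
    """Psi: index of the direction bin whose unit vector best matches v."""
    return int(np.argmax(units @ v))

def vectors_to_query(V, shape, n_dirs=8):
    """Sketch of Eq. (4): convert user-drawn arrows (V: dict of (y, x) -> list of
    2D vectors) into a binary query S over n_dirs direction bins; shape = (H, W)."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False)
    units = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
    S = np.zeros(shape + (n_dirs,), dtype=np.uint8)
    for (y, x), vecs in V.items():
        for v in vecs:
            S[y, x, nearest_bin(np.asarray(v, float), units)] = 1
    return S

def rft_to_vectors(R, n_dirs=8, tau=0.0):
    """Sketch of Eq. (5): for visualization, keep one unit vector per direction
    bin whose accumulated value exceeds tau."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_dirs, endpoint=False)
    V = {}
    for y, x, d in zip(*np.nonzero(R > tau)):
        V.setdefault((y, x), []).append((np.cos(thetas[d]), np.sin(thetas[d])))
    return V
```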
5 Results
In this section, we discuss results of performing queries using the Radial Flow Transform on simulated data and real biomedical data.

5.1 Comparison to RFT Filters
In [5], RFT filters are used as query input; the filters are analogous to image filters and applied similarly to find correlated patterns of motion. Using a toy example, we show how our method finds scaled and rotated versions of the query, which RFT filters are unable to detect. Figure 8 shows (a) a synthetic RFT containing 4 versions (scaled and rotated) of a motion signature, (b) the query used for both methods, (c) the results using the RFT filter, and (d) the results using our matching method. The higher intensity represents a closer match to the template. Note that our method discovered all 4 instances of the pattern while the approach using RFT filters only discovered the exact match.

5.2 Simulated Data
We generated synthetic image sequences by randomly placing n_c cells and n_t interest points in the image field. The interest points were static and each induced a specified behavior in all nearby cells. Each cell is assigned a general direction, v, which is corrupted by Gaussian noise at each time step. To simulate a hypothesized motion pattern, the interest points attracted cells within a certain radius. (For our tests, this value was roughly 20% of the width of the image.) Figure 9 shows results on simulated data. For each run, we generated a set of paths to mimic vessels and introduced "cells" traveling in random directions within those paths. Figure 9(a) shows a set of paths in a 500 × 500 pixel area
Fig. 8. These images depict (a) a synthetic RFT containing 4 versions (scaled and rotated) of a motion signature, (b) the query used for both methods, (c) the results using the RFT filter, and (d) the results using our method. The higher intensity represents a closer match to the template. Our method discovered all 4 instances of the pattern.
Fig. 9. (a) Diagram of synthetic data cell paths with the interest point marked (b) Pattern match score for the RFT query motion described in the inset (c) ROC curves for 6 similar experiments
with 50 frames, 15 cells, and 1 interest point with an influence radius of 100. The query was the 3 × 3 "converging" pattern shown in the inset of (b). Figure 9(b) shows the results on this synthetic data set. The intensity of the image represents the pattern match score. Note that the motion represented in the query is horizontal, but the vessel around the interest point is not. We conducted similar experiments with varying paths, numbers of cells, and interest points. ROC curves for the pattern match scores are shown in (c). The average area under the curve (AUC) over all experiments was 0.99.
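For readers who want to reproduce a comparable toy setup, the sketch below generates random cell positions that drift with noisy headings and are attracted toward a static interest point within its influence radius. The parameter defaults mirror the experiment described above, but the vessel-path construction and the exact attraction strength are our assumptions, not the authors' generator.

```python
import numpy as np

def simulate_cells(n_frames=50, n_cells=15, field=500, interest=(250, 250),
                   radius=100, speed=3.0, noise=0.5, rng=None):
    """Toy sketch of the simulated data of Sec. 5.2: drifting cells with
    Gaussian-perturbed motion, attracted toward one static interest point."""
    rng = rng or np.random.default_rng()
    pos = rng.uniform(0, field, size=(n_cells, 2))
    heading = rng.uniform(0, 2 * np.pi, size=n_cells)
    vel = speed * np.stack([np.cos(heading), np.sin(heading)], axis=1)
    frames = []
    for _ in range(n_frames):
        to_ip = np.asarray(interest, float) - pos
        dist = np.linalg.norm(to_ip, axis=1, keepdims=True)
        near = (dist < radius).astype(float)                 # inside the influence radius
        step = vel + noise * rng.normal(size=pos.shape)
        step = step + near * speed * to_ip / np.maximum(dist, 1e-6)   # attraction (assumed strength)
        pos = np.clip(pos + step, 0, field - 1)
        frames.append(pos.copy())
    return frames   # list of (n_cells, 2) cell positions per frame
```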
5.3 NKT Motion Analysis
There is ongoing research on the motion patterns of NKT cells in the liver tissue. Some researchers believe that the motion patterns of these cells change in the presence of tumors. We present our tool as a method to aid in this research, and show the results of applying RFT to real biomedical data. Note that we are highlighting the ability to detect and characterize motion patterns in this type of data, but not necessarily to detect tumors.
Fig. 10. NKT Results. (a) shows a keyframe of NKT data and (b) shows the cell tracks. The boxes mark areas in which a sentry pattern was observed. (c) shows the output of querying the RFT of this video using the pattern depicted in the inset. The highest-scoring locations match the regions indicated in (b).
This data was collected using an Olympus IX70 inverted fluorescence microscope. The livers of cxcr6^(gfp/+) mice were imaged and the NKT cells are made visible using a green fluorescent protein (GFP) filter. The liver surface is imaged every minute for one hour at 10x magnification. Using these video frames, we generated classification maps of putative cell locations using a supervised machine learning approach [12], in a manner similar to [13]. Figure 10 shows a 1000 × 1000 pixel sample frame of NKT data. Tracking results from this sequence are shown in (b) and the locations of observed patterns are marked. The 3 × 3 query is shown in Figure 10(c, inset). The pattern match results are shown in Figure 10(c). The highest-scoring regions in the output matched the areas where sentry patterns were observed in the video.
6 Conclusions
In this paper, we extended the Radial Flow Transform to support more general motion pattern queries on biomedical video. We incorporated this into an application for querying cell motion patterns which, in conjunction with classic cell tracking, can be used in a variety of biomedical image analysis experiments.
Acknowledgements. We would like to thank our collaborators in the UNCC Department of Biology and Carolinas Medical Center for providing data and guidance, Christopher Sinclair for developing the user interface, and the reviewers for helpful comments. This work was partially funded by NIH 5R21GM077501.
References
1. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Comput. Surv. 38(4), 13 (2006)
2. Mukherjee, D., Ray, N., Acton, S.: Level set analysis for leukocyte detection and tracking. IEEE Trans. on Image Processing 13(4), 562–572 (2004)
3. Debeir, O., Van Ham, P., Kiss, R., Decaestecker, C.: Tracking of migrating cells under phase-contrast video microscopy with combined mean-shift processes. IEEE Trans. on Medical Imaging 24(6), 697–711 (2005)
4. Li, K., Chen, M., Kanade, T.: Cell population tracking and lineage construction with spatiotemporal context. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 295–302. Springer, Heidelberg (2007)
5. Souvenir, R., Kraftchick, J., Lee, S., Clemens, M., Shin, M.: Cell motion analysis without explicit tracking. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–7 (June 2008)
6. Sbalzarini, I.F., Koumoutsakos, P.: Feature point tracking and trajectory analysis for video imaging in cell biology. Journal of Structural Biology 151(2), 182–195 (2005)
7. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI 1981), pp. 674–679 (1981)
8. Hinz, S., Baumgartner, A.: Road extraction in urban areas supported by context objects. In: Int'l Archives of Photogrammetry and Remote Sensing, vol. 33, pp. 405–412 (2000)
9. Pless, R., Jurgens, D.: Road extraction from motion cues in aerial video. In: Proc. of the ACM Conf. on Geographic Information Systems, pp. 31–38 (2004)
10. Roth, M., Freeman, W.T.: Orientation histograms for hand gesture recognition. In: Intl. Workshop on Automatic Face and Gesture Recognition (1995)
11. Ballard, D.H.: Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 13(2), 111–122 (1981)
12. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Vitányi, P.M.B. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 23–37. Springer, Heidelberg (1995)
13. Mallick, S., Zhu, Y., Kriegman, D.: Detecting particles in cryo-EM micrographs using learned features. Journal of Structural Biology 145(1-2), 52–62 (2004)
Registration of 2D Histological Images of Bone Implants with 3D SRµCT Volumes

Hamid Sarve1, Joakim Lindblad1, and Carina B. Johansson2

1 Centre for Image Analysis, Swedish University of Agricultural Sciences, Box 337, SE-751 05 Uppsala, Sweden
{hamid,joakim}@cb.uu.se
2 Department of Clinical Medicine, Örebro University, SE-701 85 Örebro, Sweden
[email protected]
Abstract. To provide better insight into bone modeling and remodeling around implants, information is extracted using different imaging techniques. Two types of data used in this project are 2D histological images and 3D SRµCT (synchrotron radiation-based computed microtomography) volumes. To enable a direct comparison between the two modalities and to bypass the time-consuming and difficult task of manual annotation of the volumes, registration of these data types is desired. In this paper, we present two 2D–3D intermodal rigid-body registration methods for this purpose. One approach is based on Simulated Annealing (SA) while the other uses Chamfer Matching (CM). Both methods use Normalized Mutual Information for measuring the correspondence between an extracted 2D slice from the volume and the 2D histological image, whereas the latter approach also takes the edge distance into account for matching the implant boundary. To speed up the process, part of the computations are done on the Graphics Processing Unit. The results show that the CM-approach provides a more reliable registration than the SA-approach. The registered slices with the CM-approach correspond visually well to the histological sections, except for cases where the implant has been damaged.
1 Introduction
With an aging and increasingly osteoporotic population, bone implants are becoming more important to ensure the quality of life. We aim to improve the understanding of the mechanisms of implant integration. This involves studying the regeneration of bone-tissue in the proximity of an implant. Histomorphometrical data (bone to implant contact and bone area in the proximity of the implant) are today extracted from histologically stained undecalcified cut and ground sections of the implants, imaged using traditional microscope (see Fig. 1b). However, we foresee that combining information obtained with a number of different techniques will help to gain further understanding of the integration of biomaterials; we combine the traditional 2D data with acquired 3D SRµCT (synchrotron radiation-based computed microtomography) data (see
Fig. 1. (a) Visualization of an SRμCT volume (b) Histological section (c) Illustration of the implant, regenerated bone, a possible slice (IV ), the volume (V ) and the implant symmetry axis (ISA)
Fig. 1a). Comparing bone area measurements obtained on the 2D sections with bone volumes obtained on the 3D reconstructed volumes is of immediate interest. For the 2D histological sections, a ground truth can be obtained by experts segmenting the images. However, manual annotation is very time consuming and difficult for the volumes due to the lack of histological information. This obstacle can be bypassed by finding a slice in the volume which corresponds to the 2D histological image, for which ground truth exists and which is more easily analyzed. Doing so will also enable a direct comparison of estimates from the two modalities. In this work we present a method for registering 2D histological images of bone implants with corresponding 3D SRµCT volumes. In the following section we describe previous work in this field. In Sect. 4 we describe the two developed methods and in Sect. 5 we evaluate these methods. The registration results are shown in Sect. 6. Finally, in Sect. 7 we discuss the results and the validation of the methods.
2 Background
Image registration is the task of finding a geometrical transformation to align two images. In the rigid-body case, the transformation includes rotation and translation. Medical image registration methodology and various methods are described in [1] and [2]. The diversity in modalities has, during the past 20 years, caused a need for intermodal and interdimensional registration. A common 2D–3D registration task is registering CT with X-ray fluoroscopy [3,4,5,6,7]. We focus our interest on registration of SRμCT volumes with histologically stained microscopy images, which has, to the best of our knowledge, not gained much attention. The methods for registration are in [2] classified as Point-based methods, Surface-based methods and Intensity-based methods. Intensity-based registration finds a transformation which maximizes a similarity measure. A number of works use Normalized Mutual Information (NMI) [8]. A reason for NMI's popularity is its ability to measure the amount of information the two images
have in common independently of their modality. An overview of works on NMI-based registration is found in [9]. Commonly used algorithms for finding the sought-after transformation involve Simulated Annealing (SA) [10], Genetic Algorithms [11], and Powell's Method (PM) [12]. In [13], Lundqvist evaluates PM and SA for registration and shows that SA performs better than PM. Another registration method is Chamfer Matching (CM), introduced by Tenenbaum et al. [14]. As the method requires pre-segmentation, it is suitable for tasks where segmentation of the objects of interest is easily performed. It is shown in [15] that CM is feasible and efficient for CT and PET lung image registration. Registration of large 3D volumes has long been a cumbersome task. When implemented on the CPU, extracting a slice with arbitrary angle and translation from a volume is a time-consuming task. This operation is executed frequently in this registration process. However, the rapidly growing texture memory on graphics cards over the past decade has made it possible to perform operations on large 3D volumes on their programmable Graphics Processing Unit, GPU. These processors, having a parallel and pipelined architecture, provide computational advantages over traditional CPUs [16]. A number of works have taken advantage of the computational power of the GPU. They mainly use the GPU power for the creation of Digitally Reconstructed Radiographs (DRR) [5,6,17]. Köhn et al. [18] present a 2D–3D rigid registration on the GPU based on regularized gradient flow. In this work, we utilize the GPU for extracting 2D slices from a 3D volume, which outperforms the CPU implementation by more than an order of magnitude.
3 Materials and Imaging
Pure titanium screws (diam. 2.2 mm, length 3 mm) are inserted in the femur condyle region of 12-week-old rats for 4 weeks. After retrieval, eight condyles are immersed in fixative and embedded in resin. All samples are imaged with the SRµCT device of the GKSS1 at beamline W2 using a photon energy of 50 keV. The tomographic scans are acquired with the axis of rotation placed near the border of the detector, and with 1440 equally stepped radiograms obtained between 0° and 360°. Before reconstruction, combinations of the projections from 0°–180° and 180°–360° are built. A filtered back projection algorithm is used to obtain the 3D data of X-ray attenuation for the samples. The field of view of the X-ray detector is set to 6.76 mm × 4.51 mm (width × height) with a pixel size of 4.40 µm, giving a measured spatial resolution of about 10.9 µm. After the SRµCT imaging, the samples are divided in the mid region (longitudinal direction of the screws). One 10 µm undecalcified section with the implant in situ is prepared from approximately the mid portion of each sample [19] (a possible section is illustrated in Fig. 1c). The section is routinely stained in a mixture of Toluidine blue and pyronin G, resulting in various shades of purple stained bone tissue and light-blue stained soft tissue components. Finally,
1 Gesellschaft für Kernenergieverwertung in Schiffbau und Schiffahrt mbH at HASYLAB, DESY, in Hamburg, Germany.
samples are imaged in a light microscope, generating color images with a pixel size of 9.17 µm (see Fig. 1b).
4 Registration
The task of registration in this work is to find a slice I_V extracted from the SRμCT volume V, such that I_V is most similar to the 2D histological image, I_H. I_V is extracted using a function, T(V, p), which returns a slice from V given by the parameter vector p = (x, y, z, φ, θ, γ), where (x, y, z) are the translations of the slice along each axis and (φ, θ, γ) the rotations about the axes. The function T is implemented on the GPU by means of OpenGL 3D textures. In total, the parameter space has six degrees of freedom. As the search space is too large for finding a globally optimal solution, we search for a good solution, i.e., an I_V highly similar to I_H. Two approaches are evaluated for this purpose: Chamfer Matching (CM) and Simulated Annealing (SA), described in Sect. 4.2 and 4.3 respectively.
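The authors implement T(V, p) on the GPU with OpenGL 3D textures; the sketch below is only a CPU-side illustration of the same operation using trilinear interpolation via SciPy. The axis conventions, the Euler-angle order, and the centring of the slice grid are our assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.spatial.transform import Rotation

def extract_slice(volume, p, slice_shape):
    """CPU sketch of T(V, p): sample a 2D slice from volume V with translation
    (x, y, z) and rotations (phi, theta, gamma), given here in degrees.
    volume is assumed to be indexed as (z, y, x)."""
    x, y, z, phi, theta, gamma = p
    h, w = slice_shape
    # Grid of slice coordinates centred at the origin, lying in the z = 0 plane
    jj, ii = np.meshgrid(np.arange(w) - w / 2.0, np.arange(h) - h / 2.0)
    pts = np.stack([jj.ravel(), ii.ravel(), np.zeros(h * w)], axis=1)   # (N, 3) as (x, y, z)
    Rm = Rotation.from_euler('xyz', [phi, theta, gamma], degrees=True).as_matrix()
    pts = pts @ Rm.T + np.array([x, y, z]) + np.array(volume.shape[::-1]) / 2.0
    coords = pts[:, ::-1].T                      # map_coordinates expects (z, y, x) order
    return map_coordinates(volume, coords, order=1, mode='constant').reshape(h, w)
```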
4.1 Similarity Measures
Measuring the similarity between I_V and I_H is a principal part of the registration process. The CM-approach uses the edge distance (ED) given in [20] for measuring the dissimilarity of the implant edges of I_V and I_H. It is calculated as

ED(I_d, I_e) = (1/n_e) Σ_x I_d(x)² I_e(x),

where I_d is the distance-transformed edge image of the segmented implant in I_H, I_e is the binary edge image of the segmented implant in I_V, and n_e is the number of edge pixels. A perfect match implies ED = 0. In addition to ED, NMI is also used by the presented CM-approach to measure the similarity of the bone tissue regions. It is calculated as

NMI(I_V, I_H) = [ Σ_{i=1}^{g} H_H(i) log H_H(i) + Σ_{j=1}^{g} H_V(j) log H_V(j) ] / [ Σ_{i=1}^{g} Σ_{j=1}^{g} H_VH(i, j) log H_VH(i, j) ],    (1)

where H_VH is the joint histogram of I_V and I_H, H_V and H_H are their respective marginal histograms, and g is the number of grayscale levels. A perfect registration implies NMI = 2. The SA-approach uses NMI only. In order to apply the NMI similarity measure, the histological color images are transformed to grayscale equivalents. The commonly used color-to-grayscale transformations for natural images are based on human perception, where the green channel has high influence, and are not suitable for this application. A transformation that is adjusted to our purpose is derived as I_H = I_H^RGB · [0.01 0.3 0.69]^T. The chosen factors were empirically shown to give a high similarity value for a slice aligned to the 2D histological image and a low similarity value for a misaligned slice.
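A minimal sketch of the NMI of Equation (1) computed from a joint grayscale histogram, together with the weighted color-to-grayscale conversion, is shown below. The histogram binning, the epsilon guard against log(0), and the function names are our assumptions; the paper's GPU pipeline is not reproduced.

```python
import numpy as np

def to_gray(rgb):
    """Weighted color-to-grayscale conversion I_H = I_H^RGB . [0.01, 0.3, 0.69]^T."""
    return rgb @ np.array([0.01, 0.3, 0.69])

def nmi(a, b, g=256):
    """Sketch of Eq. (1): normalized mutual information of two grayscale images
    with values in [0, g). Returns 2 for identical images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=g, range=[[0, g], [0, g]])
    joint /= joint.sum()                             # joint probability H_VH
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)    # marginal histograms
    eps = 1e-12                                      # guard against log(0)
    num = np.sum(pa * np.log(pa + eps)) + np.sum(pb * np.log(pb + eps))
    den = np.sum(joint * np.log(joint + eps))
    return num / den
```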
4.2 Chamfer Matching Approach
Our CM-approach (see Alg. 1) proceeds from the hierarchical CM-method proposed by Borgefors [20]. It is divided into two steps: first, a slice that minimizes ED is found by chamfer matching; second, the sought-after registered slice is found by rotating the matched slice about the implant axis to maximize NMI. The implant is well distinguishable in both V and I_H and easily segmented by thresholding. A Euclidean distance transform is applied to the edge image of the segmented implant in I_H. The edge image is computed as the inner 4-connected contour of the segmented implant. In [20], a hierarchical structure is suggested to reduce the computational load. However, as our method is GPU-accelerated, rather than having a resolution hierarchy, the resolution is kept constant but at each hierarchy level l the step sizes Δ_ν^l of parameter ν ∈ {x, y, z, φ, θ, γ} are decreased to facilitate successive refinement of the matching. The algorithm searches for a slice with minimum ED in a gradient descent manner (see Alg. 1), where the neighborhood for each level l is defined as N_l(p) = {q | q_ν − p_ν ∈ {−Δ_ν^l, 0, Δ_ν^l} ∀ν}. The slice with the lowest ED at the final level is chosen as the matched slice, I_M. To improve the result, the search is initialized with a set of slices, as proposed in [20]. After each level, the k best results are chosen to be the initial slices for the next level.

Algorithm 1. Chamfer matching of implant boundaries of V and I_H
Input: I_d: Euclidean distance transformation of the implant contour in I_H; V_C: segmented implant boundary in the 3D SRμCT volume
Output: I_M: matched 2D slice
Parameters: n_l: number of levels; n_i^0: number of iterations in level 0; Δ^0: initial step sizes for the transformation parameters; f_CM: step size decline factor ∈ [0, 1]; P_0: vector of k_0 initial parameters for level 0
1. for all l = 1, 2, ..., n_l do
       keep the k_l best results in P_l
       for all p ∈ P_l do
           for all i = 1, 2, ..., n_i do
               p' = argmin_{q ∈ N_l(p)} ED(T(V_C, q), I_d)
               if p' = p then n_i = i, break
               else p = p'
           insert p in P_{l+1}
2.     Δ^{l+1} = Δ^l · f_CM;  n_i^{l+1} = n_i^l · (1/f_CM);  if k_l ≥ 2 then k_{l+1} = k_l / 4
3. I_M = T(V_C, p)
In the mid-implant region, the distance between the centers of two thread crests is approximately 0.4 mm, which corresponds to about 35 pixels in the full-scale
SRμCT volume. As it takes 360 degrees of rotation about the implant symmetry axis (ISA) to travel from one thread crest center to another, the implant can rotate up to 10 degrees about the ISA before the implant edge is shifted one pixel. This means that a registration based on implant matching only may match the implant to one of several slices with minimum ED. Hence, the matched slice must be somewhat rotated about the ISA in both directions in order to find the most similar slice, I_V. As the edge distance will be roughly the same for small rotations, the bone region information needs to be taken into account to determine the rotation. This is done by measuring the NMI (described in 4.1) for I_M rotated ±20° in Δ_r steps about the ISA. The I_M with the rotation about the ISA which yields the highest NMI is selected as I_V. The ISA vector is calculated by a Principal Component Analysis; each voxel of the segmented implant is considered a data point and its coordinate (x, y, z) is saved in a matrix M. The principal axis along which the variance of the segmented implant is largest is computed as the normalized eigenvector with the highest corresponding eigenvalue of the covariance matrix of M.
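As an illustration of the ISA computation, the sketch below runs PCA on the voxel coordinates of a binary implant mask and returns the normalized principal axis; the mask layout and the function name are our assumptions.

```python
import numpy as np

def implant_symmetry_axis(mask):
    """Sketch of the ISA computation: PCA on the voxel coordinates of the
    segmented implant; the eigenvector with the largest eigenvalue of the
    coordinate covariance matrix is the (normalized) symmetry axis."""
    coords = np.argwhere(mask)                  # (n_voxels, 3) voxel coordinates
    coords = coords - coords.mean(axis=0)       # centre the point cloud
    cov = np.cov(coords, rowvar=False)          # 3x3 covariance matrix of M
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    axis = eigvecs[:, np.argmax(eigvals)]
    return axis / np.linalg.norm(axis)
```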
4.3 Simulated Annealing
SA is a heuristic optimization algorithm which mimics the physical act of annealing. This implementation of SA (see Alg. 2), as used for registration, starts with an initial temperature, T_0. As the annealing proceeds, the temperature is reduced at each iteration step and a candidate slice is extracted randomly with transformation parameters from the neighborhood N_{l,T}(p) = {q | −(T/T_0)·Δ_ν^l ≤ q_ν − p_ν ≤ (T/T_0)·Δ_ν^l ∀ν ∈ {x, y, z, φ, θ, γ}}, where T is the current temperature. The similarity between the candidate slice and I_H is measured using NMI. The candidate slice is accepted depending on a probability function. The higher the temperature, the higher the probability that a less similar state is accepted. After each step the temperature is decreased by the temperature decline factor f_T. When the final temperature T_e is reached, the annealing stops and the candidate slice with the highest NMI is selected as the registered slice. Analogously to the CM-approach, a pseudo-hierarchical structure is implemented; n_l re-annealings are performed and at each level l the resolution is kept constant but Δ_ν^l is decreased by a factor f_SA.
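A compact sketch of the annealing loop of Algorithm 2 is given below. It assumes a helper nmi_of(p) that extracts the slice T(V, p) and scores it against I_H; the uniform neighbourhood sampling, the random generator, and the bookkeeping of the best candidate are our assumptions about details the paper leaves implicit.

```python
import numpy as np

def sa_register(nmi_of, p0, delta0, T0=1e-3, Te=2e-5, fT=0.997,
                n_levels=3, fSA=0.5, rng=None):
    """Sketch of Algorithm 2: simulated annealing over the 6-parameter slice pose.
    nmi_of(p) is assumed to return NMI(T(V, p), I_H)."""
    rng = rng or np.random.default_rng()
    p, delta = np.asarray(p0, float), np.asarray(delta0, float)
    best_p, best_score = p.copy(), nmi_of(p)
    for level in range(1, n_levels + 1):
        T = T0 / level
        while T > Te / level:
            # Candidate drawn uniformly from the temperature-scaled neighbourhood N_{l,T}(p)
            cand = p + rng.uniform(-1, 1, size=p.shape) * (T / T0) * delta
            d_nmi = nmi_of(cand) - nmi_of(p)
            if np.exp(d_nmi / T) > rng.uniform():   # accept better moves, sometimes worse ones
                p = cand
                score = nmi_of(p)
                if score > best_score:
                    best_p, best_score = p.copy(), score
            T *= fT                                 # cooling step
        delta = fSA * delta                         # refine the neighbourhood at the next level
    return best_p, best_score
```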
5 Evaluation
As no ground truth exists for the studied data set, the evaluation and verification of the methods are complicated. Our approach is to evaluate the methods on monomodal data where a ground truth can be created; a slice with known transformation parameters, I_V, is extracted and registered with V using the presented methods. The distance between the retrieved slice, I_V', and I_V, denoted D, is considered the registration error, which gives an indication of how well the registration performs. D is calculated as

D = (1/4) Σ_{j=1}^{4} |(x_j y_j z_j)^T − (x'_j y'_j z'_j)^T|,

where x_j, y_j, z_j denote the coordinates of corner j of the extracted slice and
Algorithm 2. Registration using Simulated Annealing
Input: I_H: 2D histological image; V: 3D SRμCT volume
Output: I_V: registered 2D SRμCT slice
Parameters: p_0: initial parameters; Δ^0: initial step range; f_SA: step decline factor; T_0: initial temperature; T_e: final temperature; f_T: temperature decline factor; n_l: number of levels
1. P = ∅, p = p_0
2. for all l = 1, 2, ..., n_l
       T = T_0 / l
       repeat until T ≤ T_e / l
           random p' ∈ N_{l,T}(p)
           ΔNMI = NMI(T(V, p'), I_H) − NMI(T(V, p), I_H)
           if e^{ΔNMI/T} > random q ∈ U(0, 1) then p = p', add p to P
           T = T · f_T
       Δ^{l+1} = f_SA · Δ^l
3. I_V = T(V, p*), where p* = argmax_{p* ∈ P} NMI(T(V, p*), I_H)
analogously, x'_j, y'_j, z'_j are the corner coordinates of I_V'. The distance is calculated on volumes with normalized dimensions 1 × 1 × 1. To stress the registration, we also add Gaussian noise of different reasonable magnitudes to V and I_V prior to the segmentation step.
6 Results
The parameters for the CM-approach are chosen as: P_0 = (0, 0, p_z, p_φ, p_θ, 0) where p_θ ∈ {−90°, 0°, 90°, 180°}, p_φ ∈ {−15°, 15°}, p_z ∈ {−15, 15}, Δ_x^0 = Δ_y^0 = Δ_z^0 = 2, Δ_φ^0 = Δ_γ^0 = 2.5, Δ_θ^0 = 8°, n_l = 6, n_i^0 = 20, f_CM = 0.5 and Δ_r = 0.125. As for the SA-approach, p_0 was set to a random transformation, n_l = 3, T_0 = 0.001, T_e = 0.00002, f_T = 0.997 and f_SA = 0.5; Δ_ν^0 is set to 1/32 of the dimension for ν ∈ {x, y, z}, and Δ_θ^0 = 22.5°, Δ_φ^0 = Δ_γ^0 = 3°.

Table 1. Averages for successful registrations over 24 slices and percent failed registrations. Zero-mean Gaussian noise with variance σ is added. Average time consumption, t̄, per registration is measured for a 256×256×256 volume on a 3.6 GHz Intel Xeon CPU (6 GB RAM) and an nVidia Quadro FX 570 graphics card.

       |         σ = 0             |         σ = 0.01          |         σ = 0.05          |
Appr.  | D      NMI    ED    Fail  | D      NMI    ED    Fail  | D      NMI    ED    Fail  | t̄ (min)
CM     | 0.60%  1.332  0.74   4.2% | 0.58%  1.118  0.79   8.3% | 1.64%  1.088  1.05  16.7% |  4.3
SA     | 0.27%  1.421  0.94  33.3% | 0.40%  1.121  0.97  62.5% | 0.58%  1.090  1.09  79.2% |  4.3
SA†    | 0.18%  1.480  0.91  25.0% | 0.31%  1.122  0.92  66.7% | 0.49%  1.089  1.12  83.3% | 30.3
†: Slower cooling scheme is applied (f_T = 0.9997)
Fig. 2. Histological sections (left); registered slices using the SA-approach (middle) and the CM-approach (right), given similar time constraints (f_T = 0.997). Per-panel similarity values: (b) NMI = 1.079, ED = 1.693; (c) NMI = 1.078, ED = 1.157; (e) NMI = 1.100, ED = 1.787; (f) NMI = 1.088, ED = 1.150; (h) NMI = 1.058, ED = 3.791; (i) NMI = 1.071, ED = 1.081; (k) NMI = 1.103, ED = 1.628; (l) NMI = 1.087, ED = 1.263; (n) NMI = 1.094, ED = 3.269; (o) NMI = 1.070, ED = 5.907. The implant in (m) was damaged during the cutting process.
These settings are adjusted to achieve a good trade-off between speed and performance, as well as roughly similar time consumption for the two approaches. The SA-approach is also evaluated with a slower cooling, f_T = 0.9997. The results of the evaluation are summarized in Table 1. Both of the presented methods may get stuck in a local optimum, which can be far away from a correct match. If D > 5% we classify that registration as failed and exclude it from the listed averages. The resulting images of the registration of six volumes with corresponding 2D histological images are shown in Fig. 2.
7 Summary and Conclusions
Two rigid-body 2D–3D intermodal GPU-accelerated registration methods for 3D SRμCT volumes and 2D histological sections of implants are presented and evaluated. The evaluation shows that the CM-approach is more robust; it has a higher success rate than the SA-approach on monomodal data, given similar time constraints. However, when the SA-approach does find a correct match, it provides higher precision. The CM-approach is, in contrast to SA, deterministic; for a given input it always provides the same output, which is of great value in that it provides reproducibility of results. On the other hand, CM requires a segmentation of the images and is hence recommended for registration tasks where segmentation is easily carried out. A visual examination of the results confirms the robustness of the CM-approach. The images registered by the CM-approach correspond well to the respective histological sections, except where the implant has been damaged during the cutting process (one such example is shown in Fig. 2m). The SA-approach proves to be less reliable. Comparing Figs. 2k and 2l hints at a discrepancy between visual impression and the NMI-measure, indicating that using NMI alone may not be the best option.

Future Work. Preliminary studies indicate that the CM-approach can be improved by distance transforming the implant boundary of the volume instead of the histological image. This would possibly improve the results for images of damaged implants but increases the computational load. Future work involves segmentation of the bone regions of the SRμCT data, quantification of the bone tissue of the 3D volumes, and comparison of these with manually obtained quantifications of the histological data.

Acknowledgments. Research technicians Petra H. Johansson and Ann Albrektsson are greatly acknowledged for skillful sample preparations. Dr. Ricardo Bernhardt and Dr. Felix Beckmann are acknowledged. Martin Ericsson is acknowledged for OpenGL support. This work was supported by grants from The Swedish Research Council, 621-2005-3402, and was partly supported by the IASFS project RII3-CT-2004-506008 of the Framework Programme 6.
References
1. Hajnal, J.V., et al.: Medical Image Registration. CRC Press, Boca Raton (2000)
2. Milan, S., Michael, F.: Handbook of Medical Imaging. SPIE Press (2000)
3. Zöllei, L., Grimson, E., Norbash, A., Wells, W.: 2D-3D rigid registration of X-ray fluoroscopy and CT images using mutual information and sparsely sampled histogram estimators. CVPR 2, 696 (2001)
4. Russakoff, D.B., Rohlfing, T., Calvin, R., Maurer, J.: Fast intensity-based 2D-3D image registration of clinical data using light fields. ICCV 01, 416 (2003)
5. Kubias, A., et al.: Extended global optimization strategy for rigid 2D/3D image registration. In: Kropatsch, W.G., Kampel, M., Hanbury, A. (eds.) CAIP 2007. LNCS, vol. 4673, pp. 759–767. Springer, Heidelberg (2007)
6. Ino, F., et al.: A GPGPU approach for accelerating 2-D/3-D rigid registration of medical images. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds.) ISPA 2006. LNCS, vol. 4330, pp. 939–950. Springer, Heidelberg (2006)
7. Knaan, D., Joskowicz, L.: Effective intensity-based 2D/3D rigid registration between fluoroscopic X-ray and CT. In: Ellis, R.E., Peters, T.M. (eds.) MICCAI 2003. LNCS, vol. 2878, pp. 351–358. Springer, Heidelberg (2003)
8. Studholme, C., Hill, D.L.G., Hawkes, D.J.: An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognition 32, 71–86 (1999)
9. Pluim, J., Maintz, J., Viergever, M.: Mutual-information-based registration of medical images: a survey. IEEE Trans. on Medical Imaging 22, 986–1004 (2003)
10. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220, 671–681 (1983)
11. Goldberg, D.: Genetic Algorithms in Optimization, Search and Machine Learning. Addison-Wesley, Reading (1989)
12. Powell, M.J.D.: An efficient method for finding the minimum of a function of several variables without calculating derivatives. Computer Journal 7, 152–162 (1977)
13. Lundqvist, R.: Atlas-Based Fusion of Medical Brain Images. PhD thesis, Uppsala University, Uppsala (2001)
14. Barrow, H.G., Tenenbaum, J.M., Bolles, R.C., Wolf, H.C.: Parametric correspondence and chamfer matching: Two new techniques for image matching. In: Proc. 5th Int. Joint Conf. Artificial Intelligence, pp. 659–663 (1977)
15. Cai, J., et al.: CT and PET lung image registration and fusion in radiotherapy treatment planning using the chamfer-matching method. International Journal of Radiation Oncology 43, 871–883 (1999)
16. Lejdfors, C.: High-level GPU Programming. PhD thesis, Lund University (2008)
17. Hong, H., Kim, K., Park, S.: Fast 2D-3D point-based registration using GPU-based preprocessing for image-guided surgery. In: Martínez-Trinidad, J.F., Carrasco Ochoa, J.A., Kittler, J. (eds.) CIARP 2006. LNCS, vol. 4225, pp. 218–226. Springer, Heidelberg (2006)
18. Köhn, A., et al.: GPU accelerated image registration in two and three dimensions. In: Bildverarbeitung für die Medizin 2006, pp. 261–265 (2006)
19. Johansson, C., Morberg, P.: Cutting directions of bone with biomaterials in situ does influence the outcome of histomorphometrical quantification. Biomaterials 16, 1037–1039 (1995)
20. Borgefors, G.: Hierarchical chamfer matching: A parametric edge matching algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence 10(6), 849–865 (1988)
Measuring an Animal Body Temperature in Thermographic Video Using Particle Filter Tracking

Atousa Torabi1, Guillaume-Alexandre Bilodeau1, Maxime Levesque2, J.M. Pierre Langlois1, Pablo Lema2, and Lionel Carmant2

1 Department of Computer Engineering and Software Engineering, École Polytechnique de Montréal, P.O. Box 6079, Station Centre-ville, Montréal (Québec), Canada, H3C 3A7
{atousa.torabi,guillaume-alexandre.bilodeau,pierre.langlois}@polymtl.ca
2 Pediatry, Sainte-Justine Hospital, 3175, Côte Ste-Catherine, Montréal (Québec), Canada, H3T 1C5
[email protected],
[email protected],
[email protected]
Abstract. Some studies on epilepsy have shown that seizures might change the body temperature of a patient. Furthermore, other works have shown that kainic acid, a drug used to study seizures, modifies the body temperature of a laboratory rat. Thus, thermographic cameras may have an important role in investigating seizures. In this paper, we present the methods we have developed to measure the temperature of a moving rat subject to seizures using a thermographic camera and image processing. To accurately measure the body temperature, a particle filter tracker has been developed and tested along with an experimental methodology. The obtained measures are compared with a ground truth. The methods are tested on a 2-hour video and it is shown that our method achieves promising results.
1 Introduction
Neonatal seizures are convulsive events in the first 28 days of life in term infants or, for premature infants, within 44 completed weeks of conceptional age. Neonatal seizures are the most frequent major manifestation of neonatal neurologic disorders [1]. Population-based studies of neonatal seizures in North America report rates between 1 and 3.5 per 1000 live births. Although initially thought to have little long-term consequence, there is more and more evidence that these seizures are deleterious to the developing brain [2]. Therefore, more emphasis is put on the treatment of these early-life seizures. However, due to the immature connections in the neonatal brain, these seizures exhibit unusual clinical patterns, mimic normal movements and have primitive EEG patterns that are not easily recognizable. Therefore, one would be required to monitor all at-risk newborns continuously to confirm the epileptic nature of their events. At Ste-Justine Hospital, this typically represents 40 patients at any one time.
Several authors (e.g., [3], [4], [5]) have been interested in the automatic detection of neonatal seizures using video recordings or electroencephalogram (EEG) pattern recognition. Preliminary data from our laboratories on an animal model of neonatal seizures suggest that body temperature monitoring could increase the automatic detection rate of significant seizure-related clinical events. The work of Sunderam et al. [6] has shown that seizures might change the body temperature of a patient. Furthermore, another work [7] has shown that kainic acid (KA), a drug used to study seizures, has an impact on the body temperature of a laboratory rat. Since laboratory rats are used as an animal model to understand seizures, we are interested in continuously monitoring their temperature to understand its evolution during seizures and under KA. This paper presents our temperature measurement methodology. There are few related works. In the work of Sunderam et al. [6], thermal images of the faces of six patients were acquired every hour and during seizure events as indicated by real-time EEG analysis. Thermal images were filtered manually to remove images where occlusion occurred. Since the face was in the middle of the image, the temperature measured is the maximum in the center region and there were no tracking requirements. Other works in avian flu [8] and breast cancer [9] detection using thermography have not addressed automated continuous temperature monitoring. In our case, we are interested in continually monitoring the temperature of a rat that can move inside a perimeter, so we have to devise a more automated tracking and measuring method. The paper is structured as follows. Section 2 presents our measurement methodology. Experimental results are presented and discussed in Section 3. Section 4 concludes the paper.
2 Methodology

In this section, we first present the acquisition setup and then we present our measurement methodology.

2.1 Data Acquisition
Our temperature sensor is a Thermovision A40M thermographic camera (FLIR Systems). Before acquiring animal videos, we first assessed the measurement error of the sensor for a still object in order to develop a baseline performance reference. From the manufacturer specifications, the accuracy is ± 2%, but the precision is not specified. To evaluate the camera’s precision, we captured thermographic images of a wood tabletop from a fixed point of view continuously (at 27 frames/s) for approximately 30 minutes in a room at about 24◦ C. The room temperature was not controlled. We selected an area of 20 × 20 pixels in the middle of the image. The camera was configured with a linear measurement range of 20◦ C to 40◦ C, and pixels were quantified with 8 bits. That is, pixel values of 0 and 255 correspond to temperatures of 20◦ C and 40◦ C, respectively. This range provides a reasonable interval around the expected rat body temperature of approximately 30◦ C. The interval between two adjacent pixel values is
0.078°C. Without averaging over a pixel region, the precision should be one-half of this interval, that is 0.039°C. By averaging over a region, we may obtain a slightly better precision. The temperature of the tabletop in each frame is estimated by calculating the mean of the 10 hottest pixels in the region of interest. Assuming that the temperature changes smoothly, and because the temperature is not controlled, we computed the regression of the data using a 7th order polynomial and computed the average fitting error. The average precision is the average fitting error, which is 0.021°C with a standard deviation of 0.026°C. Figure 1 shows the measured temperature and the fitted polynomial. We did not validate the accuracy as we do not have the equipment to do so, and in our measurements we are only interested in the temperature variation, not in its absolute value.
Fig. 1. Temperature measured of a tabletop during 1620 seconds, and fitted polynomial to evaluate the measurement error
To acquire thermographic images during animal experiments, the rat is placed in a metal mesh cubic cage with an open top (see Figure 2(a)) and the thermographic camera is angled down toward the cage. The camera is on a 525MV tripod (Manfrotto) and pointed toward the open top of the cage at an angle of approximately 20° from the vertical. During initial experiments, we determined that the rat's fur prevented precise measurements of the body temperature. Temperature measures on the whole body are not reliable as they depend on the thickness of the fur and the visible area. We concluded that the rat should have an area of approximately 10 cm² that is shaved in order to measure its temperature precisely. Indeed, the head of the rat, another interesting region which is warmer because it has less fur, is not always visible and is occluded by a device (Neuralynx Cheetah System) used to record local field potentials (LFP) inside the rat's brain. Observing the rat from the top and shaving a region on its back gives better results, as this region is almost constantly visible since the rat tends to remain on its four feet. However, using this strategy means that we have to use a tracking algorithm to follow the shaved patch and discriminate it from the head. Figures 2(b) and 2(c) show typical frames that must be processed.
Fig. 2. Experimental setup. (a) Camera setup and mesh cage, (b) and (c) two frames of shaved patch to track without occlusion and with occlusion from the LFP recording device.
2.2 Particle Filter Tracking
For our purpose, the tracking algorithm must not lose track of the patch for the duration of the video and it must not be distracted by other hot areas like the head, which is not exactly at the same temperature. We applied a particle filter tracker [10] adapted to our tracking problem. Particle filter tracking is robust to sudden movements and occlusions during long video sequences, and thus should be appropriate for our conditions. Our tracking algorithm is based on the following assumptions:

– the images are grayscale, with white (255) meaning hot, and black (0) meaning cold in a given range;
– the temperature of the rat is higher than its surroundings, particularly for the shaved patch and the head;
– the shaved patch can be occluded by the device that records LFP signals on the rat's head;
– the rat is always in the camera's field of view.

The particle filter is a Bayesian tracking method, which recursively at each time t approximates the state of the tracking target as a posterior distribution using a finite set of weighted samples. A sample is a prediction about the state of the tracking target. In our work, the tracking target is modeled by an ellipse fitted to control points which lie on the edges of the perimeter of the shaved patch (Figure 3). The ellipse is defined by

Ax² + Bxy + Cy² + F = 0.    (1)
The parameters A, B, C and F are computed to optimize the ellipse fitting. The particle filter state at each time step t is defined as a vector X_t of control points:

X_t = (Cp_x(t), Cp_y(t))    (2)

where Cp_x(t) and Cp_y(t) are the vectors of control point coordinates. We used the intensity and edge features as the observation model for our particle filter
Fig. 3. Fitted ellipse to control points on patch perimeter
tracking method. For the edge feature, the dot product of the unit normal vector of the ellipse at the control point positions and the image gradient at the same positions is used as a measure. The dot product is used because a large gradient magnitude around the patch's perimeter is desirable and the gradient direction should be perpendicular to the perimeter. The gradient measure φ_g(s) for a given sample s is

φ_g(s) = (1/N_cp) Σ_{k=1}^{N_cp} |n_k(s) · g_k(s)|,    (3)
where n_k(s) is the unit normal vector of the ellipse at the k-th control point, g_k(s) is the image gradient at the same point, and N_cp is the number of control points. To facilitate the fusion of the edge and intensity features, the gradient measure is normalized by subtracting the minimum gradient measure of the sample set S and dividing by the gradient measure range, as follows:

φ̄_g(s) = [φ_g(s) − min_{s_j ∈ S} φ_g(s_j)] / [max_{s_j ∈ S} φ_g(s_j) − min_{s_j ∈ S} φ_g(s_j)].    (4)
For the intensity feature, the Euclidean distance between the average intensity μ_s of the sample ellipse pixels and the average intensity μ_r of the template ellipse pixels is used as a measure. The template ellipse is the result of fitting an ellipse to the chosen control points in the first frame of the video sequence. The intensity measure φ_i(s) for a given sample s is

φ_i(s) = |μ_s − μ_r|.    (5)
The intensity measure is normalized as follows:

φ̄_i(s) = [max_{s_j ∈ S} φ_i(s_j) − φ_i(s)] / max_{s_j ∈ S} φ_i(s_j).    (6)
At time t, samples are selected with replacement from the sample set

S_{t−1} = {x_j^{t−1}, w_j^{t−1}}_{j=1}^{N},    (7)
where N is the number of samples, x_j^{t−1} is the control point coordinates of the j-th sample at time t − 1, and w_j^{t−1} is its corresponding weight. The sample set S_{t−1} is the approximation of the posterior distribution of the target state at time t − 1. N_{s_j} samples are chosen with probability w_j^{t−1}, which is derived for sample j by

w_j^{t−1} = φ̄_g(s_j^{t−1}) + φ̄_i(s_j^{t−1}).    (8)

This means that samples with high weights may be chosen several times (i.e., identical copies of one sample) and samples with small weights may not be chosen at all. In two consecutive frames, the particle filter state does not change dramatically. It is mostly a translation along the ellipse major and minor axes and a rotation around its center. In each time step t, samples are propagated in state space using a first-order auto-regressive dynamical model defined as

X_t = X_{t−1} + ω_t,    (9)
where X_t and X_{t−1} are the particle filter states at time t and t − 1 respectively, and ω_t is the stochastic part of the dynamical model. It is a multivariate Gaussian random variable and it corresponds to random translations and rotations of the ellipse. To measure the temperature, the best sample is chosen by

s_b = argmax_{s_j ∈ S} [φ̄_g(s_j) + φ̄_i(s_j)].    (10)

The algorithm is the following:
1. Initialization. Manually select control points on the perimeter of the shaved patch and fit the ellipse.
For each new frame:
2. Threshold the infrared image to keep the hottest regions (pixels with intensity more than 80). Find the gradient using a Sobel edge detector.
3. Select samples from the sample set S_{t−1} based on the weights (Equation 8).
4. Propagate samples using Equation 9.
5. Compute the observation measurement to update the new sample weights using Equation 8.
6. Choose the best sample by Equation 10 and calculate its temperature T_A by

T_A(f) = T_min + (mean(A)/255) · (T_max − T_min),    (11)
where Tmin and Tmax are the minimum and maximum values of the temperature range selected for the camera and mean(A) is the average intensity of the tracked region pixels. The area A for measuring the temperature is bounded by the ellipse fitted to the control points of the best sample.
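The sketch below illustrates one iteration of steps 3–6 for a generic particle filter of this kind: resampling by weight, propagation with random noise, re-weighting with the combined measure, and selection of the best sample. It uses only translational noise (the paper also applies random rotations of the ellipse), and it assumes a helper measure(state) returning φ̄_g + φ̄_i for one candidate state; all names are ours.

```python
import numpy as np

def particle_filter_step(particles, weights, measure, sigma_t=2.0, rng=None):
    """One tracking iteration (steps 3-6). particles: (N, n_cp, 2) control-point
    states; weights: (N,) from the previous frame; measure(state) -> float."""
    rng = rng or np.random.default_rng()
    N = len(particles)
    # 3. Resample with replacement proportionally to the weights (Eq. 8)
    idx = rng.choice(N, size=N, p=weights / weights.sum())
    particles = particles[idx]
    # 4. Propagate: X_t = X_{t-1} + omega_t (Eq. 9), translation-only noise here
    particles = particles + rng.normal(0.0, sigma_t, size=particles.shape)
    # 5. Update weights with the observation model (edge + intensity measures)
    weights = np.array([measure(s) for s in particles])
    # 6. The best sample maximizes the combined measure (Eq. 10)
    best = particles[int(np.argmax(weights))]
    return particles, weights, best
```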
3 Experimentation
We now present the experimentation methodology, results and discussion.
3.1 Experimentation Methodology
To test our measurement method, we shot a 1 hour and 57 minute video of a Sprague-Dawley rat (Charles River Laboratories) during an experiment using 6 mg/kg i.p. of KA (Sigma-Aldrich Canada Ltd). All experimental procedures conformed to institutional policies and guidelines (Sainte-Justine Research Center, Université de Montréal, Québec, Canada). The camera setup was described in Section 2.1. The video is 90706 frames at 12.89 frames per second with a 320 × 240 resolution and is compressed with the Xvid FFDshow encoder (Quality: 100%) (http://sourceforge.net/projects/ffdshow). Our particle filter tracker is implemented in Matlab (The MathWorks) and is run on a Xeon 5150 2.66 GHz computer (Intel). For particle filter tracking, 8 control points are used for ellipse fitting. Control points are chosen with approximately the same distance from each other to cover the entire patch perimeter. Since in our experiment patch translational motion is more frequent than rotational motion, 75% of the samples are specified for random translational motions and 25% of the samples for random translational+rotational motions. Our tracking algorithm is tested by changing the number of samples N (45, 100 or 210 samples) and the observation model (intensity or intensity+edge). The numbers of samples were chosen arbitrarily to show their effect on the tracking result and processing time. Temperature calculations were based on the 10 hottest pixels in the tracked area. T_min was 20 and T_max was 40. Since we have a large quantity of data, we used two metrics to measure the performance of our tracking algorithm. For the first metric, we generated a partial ground truth by selecting frames at random over the whole video sequence. The four corners of the patch were selected to build a bounding polygon and the temperature value was calculated as in Equation 11. This gives a set of ground truth temperatures T_GT. We selected F (F = 450) frames. The temperature measurements by the tracking algorithm for these frames were then compared with the ground truth. The evaluation metric is the root mean square error defined as

T_rms = sqrt( (1/F) Σ_{i=1}^{F} (T_A(i) − T_GT(i))² ).    (12)

The second metric is based on the assumption that the temperature of the rat's body changes smoothly. We computed the regression of the temperatures using a 23rd order polynomial (the largest well-conditioned polynomial). Then, we computed the fitting error. This gives the average precision μ_m and its standard deviation σ_m. This fitting error is then compared with the fitting error obtained for a static target (a tabletop, see Section 2.1) and for the ground truth. If tracking is good, we expect similar precision in the measurements.
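Both evaluation metrics are easy to reproduce; a minimal sketch is shown below. The rescaling of the time axis to [-1, 1] is our addition for numerical conditioning of the high-order fit, and the function names are ours.

```python
import numpy as np

def rms_error(T_A, T_GT):
    """Eq. (12): root mean square error against the ground-truth frames."""
    T_A, T_GT = np.asarray(T_A, float), np.asarray(T_GT, float)
    return np.sqrt(np.mean((T_A - T_GT) ** 2))

def fit_precision(times, temps, degree=23):
    """Second metric: regress the temperature series with a high-order polynomial
    and report the mean fitting error and its standard deviation."""
    t = np.asarray(times, float)
    t = 2 * (t - t.min()) / (t.max() - t.min()) - 1   # rescale for conditioning (assumption)
    coeffs = np.polyfit(t, temps, degree)
    residual = np.abs(np.polyval(coeffs, t) - np.asarray(temps, float))
    return residual.mean(), residual.std()
```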
3.2 Results and Discussion
Figure 4 shows the results obtained for our test video by applying particle filter tracking using a combination of intensity and edge as the observation model and 210 samples.
Fig. 4. Results from our tracker compared to ground truth. a) Temperature values obtained with our particle filter tracker and regression result. b) Error for the 450 ground truth points. c) Ground truth temperature values and regression result.
z−scores (σ) / Temprature (°C)
10 8 6 4 2 0 −2 −4 −6 LFP Intensity Temprature
−8 −10
1000
2000
3000
4000
5000
6000
Time (s)
Fig. 5. Results from our tracker synchonized with LFP recordings. Temperatures were shifted by -27◦ C. The LFP recordings were normalized around a mean μ of 0 and a standard deviation σ of 1 (Z-scores: z = x−μ ). Z-scores larger than ±2σ correspond to σ seizure events.
210 samples. The global decrease of the temperature between 0 and about 1800 seconds is caused by the KA. This phenomenon was previously observed [7] with a rectal thermometer at 15-30 minutes intervals. The local changes in temperature observed from 3000 seconds up to the end seem to be correlated with some seizure events (see figure 5). This conclusion is being further investigated with more experiments. By comparing figure 4a) and figure 4c), one can notice that the tracking result is noisier than the ground truth. Sudden drops of temperature, more than 1◦ C, are caused by tracking errors. In this experiment, occlusion by the LFP recording device caused the tracking errors around 150, 1850, 2100, 4850, 5100 seconds. In such cases, only a portion of patch is visible. During these times, particle filter tracking cannot track the patch correctly because of considerable changes in the patch shape and distraction by the rat’s head. The other reason of tracking errors was the frame rate which was only 12.89 frames per second because the thermographic images were captured simultaneously with other data on the same computer. This low frame rate caused tracking errors when the rat moved too fast (i.e. wet dog shakes following some seizures) because of large displacement in the images around 3800, 4200, 4550, 5350, 6000, 6050, and 6700 seconds.
Measuring an Animal Body Temperature in Thermographic Video
1089
Table 1. Average precision and root mean square error for each method and for the still object of section 2.1. N: Number of samples, Ob. model: Observation model, μ: average precision, σ: standard deviation, Trms : root mean square error, FPS: frames per second. Method Still obj. (sec.2.1) Ground truth Particle filter Particle filter Particle filter Particle filter
N Ob. model μm (σm ) (◦ C) Trms (◦ C) N/A N/A 0.021(0.026) N/A N/A N/A 0.079(0.100) 0.000 210 intensity 0.084(0.150) 0.104 210 intensity+edge 0.082(0.131) 0.087 100 intensity+edge 0.086(0.139) 0.117 45 intensity+edge 0.093(0.188) 0.103
Time (s) N/A N/A 59059 67909 32300 19989
FPS N/A N/A 1.513 1.317 2.767 4.472
Figure 4b) shows that after 3000 seconds, the tracking errors are mostly positive because of the rat’s quick movements which caused erroneously tracking the rat’s head which is slightly warmer than the shaved patch. It is noticeable that some negative errors are not represented on this graph since the ground truth is composed of points selected at random and do not include all the tracking errors. However, it indicates that negative errors are less frequent than positive errors caused by erroneously tracking the head. Negative errors mean that tracking is neither on the head, nor on the patch. Table 1 gives the values obtained for the metrics defined in section 3.1 and two associated processing times by changing the observation model and number of samples. Results show that by increasing the number of samples processing time increase but temperature measures change more smoothly and have average precision and standard deviation more similar to ground truth. We found out that a number of samples less than 45 is not sufficient for desirable tracking result and more than 210 does not improve tracking result but it increases the processing time. For 100 samples, we get a larger RMS error than for 45 samples because the ground truth is only of 450 points, which is small compared to 90706 measures with the tracker. Thus, it does not cover all the errors and most errors may be located where there are no ground truth points. Results vary for different tracker runs because of the random variable of equation 9. This is particularly the case when there are few samples like 45. However, with 100 or 210 samples, the precision is better and thus the results will be more stable and should be consider a better choice than 45 samples. For small number of samples, it is the risk that tracker fails by sudden rotation and movements of the rat’s body because samples are not distributed sufficiently in all directions around the tracking target. The best result that we obtained was with 210 samples using intensity+edge for observation model. Table 1 also shows that combining the intensity and edge features for observation model reduces RMS error compare to using only intensity feature. This is because modeling patch edges by an ellipse reduces tracking errors from the patch to the rat’s head, which sometimes can have similar intensities, but different shapes. It is interesting to notice that the ground truth precision is larger than the average precision obtained for a still object. At this point, we may hypothesize
1090
A. Torabi et al.
that is because the shaved area is deformable and its normal is not always aligned with the camera sensor’s normal. Thus, the infrared radiation measured by the camera changes with the angle of the shaved area. Furthermore, as the shaved area is deformable, the skin thickness may vary regularly as it stretches depending on the rat position and attitude. Another possibility is that seizure events cause temperature changes that violate the smoothness constraints and increase the fitting error. We will test a rat in a control condition (without KA) to verify the attainable precision with a moving target. Given these results with our equipment, capture setup, and assuming smooth temperature variations, we can expect to observe phenomena that cause sudden temperature changes over a few frames larger than 0.2◦ C.
4
Conclusion
This paper presented a methodology to measure the body temperature of a moving animal in a laboratory setting. Because of the experimental setup, uneven thickness of the fur with viewpoint and the possibility of occlusion, to measure the body temperature we needed to shave a region on rat’s back. Since the head and shaved region may have different temperatures, tracking is required to measure temperature on the same body region continuously. We proposed a particle filter tracker based on shaved region intensity and shape. Our method was tested on a 2-hour video sequence with a rat having seizures at regular intervals. Results show that our tracker achieves measurements with an RMS error less than 0.1◦ C. Errors are caused by severe occlusions or by quick rat motions (wet dog shakes). Although we estimate we can observe phenomena causing changes of more than 0.2◦ C, we do not obtain a precision similar to a still object. Part of this difference with camera precision is caused by the tracker, while another part is caused by other reasons. We hypothesize that change in the orientation of the measured surface causes measurement errors, so it may not be possible to attain the precision obtained on a still object. Furthermore, in the test video, temperature changes may not be smooth and they may increase the fitting error by a polynomial. As future works, we are interested in improving particle filter tracking by adding scaling to model motion to handle severe patch deformations because of changes in view angles. We will apply a filter to the temperature measures to remove outliers. We will also investigate the impact of changes of orientation of the measured surface. Finally, we are interested to apply this methodology in more experiments and automatically detect abnormal events based on changes in the body temperature.
References 1. Volpe, J.J.: Neonatal Seizures: Current Concepts and Revised Classification. Pediatrics 84(3), 422–428 (1989) 2. Carmant, L.: Mechanisms that might underlie progression of the epilepsies and how to potentially alter them. Advances in neurology 97, 305–314 (2006)
Measuring an Animal Body Temperature in Thermographic Video
1091
3. Karayiannis, N., Srinivasan, S., Bhattacharya, R., Wise, M., Frost, J.D., Mizrahi, E.: Extraction of motion strength and motor activity signals from video recordings of neonatal seizures. IEEE Transactions on Medical Imaging 20(9), 965–980 (2001) 4. Celka, P., Colditz, P.: A computer-aided detection of eeg seizures in infants: a singular-spectrum approach and performance comparison. IEEE Transactions on Biomedical Engineering 49(5), 455–462 (2002) 5. Faul, S., Boylan, G., Connolly, S., Marnane, L., Lightbody, G.: An evaluation of automated neonatal seizure detection methods. Clinical Neurophysiology (7), 1533–1541 (2005) 6. Sunderam, S., Osorio, I.: Mesial temporal lobe seizures may activate thermoregulatory mechanisms in humans: an infrared study of facial temperature. Epilepsy and Behavior 49(4), 399–406 (2003) 7. Ahlenius, S., Oprica, M., Eriksson, C., Winblad, B., Schultzberg, M.: Effects of kainic acid on rat body temperature: unmasking by dizocilpine. Neuropharmacology 43(1), 28–35 (2002) 8. Camenzind, M., Weder, M., Rossi, R., Kowtsch, C.: Remote sensing infrared thermography for mass-screening at airports and public events: Study to evaluate the mobile use of infrared cameras to identify persons with elevated body temperature and their use for mass screening. Technical Report 204991, EMPA Materials Science and Technology (2006) 9. Amalu, W.: Nondestructive testing of the human breast: the validity of dynamic stress testing in medical infrared breast imaging. In: Engineering in Medicine and Biology Society, 2004. IEMBS 2004. 26th Annual International Conference of the IEEE, vol. 2, pp. 1174–1177 (2004) 10. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998)
A New Parallel Approach to Fuzzy Clustering for Medical Image Segmentation Huynh Van Luong and Jong Myon Kim* School of Computer Engineering and Information Technology, University of Ulsan, Ulsan, Korea, 680-749
[email protected],
[email protected]
Abstract. Medical image segmentation plays an important role in medical image analysis and visualization. The Fuzzy c-Means (FCM) is one of the wellknown methods in the practical applications of medical image segmentation. FCM, however, demands tremendous computational throughput and memory requirements due to a clustering process in which the pixels are classified into the attributed regions based on the global information of gray level distribution and spatial connectivity. In this paper, we present a parallel implementation of FCM using a representative data parallel architecture to overcome computational requirements as well as to create an intelligent system for medical image segmentation. Experimental results indicate that our parallel approach achieves a speedup of 1000x over the existing faster FCM method and provides reliable and efficient processing on CT and MRI image segmentation. Keywords: Medical image segmentation, Fuzzy C-Means algorithm, parallel processing, data parallel architectures, MRI images.
1 Introduction Segmentation is an indispensable step in medical image analysis and visualization. It separates structures of interest from the medical images including organs, bones, different tissue types, and vasculature. Several different segmentation methods and approaches have been applied for different application domains. Some methods including histogram analysis, region growing, edge detection, and pixel classification have been proposed in the past [1]-[3], which use the local information and/or the global information for image segmentation. Some techniques using the neural network approach have also considered the problems inherent in image segmentation [4], [5]. Fuzzy clustering [6]-[8] is a suitable technique for medical imaging due to the limited spatial resolution, poor contrast, noise, and non-uniform intensity variation inherent in the medical images. Fuzzy clustering is a process in which the pixels are classified into the attributed regions based on the global information of gray level *
Corresponding author.
G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 1092–1101, 2008. © Springer-Verlag Berlin Heidelberg 2008
A New Parallel Approach to Fuzzy Clustering for Medical Image Segmentation
1093
distribution and spatial connectivity. One of the well-known fuzzy clustering algorithms is the Fuzzy c-Means (FCM) algorithm where c is a priori chosen number of clusters. The FCM algorithm allows overlapping clusters with partial membership of individuals in clusters. However, FCM requires tremendous computational and memory requirements due to the complex clustering process. Application-specific integrated circuits (ASICs) can meet the needed performance for such algorithms, but they provide limited, if any, programmability or flexibility needed for varied application requirements. General-purpose microprocessors (GPPs) offer the necessary flexibility and inexpensive processing elements. However, they will not be able to meet the much higher levels of performance required by high resolution and high frequency medical image and video data. This is because they lack the ability to exploit the full data parallelism available in these applications. Among many computational models available for imaging applications, single instruction multiple data (SIMD) processor arrays are promising candidates for application-specific applications including medical imaging since they replicate the data, data memory, and I/O to provide high processing performance with low node cost. Whereas instruction-level or thread-level processors use silicon area for large multi-ported register files, large caches, and deeply pipelined functional units, SIMD processor arrays contain many simple processing elements (PEs) for the same silicon area. As a result, SIMD processor arrays often employ thousands of PEs while possibly distributing and co-locating PEs with the data I/O to minimize storage and data communication requirements. This paper presents a new parallel implementation of the FCM algorithm to meet the computational requirements using a representative SIMD array architecture. This paper also evaluates the impact of the parallel approach on processing performance. This evaluation shows that our parallel approach achieves a speedup of 1000x over the existing faster FCM method and provides reliable and efficient processing on computerized tomography (CT) and magnetic resonance imaging (MRI) image segmentation. The rest of the paper is organized as follows. Section 2 presents background information of the FCM algorithm and the SIMD processor array used in this paper. Section 3 describes a parallel implementation of the FCM algorithm. Section 4 analyzes the performance of our parallel approach and compares our approach to other existing methods, and Section 5 concludes this paper.
2 Background Information 2.1 Image Segmentation with Fuzzy C-Means Algorithm Segmentation is an essential process of image analysis and classification, wherein the image pixels are segmented into subsets by assigning the individual pixels to clusters. Hence, segmentation is a process of portioning an image into some regions such that each region is homogeneous and none of the union of two adjacent regions is homogeneous. The FCM algorithm has been used with some success in image segmentation in general and also in medical image segmentation. The FCM algorithm [9] is an
1094
H.V. Luong and J.M. Kim
iterative algorithm of clustering technique that produces optimal c partitions and centers V={v1, v2,…, vc} that are exemplars and radii will define these c partitions. Let unlabelled data set X={x1, x2,…, xn} be the pixel intensity where n is the number of image pixels to determine their memberships. The FCM algorithm tries to partition the data set X into c clusters. The standard FCM objective function is: ( , )=∑
∑
||
||
(1)
We assume that the norm operator ||.|| represents the standard Euclidean distance = 1 and the degree of fuzzification m ≥1. with constraint ∑ A data point xk belongs to a specific cluster vc which is given by the membership value of the data point to that cluster. Local minimization of the objective function is accomplished by repeatedly adjusting the values of and according to the following equations: = ∑ =
||
||
||
||
∑ ∑
.
(2) (3)
is iteratively minimized, becomes more stable. Iteration of pixel groupings As ( ) ||<E, where V(t) is is terminated when the termination measure = || ( ) (t-1) is previous centers, and E is the predefined termination threshold. new centers, V It can be analyzed that these two equations of and bear heavy computational load for large data sets. Thus, this paper prefers to overcome the computational burden by using a parallel implementation of the FCM algorithm on a SIMD processor array system. The next section presents an overview of our baseline SIMD array architecture. 2.2
SIMD Processor Array Architecture
A block diagram of the SIMD model [10] used here is illustrated in Figure 1. This SIMD processor architecture is symmetric, having an array control unit (ACU) and an array consisting of processing elements (PEs). When data are distributed, the PEs execute a set of instructions in a lockstep fashion. With 4x4 pixel sensor sub-arrays, each PE is associated with a specific portion (4x4 pixels) of an image frame, allowing streaming pixel data to be retrieved and processed locally. Each PE has a reduced instruction set computer (RISC) data-path with the following minimum characteristics:
ALU - computes basic arithmetic and logic operations, MACC - multiplies 32-bit values and accumulates into a 64-bit accumulator, Sleep - activates or deactivates a PE based on local information, Pixel unit - samples pixel data from the local image sensor array, ADC unit - converts light intensities into digital values, Three-ported general-purpose registers (16 32-bit words), Small amount of local storage (256 32-bit words), Nearest neighbor communications through a NEWS (north-east-west-south) network and serial I/O unit.
A New Parallel Approach to Fuzzy Clustering for Medical Image Segmentation
1095
Neighboring PEs Comm. Unit
Register File 16 by 32 bit 2 read, 1 write
Arithmetic, Logical, and Shift Unit MACC MMX Local Memory
CFA
S&H and ADC
SP. Registers & I/O
Sleep
Decoder
Single Processing Element
Fig. 1. A block diagram of a SIMD processor array
3 Parallel Approach for Fuzzy Clustering To carry out two equations (2), (3) of the FCM algorithm, we consider all pixels of the images. Although this algorithm was designed for an OSCAR cluster using SPMD model and Message Passing Interface (MPI) [11], we need a new model to undertake these kinds of tasks with increasing availability of parallel computers. Thus, using the specified SIMD array, we distribute all pixels into all PEs in which every PE owns 16 pixels as illustrated in Figure 2. Assume n is the total number of pixels. As a result, the number of PEs involved in the computation is n/16. By dividing the pixels among n/16 processors, every PE caries out the computation only on the local memory containing 16 owned pixels along with their membership values as well as center values . Then, the FCM algorithm is implemented on n/16 processors in which some new equations are required for every PE. This enhances the performance and efficiency as compared to the sequential FCM algorithms.
Fig. 2. Distribution of image data points into each PE Node (N) in which each PE holds 4x4 pixels and all PEs process in parallel with a torus interconnection network
1096
H.V. Luong and J.M. Kim
From equations (2) and (3), we have: =
and
=
||) /(
⁄(|| ⁄ ||
∑
∑
|| ∑
=
∑
/
) /(
)
∑ /
∑
,
(4) (
∑
) (
, )
where l is the order of every PE from 1 to n/16 and p is an integer of 1, 2,…, n. If let
=∑
(
)
=∑
(
)
,
(5)
we have: =
Moreover, with we have:
=
∑
/
∑
/
.
=∑
/
=∑
/
,
, with 1≤i≤c
(6) (7)
After some transformations, we derive the algorithm into five steps. 1. Detect the input image; distribute the pixels into all processors; and initiate clus( ) ( ) ( ) , ,…, ). ter centers ( ) = ( of 16 pixels in every PE according to the 2. Compute all membership values formula (4). 3. Compute and from the expression (5) for every PE and then use both the torus interconnection network and the communication unit to transfer immediate results to the neighbor as well as to calculate numerator and denominator of from the expression (6). As a result, new center values are calculated from the formula (7). ( ) = || ( ) || or 4. Check the termination threshold. When ( ) ( ) {|| ||} , the algorithm is stopped. Otherwise, go to step 2. 5. Assign all pixels to the cluster centers. Pixels are mapped to set of V={v1, v2,…, vc} by using the maximum membership value of every pixel. For instance, = { , ,…, }. Finally, we have the segmented output image. These 5 steps are implemented on the specified SIMD array which consists of 4,096 PEs with several brain MRI images. The next section provides more details about the performance evaluation of our parallel implementation.
4 Performance Evaluation In this section, the performance of the parallel FCM implementation for several different brain MRI images is presented. In addition, the parallel approach is compared
A New Parallel Approach to Fuzzy Clustering for Medical Image Segmentation
1097
to the sequential FCM algorithms with the same brain image. In this experiment, all the parallel FCM implementations use the following parameters depending on the desired segmentation. To evaluate the performance of the proposed algorithm, we use a cycle accurate SIMD simulator. We develop the parallel FCM algorithm in their respective assembly languages for the SIMD processor array. In this study, the image size of 256 × 256 pixels is used. For a fixed 256 × 256 pixel system, the number of 4,096 PEs is used because each PE contains 4x4 pixels. Table 1 summarizes the parameters of the system configuration. Table 1. Modeled system parameters Parameter
Value
Number of PEs Pixels/PE Memory/PE [word] VLSI Technology Clock Frequency Interconnection Network intALU/intMUL/Barrel Shifter/intMACC/Comm
4,096 16 256 [32-bit word] 100 nm 150 MHz Torus 1/1/1/1/1
Figure 3 shows three original brain MRI images: Image 1, Image 2, and Image 3. They are segmented into different clusters with c=3, 4, 5. In addition, parameters for the number of clock cycles along with the execution time Ts are considered. For example, Image 1 with c=3 in Figure 3 (a) consumes 1,636,588 clock cycles, resulting in 0.01091 s (second) in execution time with 150MHz as shown in Table 3. After clustering, these three images are segmented into different parts such as bone, tissue, or brain. Depending on the specific images or requirements of doctors, the appropriate number of clusters is selected. Let consider Image 1 in Figure 3 (a). If a doctor prefers to segment a brain MRI image into two components such as bone and brain, either c=3 or c=4 is enough to meet the requirement. On the other hand, c=5 is necessary for Image 2 and Image 3 because the bone is separated into two types of bones such as skull and spine. In terms of execution time, all cases satisfy the realtime except for c=5 of Image 1 as illustrated in Table 3. The metrics of execution time and sustained throughput of each case form the basis of the study comparison, defined in Table 2. Table 2. Summary of evaluation metrics Execution time C t exec = f ck
Sustained throughput O ⋅ U ⋅ N PE Gops [ η E = exec ] sec t exec
where C is the cycle count, f ck is the clock frequency, Oexec is the number of executed operations, U is the system utilization, and NPE is the number of processing elements.
1098
H.V. Luong and J.M. Kim
Fig. 3. Original three brain MRI images with segmented output images with c=3, 4, 5 clusters: (a) The original images, (b) Image 1 after segmentation, (c) Image 2 after segmentation, (d) Image 3 after segmentation
A New Parallel Approach to Fuzzy Clustering for Medical Image Segmentation
1099
Table 3 summarizes the execution parameters for each image in the 4,096 PE system. Scalar instructions control the processor array. Vector instructions, performed on the processor array, execute the algorithm in parallel. System Utilization is calculated as the average number of active processing elements. The algorithm operates with System Utilization of 70% in average, resulting in high sustained throughput. Overall, our parallel implementation supports sufficient performance of real-time (30 frame/sec or 33 ms) and provides efficient processing for the FCM algorithm. Table 3. Algorithm performance on a 4,096 PE system running at 150 MHz
c=3
1,371,176
265,412
69.56
1,636,588
10.91
Sustained Throughput [Gops/sec] 302
Image 1 c=4
3,666,026
708,326
72.23
4,374,352
29.16
314
c=5
13,771,812
2,657,184
69.01
16,428,996
109.52
300
c=3
1,397,852
270,536
69.75
1,668,388
11.12
320
Image 2 c=4
4,671,151
902,015
70.88
5,573,166
37.15
325
c=5
3,513,288
678,804
70.72
4,192,092
27.95
324
c=3 Image 3 c=4 c=5
2,188,200
423,028
71.43
2,611,228
17.41
368
3,903,642 3,668,814
753,846 708,666
70.19 70.05
4,657,488 4,377,480
31.05 29.18
361 360
Medical Image
Vector Instruction
System Total Cycle Scalar Instruction Utilization [cycles] [%]
t exec [ms]
Table 4 shows the distribution of vector instructions for the parallel algorithm. Each bar divides the instructions into the arithmetic-logic-unit (ALU), memory (MEM), communication (COMM), PE activity control unit (MASK), and image loading (PIXEL). The ALU and MEM instructions are computation cycles while COMM and MASK instructions are necessary for data distribution and synchronization of the SIMD processor array. Results indicate that the proposed algorithm is dominated by ALU and MASK operations. Table 4. The distribution of vector instructions for the algorithm Instruction Distribution [%]
Image 1 C=3
C=4
Image 2 C=5
C=3
C=4
Image 3 C=5
C=3
C=4
C=5
ALU
63.9485 63.9661 63.9669 63.9475 63.9601 63.9578 63.9587 63.9562 63.9556
MEM
0.1125
0.0857 0.0773 0.1103 0.0859 0.0920 0.0883 0.0854 0.0881
COMM
0.2200
0.2152 0.2123 0.2213 0.2165 0.2134 0.2246 0.2195 0.2164
MASK
35.7142 35.7307 35.7428 35.7159 35.7356 35.7339 35.7253 35.7367 35.7371
PIXEL
0.0047
0.0022 0.0007 0.0046 0.0017 0.0028 0.0030 0.0021 0.0026
1100
H.V. Luong and J.M. Kim
In addition, we compare the performance of our parallel approach with other existing methods to verify the efficiency of the proposed approach. For all the experiments, the same original brain MRI image (128x128 pixels) is used. Figure 4 demonstrates that our approach is comparable to the sequential Otsu’s method and parallel FCM [12]. However, our approach outperforms these methods in terms of execution time. Table 5 shows the performance comparison of our parallel approach and conventional methods. Our approach achieves a speedup of five and third orders over the sequential implementation and the faster proposed FCM method, respectively.
(a)
(b)
(c)
(d)
Fig. 4. Original brain MRI image with segmented output images for different methods: (a) the original brain MRI image, (b) using Otsu’s method, (c) using faster proposed FCM method, (d) using our parallel proposed FCM method. Table 5. Performance comparison of three methods
Method Otsu’s FCM method Rahimi’s parallel FCM method Proposed parallel FCM method
Execution Time (seconds) 736 6 0.00706
These results demonstrate that the proposed parallel approach supports performance-hungry medical imaging and provides reliable and efficient processing for medical image segmentation.
5 Conclusion As recent advances in medical imaging demand more and more tremendous computational throughput, the need for high efficiency and high throughput processing is becoming an important challenge in the medical application domain. In this regard, this paper has presented a new parallel implementation of the well-known FCM algorithm for medical image segmentation in which the pixels are classified into the attributed regions based on the global information of gray level distribution and spatial connectivity. Experimental results using a representative data parallel architecture indicate that our parallel approach achieves a speedup of 1000x over the existing faster method, providing a sufficient performance of real-time (30 frames/sec or 33ms) and efficient processing on CT and MRI image segmentation.
Acknowledgements This work was supported by 2008 Research Fund of University of Ulsan, South Korea.
A New Parallel Approach to Fuzzy Clustering for Medical Image Segmentation
1101
References 1. Fu, K.S., Mu, J.K.: A Survey on Image Segmentation. Pattern Recognition 13, 3–16 (1983) 2. Sahoo, P.K., Soltani, S., Wong, A.K.C., Chen, Y.C.: A Survey of Thresholding Techniques. CVGIP 41, 233–260 (1988) 3. Panda, D.P., Rosenfeld, A.: Image Segmentation by Pixel Classification in (Gray Level, Edge Value) space. IEEE Transactions on Computers 22, 440–450 (1975) 4. Hall, L.O., Bensaid, A.M., Clarke, L.P., Velthuizen, R.P., Silbiger, M.S., Bezdek, J.C.: A Comparison of Neural Network and Fuzzy Clustering Techniques in Segmenting Magnetic Resonance Images of the Brain. IEEE Transactions on Neural Networks 3, 672–682 (1992) 5. Kim, Y., Rajala, S.A., Snyder, W.E.: Image Segmentation using an Annealed Hopfield Neural Network. In: Proc. RNNS/IEEE Symp. Neural Informatics and Neurocomputers, vol. 1, pp. 311–322 (1992) 6. Tabakov, M.: A Fuzzy Clustering Technique for Medical Image Segmentation. In: International Symposium on Evolving Fuzzy Systems, pp. 118–122 (2006) 7. Dunn, J.C.: A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics 3, 32–57 (1973) 8. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981) 9. Sahaphong, S., Hiransakolwong, N.: Unsupervised Image Segmentation Using Automated Fuzzy c-Means. In: 7th IEEE International Conference on Computer and Information Technology, pp. 690–694 (2007) 10. Kim, J., Wills, D.S., Wills, L.M.: Implementing and Evaluating Color-Aware Instruction Set for Low-Memory, Embedded Video Processing in Data Parallel Architectures. In: Yang, L.T., Amamiya, M., Liu, Z., Guo, M., Rammig, F.J. (eds.) EUC 2005. LNCS, vol. 3824, pp. 4–16. Springer, Heidelberg (2005) 11. Rahimi, S., Zargham, M., Thakre, A., Chhillar, D.: A parallel Fuzzy C-Mean algorithm for image segmentation. IEEE Annual Metting on Fuzzy Information 1, 234–237 (2004) 12. Wu, J., Li, J., Liu, J., Tian, J.: Infrared Image Segmentation via Fast Fuzzy C-Means with Spatial Information. In: IEEE International Conference on Robotics and Biomimetics, pp. 742–745 (2004)
Tracking Data Structures Coherency in Animated Ray Tracing: Kalman and Wiener Filters Approach Sajid Hussain and Håkan Grahn Blekinge Institute of Technology SE-371 79 Karlskrona, Sweden {sajid.hussain,hakan.grahn}@bth.se http://www.bth.se/tek/paarts
Abstract. The generation of natural and photorealistic images in computer graphics, normally make use of a well known method called ray tracing. Ray tracing is being adopted as a primary image rendering method in the research community for the last few years. With the advent of todays high speed processors, the method has received much attention over the last decade. Modern power of GPUs/CPUs and the accelerated data structures are behind the success of ray tracing algorithms. kd-tree is one of the most widely used data structures based on surface area heuristics (SAH). The major bottleneck in kd-tree construction is the time consumed to find optimum split locations. In this paper, we propose a prediction algorithm for animated ray tracing based on Kalman and Wiener filters. Both the algorithms successfully predict the split locations for the next consecutive frame in the animation sequence. Thus, giving good initial starting points for one dimensional search algorithms to find optimum split locations – in our case parabolic interpolation combined with golden section search. With our technique implemented, we have reduced the “running kd-tree construction” time by between 78% and 87% for dynamic scenes with 16.8K and 252K polygons respectively.
1 Introduction Ray tracing is one of the most widely used algorithms for interactive graphics applications and geometric processing. The performance of these algorithms is accelerated by using bounding volume hierarchies (BVH). BVHs are efficient data structures used for intersection tests or culling in computer graphics. Ray tracing algorithms compute and transverse BVHs in real time to perform intersection test. While ray tracing has evolved into a real time image synthesis technique in the last decade, more efficient hardware, effective acceleration structures and more advanced transversal algorithms have contributed to the increased performance. Among different acceleration structures [3][4], kd-trees have given better or at least comparable performance in terms of speed as compared to others [1]. These structures are more efficient if built using surface area heuristics (SAH) [2]. Interactive ray tracing demands fast construction of BVHs but an optimized fast construction of kd-tree is very expensive for large dynamic scenes. Although efforts are being made to optimize kd-tree construction for large dynamic scenes [5] [6] [7] G. Bebis et al. (Eds.): ISVC 2008, Part I, LNCS 5358, pp. 1102–1114, 2008. © Springer-Verlag Berlin Heidelberg 2008
Tracking Data Structures Coherency in Animated Ray Tracing
1103
[8] [9], there still lies a gulf between kd-tree construction and interactive large dynamic scene applications. In this paper, we present an approach to improve and optimize the construction of kd-trees for dynamic ray tracing. We are concerned about the decision of the separation plane location. In most of the dynamic scenes used in research, consecutive frames do not depict considerable differences in terms of geometry information. Our approach is to make use of this particular property for constructing kd-tree structures. We start with the approach used in [9] for static scenes, where parabolic interpolation is combined with golden section search to reduce the amount of work done when building the kd-trees. We further extend this approach for dynamic scenes and make use of the vector Kalman and Wiener filters to predict the split locations for the next consecutive frame. The same golden section search and parabolic interpolation is then used to find the minimum but this time the predicted location is used as a starting point, hence reducing the number of steps used to find the minimum of the parabolic cost function. The vector Kalman filter results are already presented in another IEEE paper. Here, we extend the work by implementing Wiener filter and comparing the results with that of the Kalman filter. We have evaluated our techniques against a standard SAH algorithm for dynamic scenes with varying complexities and behaviours. With our algorithms, we have achieved average kd-tree built times of 30msec for 16-17k triangles scene and 210msec for 252k triangles scene. This corresponds to a reduction of the kd-tree construction time by between 78% and 87%. The rest of the paper is organised as follows. Section 2 gives some related research work on kd-tree construction followed by the theory behind SAH based kd-trees in section 3. We describe the mathematics behind the Kalman and Wiener filters in section 4 along with our proposed technique in section 5. Section 6 gives our implementation results and some discussion. We conclude the paper in section 7 with future work.
2 Related Work kd-tree construction has mainly focused on optimized data structure generation for fast ray tracing. The state-of-the-art O ( n log n ) algorithm has been analysed in depth by [10] and [11]. Further in [12], the theoretical and practical aspects of ray tracing including kd-tree cost function modelling and experimental verifications have been described. Current work in [5] and [13] also aims at fast construction of kd-trees. By adaptive sub-sampling they approximate the SAH cost function by a piecewise quadratic function. There are many different other implementations of the kd-tree algorithm using SIMD instructions like in [14]. Another approach is used by [15], where the author experiments with stream kd-tree construction and explores the benefits of parallelized streaming. Both [5] and [15] demonstrate considerable improvements as compared to conventional SAH based kd-tree construction. The cost function to optimally determine the depth of the subdivision in kd-tree construction has been given by several authors. In [16], the authors derive an expression that confirms that the time complexity is less dependent on the number of objects and more on the size of the objects. They calculate the probability that the ray intersects an object as a function of the total area of the subdivision cells that (partly)
1104
S. Hussain and H. Grahn
contain the object. In [2], the authors use a similar strategy but refine the method to avoid double intersection tests of the same ray with the same object. They determine the probability that a ray intersects at least one leaf cell from the set of leaves within which a particular object resides. They use a cost function to find the optimal cutting planes for a kd-tree construction. A similar method was also implemented in [17]. Recently, kd-tree acceleration structures for modern graphics hardware have been proposed in [8] and [18], where the authors experiment kd-tree for GPU ray tracers and achieve considerable improvement.
3 SAH Based kd-Tree Construction In this section, we give some background about the kd-tree algorithm, which will be the foundation for the rest of the paper. Consider a set of points in a space Rd, the kdtree is normally built over these points. In general, kd-trees are used as a starting point for optimized initialization of k-means clustering [19] and nearest neighbour query problems [20]. In computer graphics, and especially in ray tracing applications, kdtrees are applied over a scene S with bounding boxes of scene objects. The kd-tree algorithm subdivides the scene space recursively. For any given leaf node Lnode of the kd-tree, a splitting plane splits the bounding box of the node into two halves, resulting in two bounding boxes, left and right. These are called child nodes and the process is repeated until a certain criterion is met. In [1], the author reports that the adaptability of the kd-tree towards the scene complexity can be influenced by choosing the best position of the splitting plane. The choice of the splitting plane is normally the mid way along a particular coordinate axis [21] and a particular cost function is minimized. In [2], SAH is introduced for the kd-tree construction algorithm which works on probabilities and minimizes a certain cost function. The cost function is built by firing an arbitrary ray through the kd-tree and applying some assumptions. Fig. 1 uses the conditional probability P(y|x) that an arbitrary fired ray hits the region y inside region x provided that it has already touched the region x. Bayes rule can be used to calculate the conditional probability P(y|x) as P( y | x ) =
P ( x | y ) P( y ) . P( x )
(1)
P(x|y) is the conditional probability that the ray hits the region x provided that it has intersected y, and here P(x|y) = 1. P(x) and P(y) can be expressed in terms of areas [1]. In Fig. 2, if we start from the root node or the parent node and assume that N is a set of all elements in the root node and the ray passing the root node has to be tested for intersection with all the elements in N. If we assume that the computational time it takes to test the ray intersection with elements n ⊆ N is Tn, then the overall computational cost C of the root node would be N
C = ∑ Tn . n =1
(2)
Tracking Data Structures Coherency in Animated Ray Tracing
y
1105
Ray x
Fig. 1. Visualization of conditional probability P(y|x)
After further division of root node (Fig. 2), the ray intersection test cost for each left and right child nodes changes to CLeft and CRight. Thus the overall new cost becomes CTotal and CTotal = CTrans + C Right + C Left , (3) where CTrans is the cost of traversing the parent or root node. The equation can be written as N Left
N Right
i =1
j =1
CTotal = CTrans + PLeft ∑ Ti + PRight
∑T
j
,
(4)
where PLeft =
ALeft
and PRight =
ARight
. (5) A A Where A is the surface area of the root node and the area of two child nodes are ALeft and ARight. PLeft and PRight are the probabilities of a ray hitting the left and the right child nodes. NLeft and NRight are the number of objects present in the two nodes and Ti and Tj are the computational time for testing ray intersection with the ith and jth objects of the two child nodes. The kd-tree algorithm minimizes the cost function Ctotal, and then subdivides the child nodes recursively.
Fig. 2. Scene division and corresponding kd-tree nodes
As shown in [15], the cost function is a bounded variation function as it is the difference of two monotonically varying functions CLeft and CRight. In [15], this important property of the cost function has been exploited to increase the approximation accuracy of the cost function and only those regions that can contain the minimum have been adaptively sampled. We have used the technique in [9] called golden section search to find out the region that could contain the minimum and combined it with parabolic interpolation to search for the minimum. Further, we predict the minimum of the cost function (split locations) for next consecutive frame using the Kalman filter. We use predicted split locations as starting points for kd-tree construction over consecutive frames. In next section, we present some mathematics behind the Kalman and Wiener filter.
1106
S. Hussain and H. Grahn
4 The Kalman and Wiener Filters The Kalman filter is named after its inventor Rudolf Emil Kálmán in 1960 [22]. The Kalman filter presents a recursive approach to discrete data linear filtering and prediction problems. The filter estimates the state of underlying discrete time controlled process x ∈ℜ n which is presented by the following difference equation.
xk = Axk −1 + Buk −1 + wk −1 ,
(6)
with the measurement or observation z ∈ℜ m and presented by
zk = Hxk + vk .
(7)
The random variables wk and vk are process and measurement noises respectively, and assumed to be independent, white and normally distributed with zero mean and covariance matrices Q and R .
p ( w ) ≈ N ( 0, Q ) , p ( v ) ≈ N ( 0, R ) .
(8)
Matrix A in equation 7 is an n × n matrix and it represents the state relationship from previous time step k − 1 to current time step k . The n × 1 matrix B relates the optional control input u to the state x . The m × n matrix H in equation 7 relates the state xk to the measurement zk . More detailed introduction about the Kalman filter could be found in [24], we will just describe some basic steps of the filter. The Kalman filter has two main steps called time update (prediction) and measurement update (correction). The prediction state projects the current state estimate ahead in time and the correction state adjusts the projected estimate by an actual measurement at that time. The filter prediction and update steps are described as follows (the details could be found in [23] along with the derivation). Time update: ∧−
∧
Prediction : x k = A x k −1 + Buk −1 , Error Covariance Projection : Pk − = APk −1 AT + Q.
(9)
Measurement update:
Kalman Gain : K k = Pk H T ( HPk H T− + R ) , ∧ ∧ ∧− ⎞ ⎛ State Ahead Correction : x k = x + K k ⎜ zk − H x ⎟ , Error Covariance Correction : Pk = ( I ⎝− K k H ) Pk−⎠. −1
(10)
In signal processing, the class of linear optimum discrete time filters is collectively known as Wiener filters. The Wiener filter is the filter first proposed by Norbert Wiener in 1949 [25]. Based on statistical approach the goal of the Wiener filters is to filter out noise that has corrupted the signal. Consider u(t ) to be a signal input to the Wiener filter, corrupted by additive noise v(t ) . The estimated output y (t ) is calculated by means of a filter w(t ) using the following convolution equation
Tracking Data Structures Coherency in Animated Ray Tracing
y (t ) = w(t ) *(u (t ) + v (t )) We define the error e(t ) as
1107
(11)
e( t ) = u ( t + α ) − y ( t ) ,
(12)
and the squared error is e 2 (t ) . Depending on the value of α the problem can be formulated as prediction ( α > 0 ), filtering ( α = 0 ) and smoothing ( α < 0 ). For discrete time series, consider the block diagram in Fig.3 built around a linear discrete time filter. The filter input consists of a time series and the filter is itself characterized by the impulse response w( n ) . At some discrete time n , the filter produces output denoted by y (n ) . This output is used to provide an estimate of desired response denoted by d (n ) . With the filter input and desired response representing single realizations of respective stochastic processes, the estimation is ordinarily accompanied by an error with statistical characteristics of its own. In particular, the estimation error denoted by e( n ) is defined as the difference between the desired response d (n ) and the output y (n ) . e( n ) = d ( n ) − y ( n )
(13)
The requirement is to make the estimation error e( n ) as small as possible in some statistical sense. Two assumptions about the filter are made for simplicity. It is linear and operates in discrete time, which makes the mathematical analysis simple and the implementation using digital hardware and software. The impulse response of the filter could be finite or infinite and there are different types of statistical criterion used for the optimization. We use here, Finite Impulse Response (FIR) filter and mean square value of the estimation error. We thus, define the cost function as the mean square error.
J = E ⎡⎣ e( n )e* ( n ) ⎤⎦ = E ⎡⎣| e( n ) |2 ⎤⎦ .
(14)
Where E denotes the statistical expectation operator. The requirement is therefore to determine the operating conditions under which J obtains its minimum value. Further detail reading about the Wiener filter could be found in [26].
Adaptive Mechanism u(n)
N
G ( z ) = ∑ wi z − i i =0
y (n)
e( n )
_ + d (n )
Fig. 3. Block diagram of adaptive control of statistical filtering problem
1108
S. Hussain and H. Grahn
We have used an adaptive mechanism Least-Means-Square (LMS) to adaptively update the weights of the Wiener filter taps. The order of the Wiener filter increases as we receive more and more samples (samples in this case are the split locations determined for the kd-tree construction). We use one Wiener filter with adaptive weights update mechanism for each node in the kd-tree of a frame. Fig. 3 also shows the block diagram of adaptive weight control mechanism combined with statistical filtering problem. The Wiener filter weights are updated by the adaptive mechanism and the following equations govern the adaptive update of the Wiener filter weights. y (n ) =
M −1
∑ w ( n )u ( n − i ) ,
(15)
i
i =0
wi (n + 1) = wi (n ) + μ u( n − i )e(n )
i = 0,1,....M − 1 .
(16)
Where e( n ) is the error calculated from equation 13 and μ is the step size normally between 0 and 1.
5 Fast Construction of kd-Trees We combine golden section search and parabolic interpolation [9] with the Kalman and Wiener filters to construct kd-trees for animated ray tracing. We take advantage of the fact that adjacent frames in animated ray tracing do not depict a dramatic change in most of the animated scenes (we are talking about scenes normally used by the research community in computer graphics). The algorithm we present here is simple to describe. In our algorithm, we start with the technique described in [9] and construct the kd-tree for the first frame of an ani d (n )
y (n )
u(n )
d (n )
u(n )
Wiener
Wiener
u (n − 1) e( n )
Adaptive Mechanism
Fig. 4. Wiener filter adaptive prediction
y (n )
Tracking Data Structures Coherency in Animated Ray Tracing
1109
mated scene. Thus, we manage to acquire one sample (sample in our case is the split plane location for a node of the kd-tree) for our prediction filters. We then use the Kalman and Wiener filters to predict split plane locations for the next consecutive frame. The algorithm also keeps track of kd-tree depth and split orientations. We then apply golden section combined with parabolic interpolation for a one dimensional search of split plane but this time the starting point for the one dimensional search is the predicted split location. This gives us a very fast convergence towards the cost function optimum. Fig. 5 gives an example how the prediction step works for Wiener filters. The d (n ) in Fig.5 is the desired response or in other words is the actual split plane location for a particular axis. We require memory to store the tap values of the Wiener filter wi ( n ) and the past inputs u (n − i ) in equation 15. The Wiener adaptive filter we have implemented here is an order of 6 (no. of taps) and the adaptive mechanism with μ = 0.5 . The beauty of the Kalman filter is that we do not need any previous history to predict the future. This gives us memory free prediction compared to Wiener filter. What we do need is the memory to store the predicted values in the prediction step of the Kalman filter. The same memory locations are updated during the update step. Note that in the Kalman filter step, we use initial split positions information from the first frame and add measurement noise to construct virtual next observations. We then predict the actual split positions for the next frame based on these virtual next observations. In the last step, we update the information for the Kalman filter parameters based on actual split locations information returned by the one dimensional search Algorithm 1. In Wiener filter prediction step, we do not need the virtual information and we predict next split plane position based on previous values. We use the predicted position as a staring point for one dimensional search algorithm (golden section search combined with parabolic interpolation [9]) and refine our prediction results. The refined split plane position is then used as a desired response d (n ) and the error e( n ) is then calculated for further use in the adaptive weights update mechanism (Fig.5 shows the process visually). The OptSplit function in Algorithm 1 takes KalmanStruct/WienerStruct which includes all the information about Kalman/Wiener Orientation (KalmanOrient/WienerOrient), Kalman/Wiener Depth (KalmanDepth/WienerDepth) and Kalman/Wiener predicted optimum split location (KalmanStartPt/WienerStartPt). The algorithm itself finds the optimal split orientation for a given depth information and compares it with that of the Kalman/Wiener Orientation. If the two orientations match, the algorithm uses the start point as predicted by the Kalman/Wiener filter. Otherwise, it starts looking for an optimum from extreme positions. Kalman/Wiener Orientation (KalmanOrient/WienerOrient), Kalman/Wiener Depth (KalmanDepth/WienerDepth) are the two vectors which store the orientation and corresponding depth information from the previous consecutive frame. The objective behind the orientation match is to track the requirements for orientation change because of the dynamic scene. If there is no match, we update the Klaman/Wiener filter parameters with the new orientation and apply the one dimensional search algorithm starting from extreme boundaries of the particular bounding box.
1110
S. Hussain and H. Grahn
Algorithm 1 – Optimum Split Search function OptSplit(Polygons, AABB, KalmanStruct/WienerStruct) Orient = OptSplitOrient(Polygons); Extreme = FindExtremes(Polygons, Orient); if (Orient = KalmanOrient/WeinOrient and Depth = KalmanDepth/WienerDepth) Optimum = OptSearch(Polygons, AABB, KalmanOri ent/WienerOrient, KalmanStartPt/WienerStartPt); else Optimum = OptSearch(Polygons, AABB, Orient); return Optimum; end function
6 Results and Discussion We have tested our algorithm on a variety of animation sequences as shown in Fig. 6. The scenes differ with triangular count and animation behaviour. The scenes consist of regular sized and uniformly distributed triangles. We ran our kd-tree construction algorithm and recorded the Kalman and Wiener filters prediction accuracy and the time our algorithm took to build kd-tree for each frame in the sequence. We have chosen MATLAB® and C++ for implementation of our algorithm. We have implemented the Kalman and Wiener filters prediction routine in MATLAB® and kd-tree construction routine in C++. The kd-tree construction is linked in MATLAB® through Dynamic Link Library (DLL). Routines for PLY file reading are also implemented in MATLAB®. The timing results shown in this paper are only for kd-tree construction in DLL. We have performed all the simulations on a workstation with an Intel Core2 CPU, 2.16 GHz processor and 2GB of RAM. The scenes we have used in our simulations vary in terms of their complexities and behaviors. Fig. 6 shows three different animation sequences. In Fig. 7 we analyze cost functions change in cloth-ball (92.2K – 73 Frames) animations for the two axis (y and z) as shown in Fig. 6 (cloth-ball animation) for each scene and for only root node split positions. We also plot actual split positions and predicted split positions of the Kalman and Wiener filters. Note that the actual split positions have been calculated on bases of Surface Area Heuristics (SAH). Let’s analyse Fig. 7 closely, the upper two sub-figures in Fig.7. (left to right) show cost function shift for each frame in cloth-ball animation sequence for y and z axis respectively. The minimums of these parabolic cost functions are the optimum split plan locations. The bottom sub-figure in Fig. 7 shows the actual optimum split plan locations change over time and predicted split plan locations by the Kalman and Wiener filters (time axis is no. of frames in this case) for only y axis of cloth-ball animation. Note the settling time for Wiener filter in this case. The adaptive mechanism controls the tap weights of the Wiener filter and tries to minimize the mean square error between desired and predicted response. Our algorithm has successfully predicted the split plan locations for consecutive frames. Hence, provides good initial guess for one dimensional search (parabolic interpolation combined with golden section search). If we closely analyse the prediction curves, we see that in almost all the cases, the prediction error is very small. Since, the entire scene dataset exhibits a strong coherency between consecutive frames; we have successfully exploited this property here.
Tracking Data Structures Coherency in Animated Ray Tracing
1111
Table 1. Conventional vs Modified kd-Tree Build Time Modified kd-Tree Build (msec) Scene
Primitives
Horse Animation Elephant Cloth-Ball Bunny Dragon
16.8K
Conventional kd-Tree Build (msec) 135
84.6K 92.2K 252.5K
710 802 1610
Initial Build 135 710 802 1610
Running Build Kalman Wiener 30 28 105 120 210
95 105 195
Time Reduced Kalman/ Wiener 78% / 79% 85% / 86% 85% / 87% 87% / 88%
Fig. 6. Animation sequences (top to bottom): Horse (16.8K - 48 Frames), BunnyDragon (252.5K - 16 Frames) and ClothBall (92.2K - 73 Frames)
Although, in all these scenes, there remain a constant number of polygons (triangles) throughout a particular animation sequence, we see a random behaviour in the kd-tree build time especially for Kalman filter. The phenomenon occurs due to random noise added by the Kalman filter prediction steps, where we have constructed the virtual observations by adding the measurement noise from equation 8. The added noise maximum error difference is not greater than 1msec. Table 1 shows the time difference between an initial build and a running build of kdtree data structures for each animation sequence used in this paper. See the considerable improvement in the build time in running mode for both Kalman and Wiener filters. We have not yet added the overhead of the Kalman and Wiener filters in Table 1. In MATLAB®, the Kalman filter’s average aggregated overhead is approx. 400-450 msec and that of Wiener filter is approx. 500-550 msec for the whole sequence of 50 frames with average of 84K polygons (triangles) in each frame. We expect this time down to 100150 msec for Kalman filter and 150-200 msec for Wiener filter if efficiently implemented in C++. So, in worst case we could add 4-5 msec per frame.
1112
S. Hussain and H. Grahn
Fig. 7. Actual and predicted split positions for Kalman and Wiener filters
7 Conclusion and Future Work We have presented an algorithmic speedup technique for fast kd-tree construction for animated ray tracing. The optimum split location search for kd-tree construction is the main time consuming job. As many of the animation sequences used by research community for animated ray tracing exhibit strong data structures coherency properties, we have made use of a Kalman and Wiener filters for predicting the next possible data structure (kd-tree in our case) state of the animated sequence. We use here the Kalman and Wiener filters and load it with initial split plan locations (we build kd-tree for starting frame in the sequence based on the technique described in [9]). The filters then predict the next possible split locations for the next frame in the sequence. We use these predicted locations as starting points for the one dimensional optimum search algorithm. With best initial guess, the algorithm exhibits very fast convergence and we see the results quite promising for the running kd-tree build time as compared to static or initial kd-tree build. In the case of Kalman filter we achieve 78% to 87% increase in kd-tree construction time for the scenes with as low as 17K and as high as 252K polygons. The increase in kd-tree construction time for Wiener filter is 79% to 88% for scenes with same complexities.
We have implemented our proposed model in MATLAB® and C++. The main prediction engine for the Kalman and Wiener filters is implemented in MATLAB®, while C++ handles the kd-tree construction routines. We have demonstrated a considerable decrease in build time compared to a standard SAH-based kd-tree build.
References [1] Havran, V.: Heuristic Ray Shooting Algorithms. PhD thesis, Faculty of Electrical Engineering, Czech Technical University in Prague (2001) [2] MacDonald, J.D., Booth, K.S.: Heuristics for Ray Tracing Using Space Subdivision. In: Graphics Interface Proceedings 1989, Wellesley, MA, USA, June 1989, pp. 152–163. A.K. Peters, Ltd. (1989) [3] Stoll, G.: Part I: Introduction to Realtime Ray Tracing. In: SIGGRAPH 2005 Course on Interactive Ray Tracing (2005) [4] Zara, J.: Speeding Up Ray Tracing - SW and HW Approaches. In: Proceedings of 11th Spring Conference on Computer Graphics (SSCG 1995), Bratislava, Slovakia, pp. 1–16 (May 1995) [5] Hunt, W., Stoll, G., Mark, W.: Fast kd-tree Construction With An Adaptive ErrorBounded Heuristic. In: Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing, pp. 81–88 (September 2006) [6] Wald, I., Havran, V.: On Building Fast kd-trees For Ray Tracing, and on Doing That In O(N log N). In: Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing, pp. 61–69 (September 2006) [7] Woop, S., Marmitt, G., Slusallek, P.: B-kd trees for Hardware Accelerated Ray Tracing of Dynamic Scenes. In: Proceedings of Graphics Hardware (2006) [8] Foley, T., Sugerman, J.: kd-tree Acceleration Structures For A GPU Raytracer. In: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pp. 15–22 (2005) [9] Hussain, S., Grahn, H.: Fast kd-Tree Construction for 3D-Rendering Algorithms like Ray Tracing. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Paragios, N., Tanveer, S.-M., Ju, T., Liu, Z., Coquillart, S., Cruz-Neira, C., Müller, T., Malzbender, T. (eds.) ISVC 2007, Part II. LNCS, vol. 4842, pp. 681–690. Springer, Heidelberg (2007) [10] Wald, I.: Realtime Ray Tracing and Interactive Global Illumination. PhD thesis, Computer Graphics Group, Saarland University, Saarbrucken, Germany (2004) [11] Havran, V.: Heuristic Ray Shooting Algorithm. PhD thesis, Czech Technical University, Prague (2001) [12] Chang, A.Y.: Theoretical and Experimental Aspects of Ray Shooting. PhD Thesis, Polytechnic University, New York (May 2004) [13] Havran, V., Herzog, R., Seidel, H.-P.: On Fast Construction of Spatial Hierarchies for Ray Tracing. In: Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing, pp. 71–80 (September 2006) [14] Benthin, C.: Realtime Raytracing on Current CPU Architectures. PhD thesis, Saarland University (2006) [15] Popov, S., Gunther, J., Seidel, H.-P., Slusallek, P.: Experiences with Streaming Construction of SAH KD-Trees. In: Proceedings of IEEE Symposium on Interactive Ray Tracing, pp. 89–94 (September 2006) [16] Cleary, J.G., Wyvill, G.: Analysis Of An Algorithm For Fast Ray Tracing Using Uniform Space Subdivision. The Visual Computer (4), 65–83 (1988)
[17] Whang, K.-Y., Song, J.-W., Chang, J.-W., Kim, J.-Y., Cho, W.-S., Park, C.-M., Song, I.-Y.: An Adaptive Octree for Efficient Ray Tracing. IEEE Transactions on Visualization and Computer Graphics 1(4), 343–349 (1995) [18] Horn, D.R., Sugerman, J., Houston, M., Hanrahan, P.: Interactive kd-tree GPU Raytracing. In: Symposium on Interactive 3D Graphics. I3D, pp. 167–174 (2007) [19] Redmonds, S.J., Heneghan, C.: A Method for Initializing the K-Means Clustering Algorithm Using kd-trees. Pattern Recognition Letters 28(8), 965–973 (2007) [20] Stern, H.: Nearest Neighbor Matching Using kd-Trees. PhD thesis, Dalhousie University, Halifax, Nova Scotia (August 2002) [21] Kaplan, M.: The Use of Spatial Coherence in Ray Tracing. In: ACM SIGGRAPH 1985 Course Notes, vol. 11, pp. 22–26 (July 1985) [22] Kalman, R.E.: A New Approach to Linear Filtering and Prediction Problems. Transaction of the ASME, Journal of Basic Engineering, 35–45 (March 1960) [23] Welch, G., Bishop, G.: An Introduction to Kalman Filter. Department of Computer Science, University of North Carolina (July 2006) [24] Grewal, M.S., Andrews, A.P.: Kalman Filtering, Theory and Practice. Prentice Hall, Englewood Cliffs (1993) [25] Wiener, N.: Extrapolation, Interpolation, and Smoothing of Stationary Time Series. Wiley, New York (1949) [26] Haykin, S.: Adaptive Filter Theory, 3rd edn. Prentice Hall, New Jersey (1996)
Hardware Accelerated Per-Texel Ambient Occlusion Mapping Tim McGraw and Brian Sowers Department of Computer Science and Electrical Engineering, West Virginia University
Abstract. Ambient occlusion models the appearance of objects under indirect illumination. This effect can be combined with local lighting models to improve the real-time rendering of surfaces. We present a hardware-accelerated approach to precomputing ambient occlusion maps which can be applied at runtime using conventional texture mapping. These maps represent mesh self-occlusion computed on a per-texel basis. Our approach is to transform the computation into an image histogram problem, and to use point primitives to achieve memory scatter when accumulating the histogram. Results are presented for multiple meshes and computation time is compared with a popular alternative GPU-based technique.
1 Introduction Ambient occlusion is a visual effect that can be used in computer graphics to improve the realism of simple lighting models. Local lighting models, such as the Phong lighting model [1] take into account the local surface geometry and the relative positions of light sources and the viewer, but neglect effects such as self-shadowing and occlusion. Ambient occlusion is a view-independent, indirect lighting effect, so for rigid objects it can be precomputed. The values can be computed and stored per-vertex, per-triangle or per-texel. Per-vertex and per-triangle approaches may suffer from undersampling artifacts in areas of coarse triangulation and from long computation time for large meshes. After offline computation the ambient occlusion map can be used in real-time applications [2] with very little performance penalty. For example, the ambient occlusion factor can be incorporated in the Phong model by using it to modulate the constant ambient material color. In this paper we will describe a hardware accelerated technique for precomputing ambient occlusion maps on a per-texel basis, demonstrate the effects of ambient occlusion on synthetic meshes, and compare computation time for our approach with another GPU-based implementation.
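As a small illustration of this use, the following C++ sketch modulates the ambient term of a Phong-style model by the stored accessibility (1 − AO). The structures and names are our assumptions, not code from the paper.

struct Color { float r, g, b; };

// Modulate the constant ambient material color by the accessibility
// (1 - AO) looked up from the precomputed map; the diffuse and specular
// terms come from the usual local lighting model (illustrative sketch).
Color shade(const Color& ambient, const Color& diffuseSpecular, float ao) {
    float access = 1.0f - ao;          // 0 = fully occluded, 1 = fully open
    return { access * ambient.r + diffuseSpecular.r,
             access * ambient.g + diffuseSpecular.g,
             access * ambient.b + diffuseSpecular.b };
}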
2 Background
Ambient occlusion [3] was suggested as a way of giving the appearance of global illumination [4] at a fraction of the computational cost. It quantifies the fraction of the hemisphere of ambient illumination which cannot reach the surface. This value can be reduced by concavities and shadowing. Ambient occlusion (AO) is formulated as

AO(x) = 1 − (1/π) ∫_{ω∈Ω} V(x, ω)(ω · n) dω    (1)
Fig. 1. Ambient occlusion at points p and q. Point p is unoccluded (AO(p) = 0) and point q is partially occluded (AO(q) > 0).
where x is a point on the surface, ω is a light direction, V(x, ω) is a visibility function which has value 0 when x is not visible from direction ω and value 1 otherwise, and Ω is the hemisphere with ω · n > 0. The values of AO(x) range from 0 for unoccluded points to 1 for completely occluded points, as illustrated in Figure (1). The idea of ambient occlusion has its roots in the more general concept of "obscurances" [5]. Obscurance values depend on distance to occluding objects and can be used to incorporate color bleeding effects. Precomputed radiance transfer [6] models more general light transport, including AO. Hardware approaches to AO computation have included per-vertex [7] and per-triangle techniques using shadow maps [8], per-vertex techniques using occlusion queries [9], depth peeling [10], and using the fragment shader to compute local screen-space ambient occlusion based on neighborhood depth values [11],[12]. Figure (2) shows an example of how ambiguity between convex and concave regions can be resolved using AO. In general, these cases may be disambiguated by knowing the light direction or by moving the camera. By using AO the darker concave region is easily distinguished from the brighter convex region. Surface darkening in AO can also be due to proximity of surfaces. The "contact shadows" provided by AO, as in Figure (3), can be a useful cue to suggest that two surfaces are touching, or nearly so. While shadowing techniques (shadow volumes or shadow mapping) can resolve convexity and proximity ambiguities, their appearance depends on the position of light sources in the scene, so they cannot be precomputed for dynamic scenes as AO can.
Fig. 2. Convex/concave ambiguity resolved with AO. From left to right : Phong lighting, AO only, AO + Phong, perspective view of Phong + AO.
Fig. 3. Proximity cues from AO. Phong lighting (left), AO only (right).
3 Implementation
A brute-force approach to computing AO is to discretize the hemisphere of ambient light directions and form rays from the surface in each of these directions, then perform intersection tests between the rays and occluding objects. This is an example of what is commonly referred to as the inside-out approach. The other approach is outside-in: considering each irradiance direction and querying which surface points have unoccluded accessibility to this light direction. We present an outside-in AO algorithm implemented in the OpenGL Shading Language (GLSL) [13]. The algorithm entails multiple renders to texture of the mesh with hidden surface removal provided by z-buffering. The approach we present to AO calculation does not require building a spatial data structure, such as an octree or k-d tree, which can be used to accelerate methods based on intersection queries. Our algorithm does require that mesh vertices have associated texture coordinates. This can be achieved by various mesh parameterization algorithms [14], [15] or texture atlas generation [16]. Since we are storing per-texel ambient occlusion, and we cannot assume that a texel is infinitesimally small, we define AO for a surface patch, S, as the average AO over the patch

AO(S) = 1 − [ (1/π) ∫_{x∈S} ∫_{ω∈Ω} V(x, ω)(ω · n) dω dS ] / [ ∫_{x∈S} dS ].    (2)

Letting S be the patch covered by texel R, we can write AO in terms of the texture coordinates (u, v) as

AO(R) = 1 − [ (1/π) ∫_{(u,v)∈R} ∫_{ω∈Ω} V(x(u, v), ω)(ω · n) |∂x/∂u × ∂x/∂v| du dv dω ] / [ ∫_{(u,v)∈R} |∂x/∂u × ∂x/∂v| du dv ]    (3)
where |∂x/∂u × ∂x/∂v| du dv is the surface area element of the parametric surface x(u, v). Since the texture mapping functions u(x), v(x) are linear over a triangle, the inverse mapping - the surface parameterization x(u, v) - is also linear over a triangle. So within a triangle |∂x/∂u × ∂x/∂v| is a constant. Since the goal of most mesh parameterization and texture atlas generation algorithms is to minimize stretch, we will assume that the stretch is constant over a texel. Note that we have already shown that stretch is constant for texels entirely within a triangle. We are not assuming that stretch is constant over the entire surface,
only that it changes slowly enough that we can assume that it is constant over each texel. We can then factor |∂x/∂u × ∂x/∂v| out of the numerator and denominator and observe that

|∂x/∂u × ∂x/∂v| / ( |∂x/∂u × ∂x/∂v| ∫_{(u,v)∈R} du dv ) = 1    (4)
for a single texel, R. We can then write Equation (3) as

AO(R) = 1 − (1/π) ∫_{(u,v)∈R} ∫_{ω∈Ω} V(x(u, v), ω)(ω · n) du dv dω.    (5)
Now we will rewrite the formulation in terms of light space coordinates. This coordinate system contains the image plane for each of the rasterized images of the mesh. Let x_ω, y_ω be the coordinates in this plane, where the subscripts denote the dependence on the light direction ω. Let P be the patch in the image plane covered by the image of texel R. Let the Jacobian of the transformation from texture space to the image plane be given by |J| = |∂(u, v)/∂(x_ω, y_ω)|. Then we can rewrite Equation (5) as

AO(P) = 1 − (1/π) ∫_{(x_ω,y_ω)∈P} ∫_{ω∈Ω} V(x_ω, y_ω)(ω · n) |J| dx_ω dy_ω dω.    (6)
Equation (6) will be solved by discretizing the hemisphere Ω. The view vectors for each image are randomly generated. Uniformly distributed unit vectors can be produced by drawing each component from a zero-mean normal distribution and normalizing the vector [17]. The image domain will be discretized by the rasterization process. Z-buffering will eliminate pixels from the surface with zero visibility, but the function V will still be used to discriminate mesh pixels from background by assuming V = 0 for the background. An overview of the algorithm implemented on the vertex processor (VP) and fragment processor (FP) is as follows:
Stage 1: Render from light direction, ω. In VP, transform mesh vertices into light coordinates. In FP, compute (ω · n)|J| using interpolated normals. Set the output fragment color as [u(x, y), v(x, y), 1.0, (ω · n)|J|] where u, v are the quantized texture coordinates (as shown in Figure (4)).
Stage 2: Sum over pixels. Render point primitives with additive alpha blending. In VP, use vertex-texture-fetch to read the output from stage 1. Set point position to [u(x, y), v(x, y)] and color to (ω · n)|J|. Accumulate AO by repeating stages 1 and 2 for each light direction.
Stage 3: Postprocessing. Fix texture atlas seams and normalize the map in FP.
Our approach to ambient occlusion computation is similar to the computation of image histograms. To compute AO we count the number of times each texture coordinate appears in the rendered images and scale this count by the stretch correction factor |J| and the cosine of the angle of incidence ω · n. The image histogram can be defined as

H(u, v) = Σ_x Σ_y δ(u − u(x, y), v − v(x, y))    (7)
Fig. 4. (Left) Output of stage 1 for 4 random light directions: red channel = u(x,y), green channel = v(x,y), blue channel = 1. All channels = 0 in background. (Right) Output of stage 2.
where δ is the 2D discrete Dirac delta function defined as δ(x, y) = 1 for x = y = 0 and δ(x, y) = 0 otherwise. Our hardware approach to computing the histogram is to render the image to a vertex buffer, and in the fragment program set the point location to the appropriate histogram bin location. By rendering the points with additive alpha blending, the result in the framebuffer is an image of the histogram. Likewise, the ambient occlusion computation is a summation of the visible surface elements over a set of images acquired from multiple viewing directions. For per-texel ambient occlusion computed from the visible (u, v) images we have

AO(u, v) = 1 − C Σ_d Σ_x Σ_y δ(u_d(x, y) − u, v_d(x, y) − v) |J(x, y)| (ω_d · n_d(x, y))    (8)
Comparing Equation (8) to Equation (6) we see that the continuous integral over the hemisphere has been replaced by a discrete summation over directions. The summation over the light-space coordinates x and y is due to the fact that a single texel may cover more than one pixel. The delta function has replaced the visibility function. Enabling back face culling, or constraining the input mesh to be closed, will take care of the case when the surface element faces away from the light, so that we do not accumulate values for the inside of the surface. In practice we do not explicitly compute the constant C. We simply assume that some point on the surface will be unoccluded and normalize the map so that the maximum value of 1 − AO(u, v) is one. Output from stage 1 is rendered to floating-point framebuffer-attached textures. In the second stage we render this same data as point primitives. This can be achieved by copying the attached textures as OpenGL pixel buffer objects, or by texture fetch in the vertex shader in stage 2. If desired, the "bent normal" [3] (the average unoccluded direction) can also be computed and accumulated in the first 2 stages of the algorithm.
3.1 Stretch Correction
The surface area covered by a texel may not be constant over the mesh, as shown in Figure (5). Note that this does not invalidate our earlier assumption that stretch is constant over each texel. The degree of stretch can be quantified by the determinant of the Jacobian matrix of the mapping function from light coordinates (x, y) to texture coordinates (u, v).
Fig. 5. Stretch correction with Jacobian determinant. Checkerboard texture illustrating stretch in the texture mapping (left) at the poles of the sphere, AO map computed without stretch correction (middle) and AO map computed with stretch correction (right).
J = det | ∂u/∂x  ∂u/∂y |
        | ∂v/∂x  ∂v/∂y |    (9)
If the stretch of the texture mapping is approximately constant then the Jacobian computation is not necessary, as it can be absorbed into the constant C. The derivatives of varying quantities which are interpolated over rendered primitives can be queried in many shading languages. In OpenGL the dFdx(...) and dFdy(...) commands provide this functionality. The argument to the function is an expression whose derivative is to be computed. Typically these commands are used in mipmapping to determine the appropriate texture level-of-detail. The equivalent commands are ddx(...) and ddy(...) respectively in the DirectX HLSL [18] and in NVidia's Cg language [19]. In stage 1 the GLSL fragment shader computes the stretch factor by

vec2 dx = dFdx(gl_TexCoord[0].st);
vec2 dy = dFdy(gl_TexCoord[0].st);
float J = abs(dx.x * dy.y - dx.y * dy.x);
Later, in stage 2, when additive alpha blending is used for summation, the fragment alpha value is multiplied by J, effectively implementing Equation (8).
3.2 Map Smoothing
The size of the points rendered in stage 2 can be used to impose smoothness on the resulting ambient occlusion map. When the point size is greater than 1 pixel, the points will overlap in the framebuffer, blurring sharp features in the map. This can help counteract the slight variations in intensity due to light direction randomness and undersampling of the ambient hemisphere. If antialiased points are rendered the size can include fractional pixels, giving fine control over the smoothness of the final map. The effect of increasing the point size from 1 to 2 pixels is shown in Figure (6). In these images the texture intensity is AO^4 and nearest-neighbor texture filtering has been used to emphasize the variations in intensity.
3.3 Fixing Seams
The results using this method will show small errors at seams in the texture atlas. This can be avoided by computing a mesh parameterization, or using a texture atlas method
(a) 1.0   (b) 1.5   (c) 2.0
Fig. 6. Contrast enhanced detail of golfball5 AO map demonstrating smoothing using point size
which results in no seams [20]. Otherwise we correct the results in stage 3 using a post-processing step which fixes the texels located at seams. The morphological erosion and dilation operators are often used for binary image analysis and can be generalized to gray-scale images [21]. The erosion operator shrinks and thins regions in an image. Applying this to our AO image results in the boundary texels being discarded. Following this step we apply two steps of dilation. This operator grows and thickens regions in the image. The first dilation step replaces the boundary pixels with the nearest valid AO value, and the second step expands the AO map by 1 pixel further so that bilinear texture filtering will be correct at the chart boundaries in the AO texture atlas.
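The following CPU-side sketch shows one such gray-scale dilation pass over the AO atlas, where texels not covered by any chart inherit the maximum value of their covered neighbors. The validity mask and all names are our assumptions rather than the authors' implementation.

#include <algorithm>
#include <utility>
#include <vector>

// One 3x3 gray-scale dilation pass: uncovered texels (valid == 0) take the
// maximum accessibility value of their covered neighbors, growing every
// chart by one texel (illustrative sketch).
void dilateOnce(std::vector<float>& map, std::vector<char>& valid, int w, int h) {
    std::vector<float> outMap = map;
    std::vector<char> outValid = valid;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            if (valid[y * w + x]) continue;               // already covered
            float best = 0.0f; bool found = false;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                    if (!valid[ny * w + nx]) continue;
                    best = std::max(best, map[ny * w + nx]);
                    found = true;
                }
            if (found) { outMap[y * w + x] = best; outValid[y * w + x] = 1; }
        }
    map = std::move(outMap);
    valid = std::move(outValid);
}

Running this pass twice after one erosion pass mirrors the erode/dilate/dilate sequence described above.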
4 Results
The AO algorithm was implemented in OpenGL 2.0 and GLSL. Experiments were run on a desktop PC featuring an Intel Quad Core QX6700 2.66 GHz CPU and 4 GB RAM; the GPU was a GeForce 8800 GTX with 768 MB VRAM. For all images shown in this paper 512 directions were used to compute a 1024 × 1024 AO map. Figure (7) shows the AO map computed for an elephant and the bones of the hand. The folds in the skin and concavities behind the ears of the elephant are clearly visible. Fine details in the trunk are emphasized. The spatial relationships of the small bones in the wrist are made apparent by the darkening due to proximity. Figure (8) shows the AO map computed for a mesh of the cortical surface of the brain. Note that the ridges and creases (gyri and sulci) are emphasized and clearly distinguishable from one another. Figure (9) shows the AO map computed for a rocker arm, brain surface and pelvic bone.
4.1 Timing
Timing results are shown in Table (1). AO maps of resolution 512 × 512 and 1024 × 1024 were generated with 512 and 1024 light directions. For comparison, times required for GPU computation using shadow maps are also tabulated. The implementation
Fig. 7. Ambient occlusion for elephant (left) and hand (right)
Fig. 8. Ambient occlusion for brain2
Fig. 9. Ambient occlusion for rocker arm, brain1 and pelvis (left to right)
used was ATI's GPU MeshMapper tool. We have observed, both theoretically and experimentally, that our algorithm is linear in the number of directions, so the times for 1024 view directions are approximately double those given in the tables. Profiling our method shows that the majority of time is spent in the vertex processing phase of stage 2.
Table 1. Timing results for 1024 × 1024 AO map computations (times in seconds)

mesh         triangles  vertices  Our, 512 dir.  Shadow Map, 512 dir.  Our, 1024 dir.  Shadow Map, 1024 dir.
crank1       14998      7486      5.9            11.2                  11.3            20.7
rocker arm2  20088      10044     5.9            11.1                  11.8            18.5
pelvis3      63012      31438     6.2            11.3                  12.3            22.6
elephant4    84638      42321     6.7            11.6                  12.1            23.4
brain12      290656     144996    7.3            19.6                  14.6            39.3
brain25      588032     294012    8.5            38.8                  16.9            77.0
hand6        654666     327323    9.3            40.5                  18.7            81.2
Specifically, the vertex texture fetch is the most time consuming aspect of this phase. A more direct render-to-vertex buffer mechanism would likely result in a significant speed-up.
5 Conclusions In this paper we have presented a new approach to precomputing ambient occlusion maps on the GPU. This technique was demonstrated on various surfaces including range scan data and several meshes from engineering and medical applications. Timing comparisons show that our algorithm is 4× faster than the GPU-based shadow mapping technique for large meshes. For smaller meshes our method still has a time advantage of about 2×.
References 1. Phong, B.: Illumination for computer generated pictures. Communications of the ACM 18, 311–317 (1975) 2. McReynolds, T., Blythe, D.: Advanced Graphics Programming Using OpenGL. Morgan Kaufmann, San Francisco (2005) 3. Landis, H.: Production-ready global illumination. In: RenderMan in Production (SIGGRAPH 2002): Course 16 (2002) 4. Dutre, P., Bala, K., Bekaert, P., Shirley, P.: Advanced Global Illumination (2006)
Mesh sources (footnotes 1-6):
1. Sam Drake, Amy Gooch, Peter-Pike Sloan (http://www.cs.utah.edu/~gooch/model repo.html)
2. Aim@Shape (http://shapes.aim-at-shape.net/)
3. 3D-Doctor (http://ablesw.com/3d-doctor/)
4. Robert Sumner, Jovan Popovic (Computer Graphics Group at MIT) (http://people.csail.mit.edu/sumner/research/deftransfer/data.html)
5. Princeton University Suggestive Contour Gallery (http://www.cs.princeton.edu/gfx/proj/sugcon/models/)
6. Georgia Institute of Technology Large Geometric Models Archive (http://www-static.cc.gatech.edu/projects/large models/)
5. Zhukov, S., Iones, A., Kronin, G.: An ambient light illumination model. Rendering Techniques 98, 45–55 (1998) 6. Sloan, P., Luna, B., Snyder, J.: Local, deformable precomputed radiance transfer. In: Proceedings of ACM SIGGRAPH 2005, vol. 24, pp. 1216–1224 (2005) 7. Bunnell, M.: Dynamic ambient occlusion and indirect lighting. GPU Gems 2, 223–233 (2005) 8. Pharr, M., Green, S.: Ambient occlusion. GPU Gems 1, 279–292 (2004) 9. Sattler, M., Sarlette, R., Zachmann, G., Klein, R.: Hardware-accelerated ambient occlusion computation. Vision, Modeling, and Visualization, 331–338 (2004) 10. Méndez-Feliu, À., Sbert, M., Catà, J., Nicolau Sunyer, S.F.: Real-Time Obscurances with Color Bleeding (GPU Obscurances with Depth Peeling). ShaderX 4 (2006) 11. Shanmugam, P., Arikan, O.: Hardware accelerated ambient occlusion techniques on GPUs. In: Proceedings of the 2007 symposium on Interactive 3D graphics and games, pp. 73–80 (2007) 12. Mittring, M.: Finding next gen: CryEngine 2. In: International Conference on Computer Graphics and Interactive Techniques, pp. 97–121 (2007) 13. Rost, R.: OpenGL Shading Language. Addison-Wesley, Reading (2006) 14. Gotsman, C., Gu, X., Sheffer, A.: Fundamentals of spherical parameterization for 3D meshes. ACM Transactions on Graphics 22, 358 (2003) 15. Floater, M., Hormann, K.: Surface Parameterization: a Tutorial and Survey. In: Advances In Multiresolution For Geometric Modelling (2005) 16. Lévy, B., Petitjean, S., Ray, N., Maillot, J.: Least squares conformal maps for automatic texture atlas generation. ACM Transactions on Graphics (TOG) 21, 362–371 (2002) 17. Devroye, L.: Non-uniform random variate generation. Springer, New York (1986) 18. St-Laurent, S.: The Complete Effect and Hlsl Guide. Paradoxal Press (2005) 19. Fernando, R., Kilgard, M.: The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Addison-Wesley Longman Publishing Co., Inc., Boston (2003) 20. Sheffer, A., Hart, J.: Seamster: inconspicuous low-distortion texture seam layout. In: Visualization, VIS 2002, pp. 291–298. IEEE, Los Alamitos (2002) 21. Gonzalez, R., Woods, R.: Digital image processing, 3rd edn. Prentice-Hall, Englewood Cliffs (2007)
Comics Stylization from Photographs Catherine Sauvaget and Vincent Boyer L.I.A.S.D. - Université Paris 8, 2 rue de la liberté, 93526 Saint-Denis Cedex, France {cath,boyer}@ai.univ-paris8.fr
Abstract. We propose a 2D method based on a comics art style analysis to generate stylized comics pictures from photographs. We adapt the line drawing and the colorization according to the depth of each object to avoid depthless problems. Our model extracts image structures from the photograph, generates automatically a depth map from them and performs a rendering to give a comics style to the image. Different comics styles have been realized and are presented. Results prove that this model is suitable to create comics pictures from photographs.
1 Introduction
We present a 2D-based method to transform a photograph into a comics image. Our aim is to avoid the depthless color problems created during the image color segmentation process. This occurs when objects with the same color but with different depths are represented in the final picture with the same color. Based on an analysis of comics graphic art, we put forward a model which extracts image structures, generates a depth map from them and performs a rendering to give a comics style to the image. "Comics" is generally defined as a graphic medium that consists of one or more images conveying a sequential narrative. In this paper, we consider only image generation, so the speech balloon and its associated text are not considered. Comics creation consists of these successive steps: elaborating a sketch, drawing with pencil, then inking and colorizing. Producing a comic may be the work of one individual or of a team (plotter, penciller, inker, colorist...). In computer graphics this graphic medium is classified as non-photorealistic computer graphics. Existing work [1] has proposed a solution to create cartoon shading for a 3D scene. It is based on particular lighting equations, and 3D information is needed: normals, light position, vertex positions. So these methods cannot be applied to a 2D photograph. Image processing techniques are often used to produce artistic styles on an image [2], [3], [4]. Previous works have tried to provide image stylization for efficient visual communication and produce images with a style close to comics. DeCarlo et al. [5] describe a computational approach to abstract photographs and to clarify the meaningful structure in an image. Figure 1 presents a photograph and the result produced. Note that the sense of depth is weakened and it is difficult to estimate the distance of the group in the produced image.
Fig. 1. Stylization and Abstraction of a photograph, DeCarlo et al.
Fig. 2. Abstraction example, Winnemöller et al. and our result
Wang et al. [6] add temporal coherence to produce video stylization, but the problem remains. Winnemöller et al. [7] have proposed an automatic real-time framework that abstracts imagery by modeling color opposition contrast and visual salience in terms of luminance. Figure 2 (except the picture on the right) presents a result of this method. Note that the green bush on the left seems to be closer in the resultant images than in the original photograph, and on the right, the orientation of a brown region is degraded. Moreover, in this example we remark that the depthless color problem appears even when the image is blurred. The same problem exists for image editing software (Gimp, Adobe Photoshop) in which cartoon effects are based on edge detection and pseudo-uniform colorization of the areas. The main drawback of these works is the appearance of depthless areas: objects which actually have different depths in the photograph appear in the image with the same one. As the colorization of the image is realized with no depth information and with the method used to fill in each region (i.e. one color is used per region), the depth disappears. Moreover, these works provide images close to a single comics style and it is impossible to produce different ones.
2 Comics Analysis
There are many comics types and it is impossible to give an accurate definition. We prefer to base this analysis on the different styles used in comics [8]. We first
Comics Stylization from Photographs
1127
present different styles, then we focus on the colors in comics. According to the tools used for strokes and colors, different styles have been identified:
– straight stroke: this is the most frequent cartoon style, without relief or black flats (Edmond Baudoin, Hergé, Tito, Bourgeon, Tabary...).
– stroke with black flats: the character is emphasized and the flats permit dramatization of the picture without half tints (Blutch, Brescia...).
– realistic cartoon: straight strokes are sometimes blended with many black flats. The purpose is to produce relief and to give a realistic atmosphere with contrasted lighting (a dramatization) (Adrian Tomine, Gillon...).
– wash: half-tones for black and white applied by touch with different intensities or by scale. It is used to create a cold atmosphere (Pascal Rabaté...).
– other styles exist, like modeling by stroke (François Schuiten, Bilal, Manara, Moebius...) or a gray-tone stroke (Moebius...).
When used, colors are very important in comics pictures because they create depth. To produce depth, the main principle of the comics creators is to decrease the color saturation according to the depth of the objects in the scene. Moreover, a background color can be used to illustrate objects located at a long distance. They are filled in with a low saturation (gray or a special desaturated color) while the foreground objects are filled in with colors with high saturation. The top of figure 3 presents two original pictures. On the left, we measure the color saturation of the building on the left. It decreases (from 25 to 7) according to the depth. On the right, lights and lighting effects on the corridor are different on the ground according to the depth: colors are not saturated. At the bottom, we modify the original picture (hue and saturation) and the depth disappears (are the cypresses farther away than the hill on the left?). Remark also, on the right, that when the same color is used, the mountain seems closer. Colors can also be used to give an atmosphere (see figure 4). For example, dramatic atmospheres can be produced with browns and dark blues, mysterious atmospheres are generally composed of dark blues and greens, violent atmospheres
Fig. 3. Top: How the illustrators produce depth: (left) "Je suis Légion" by Fabien Nury & John Cassaday, (right) "Le chant des Stryges" by Corbeyran & Guérineau. Bottom: Original, constant saturation, constant tint (from "Quintett" by Cyril Bonin).
Fig. 4. Examples of atmosphere: (left) "Escale dans les étoiles" by Jack Vance, illustrator Alain Brion (2003), (right) "Black OP 4" by Desberg and Labiano (2008)
use oranges and reds, threatening atmospheres are realized mainly with ochres and browns.
3 Our Model: CSM
CSM, the Comics Stylization Model, produces different comics-stylized images from a photograph.
Fig. 5. Framework overview: the three successive parts of CSM, where dashed lines represent parameters given by the user
Our model is composed of three successive parts. The basic workflow of our framework is shown in figure 5. We first extract image structures from the photograph. Then we generate a depth map using them. Finally a stylization process is realized using previous results and a style given by the user. In the following we present each part and detail each component.
3.1 Image Structure and Analysis
As previously described, comics images are often characterized by a pen and ink drawing of contours and a colorization by one color (one tint) per region. So, we propose to detect contours with an edge detection method and the regions in the photograph using a segmentation method and a mean shift filtering method.
Edge Detection. Several works have been proposed for edge detection. We use the method proposed by Meer et al. [9]. This approach consists of reformulating the framework in a linear vector space where the role of the different subspaces is emphasized. Artifacts are then explained quantitatively and an independent measure of confidence in the presence of the employed edge model is provided. It is well adapted to detecting weak edges. We construct a list of contours LC composed of n contours in which each contour Ci (i ∈ [1; n]) is characterized by a starting point and an end point. Contours have no loops (i.e. when a loop or two possible paths are detected, two contours are created). It is used for plane detection and not for stylization. A length threshold (named LT in figure 5) can be used to remove non-significant contours; LC is then updated.
Segmentation. Segmentation can be considered a general technique to extract significant image features, and different approaches exist. We use the mean-shift algorithm [10], a non-parametric procedure for estimating density gradients. The main idea is that the vector of differences between the local mean and the center of a window, named the mean-shift vector, is proportional to the gradient of the probability density at a point. Regions are created in the joint domain by grouping together neighboring pixels which are closer to a convergence point. The regions containing less than M pixels are eliminated and the process is repeated as long as the number of regions given by the user is not reached. Based on this method we segment our photograph with an interactive tool.
Filt. The mean-shift algorithm [10] can also be used to produce a discontinuity-preserving smoothing technique. It is a simplification of the segmentation method where the kernel window moves in the direction of maximum increase in the joint density (i.e. there are no created or eliminated regions). It adaptively reduces the amount of smoothing near abrupt changes in the local structure (i.e. edges). The produced image is more detailed than the segmentation one. The number of regions is not given by the user. Features with a high color contrast are preserved independently of the number of pixels in the region, and regions containing pixels with a low color contrast are almost completely smoothed.
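As a small illustration of the contour-list maintenance described for the edge-detection step, the sketch below prunes non-significant contours with the length threshold LT. The contour data structure and the use of the point count as the length measure are our assumptions.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { int x, y; };
struct Contour { std::vector<Point> points; };     // illustrative structure

// Remove non-significant contours from the contour list LC using the
// length threshold LT (here simply the number of contour points).
void pruneContours(std::vector<Contour>& LC, std::size_t LT) {
    LC.erase(std::remove_if(LC.begin(), LC.end(),
                            [LT](const Contour& c) { return c.points.size() < LT; }),
             LC.end());
}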
3.2 Depth Map Generation
Depth-map generation is a crucial step of our workflow. It is used during the stylization process to avoid the depthless color problem. A lot of work has been done on generating a depth map from an image. Methods based on vanishing-point detection using edge detection are not well adapted to a landscape photograph but more to a city photograph with geometry. [11] estimates the depth information by a color-based segmentation, a specific region detection based on semantic rules and an image classification. The classifier is necessary to obtain a reliable depth map and to tag each part of the picture. An approximated depth map is then generated. Unfortunately, the image classifier is not complete and depth map generation must often be realized
interactively. [12] improves this method but some problems remain: when the inclination of the photograph is not horizontal and when an object like a grid is present, superpixels are created (these details will not be considered here). We propose an approach based on a border-plane detection and a color shading per plane. The depth map is generated from the segmentation and the edge-detection images.
Border-Plane Detection. To detect border planes, we use the edge-detection image and the segmented one. Our border-plane detection is partially interactive: the user selects a pixel P in LC for each border plane on the edge-detection image. Then our algorithm computes the most probable path for each border plane as follows:
1. Find the contour Ci in LC containing P.
2. Starting from P, follow Ci in the two directions and search KPi (KPi1 or KPi2 depending on the direction). KPi can be the first crossing point, the extremity or a point of the border-plane list (in that case, KPi is not evaluated).
3. Find the next point P':
– if KPi is a crossing, choose the next point P' of the crossing which does not belong to (KPi1, KPi2) and which has the nearest characteristics to KPi in the segmented image;
– if KPi is an extremity, search for its nearest point P' of Ck (k ≠ i) in the segmented image which has characteristics close to KPi (a threshold value based on a color difference is used to determine if the characteristics of two points are close). If the distance between KPi and P' is greater than the distance between KPi and the border of the image, or if no P' exists, then go to the last step (condition 1). Else search the key points KPk (i.e. KPk1, KPk2) for P' in Ck. We find P', a point of Ck, and a point of Ci which minimize their Euclidean distance and which belong to KPk and KPi respectively. A new contour is created between these two points.
4. We construct the border-plane list by adding the path between P and P' and we iterate this process (P' becomes P) until condition 1 is verified.
5. Last step: the path is completed by adding a link between KPi and the border of the image.
Applying Color Shadings. We have to fill in the different planes with a color shading. The user chooses a plane and its color-shading direction by adding two points and two color values. The values can be grey levels or a color to create an atmosphere. A linear interpolation is done between these two points. In our model, each value of the depth map is in [0;1] (one component is used for a grey-level depth map, otherwise three components are used) where 1 refers to the camera viewer and 0 to infinity.
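The linear color shading can be sketched as follows for the gray-level case: each pixel of the plane is projected onto the direction defined by the two user points and the two values are interpolated. The data layout and names are our assumptions, not CSM's actual code.

#include <algorithm>
#include <vector>

struct Vec2 { float x, y; };

// Fill the depth-map values of one plane by linear interpolation between
// two user-chosen points p0 and p1 carrying the values v0 and v1.
// Pixels of the plane are projected onto the p0->p1 direction.
void shadePlane(std::vector<float>& depthMap, int w,
                const std::vector<Vec2>& planePixels,
                Vec2 p0, Vec2 p1, float v0, float v1) {
    float dx = p1.x - p0.x, dy = p1.y - p0.y;
    float len2 = dx * dx + dy * dy;
    if (len2 <= 0.0f) return;
    for (const Vec2& p : planePixels) {
        float t = ((p.x - p0.x) * dx + (p.y - p0.y) * dy) / len2;   // projection
        t = std::clamp(t, 0.0f, 1.0f);
        float v = (1.0f - t) * v0 + t * v1;      // 1 = near the camera, 0 = far
        depthMap[static_cast<int>(p.y) * w + static_cast<int>(p.x)] = v;
    }
}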
3.3 Stylization
Stylization is the last step of our workflow. It uses the depth map previously generated and the results produced during the image structure and analysis step. CSM realizes a colorization and a line-drawing image that can be mixed to produce the comics-stylized image.
Colorization. To realize the colorization, the user chooses the "filt image" or the segmented one. Then the depth map created above is applied automatically. To generate the colorization image we represent the colors in the HSV model. Let P(Ph, Ps, Pv) be a pixel of the colorization image, I(Ih, Is, Iv) the pixel of the segmented or "filt image", depending on the user choice, and DM(DMh, DMs, DMv) the pixel of the depth map with the same coordinates as P. We compute the saturation of each pixel P of the colorization image as Ps = DMv × Is or Ps = DMv × MaxSat(DepthMap), depending on the style chosen by the user. In the second case, the maximal saturation of the depth map is used and affects the global saturation of the colorization image. In fact, if a dark depth map is created, the colorization image will have low saturation values. For the other components: Ph = Ih if a grey-level depth map is used, otherwise Ph = DMh; and Pv = Iv. We are able to produce a binarization with a threshold to obtain black-and-white colorization, grey levels or short straight lines (see figure 6). Note that CSM is able to mix multiple colorization images (an atmosphere, a binarization...) using classical functions (difference, addition, threshold value) to select pixels from these and produce the colorization image used to generate the comics-stylized image.
Line Drawing. To realize the line drawing, the edge-detection image corresponding to a sketch is used. A length threshold can be applied to select significant contours. The line-drawing image can be a simple copy of the edge-detection image or one of the three styles available:
– thick lines: to draw the contours as distorted and thick lines, we divide the contours into 5 equal parts. We increment the thickness of the extremal parts by one pixel, the medium parts by two pixels, and the center part by three pixels. Increasing the thickness of the contours produces aliasing. To avoid this artifact, we apply a 3x3 median kernel filter. Finally, to obtain contours, we apply a threshold (to convert the gray scale previously obtained to black or white).
– short breaking lines: the user specifies a length and an orientation. We divide each contour and realize a rotation at the center of each breaking line according to these parameters.
– spline lines: for each contour, we create a spline:
1. We determine a list of control points: we decompose the contour into successive straight lines. The control points are placed at the junction of two consecutive straight lines.
2. We update this list by adding and/or deleting some control points:
• due to the discretization and the edge-detection approximation, it is possible that for a given straight line in the photograph, we detect more than one straight line in the previous step. To solve this problem, we compute the vectors between two consecutive control points and, if two consecutive vectors are similar (i.e. the dot product is close to 1), the control point belonging to the two vectors is removed.
• when an acute or a reflex angle is detected between two straight lines (i.e. the dot product of two consecutive vectors is close to 0), we want to create a curve close to the straight lines (we do not want to approximate a right or a reflex angle by a "circle arc"). So we propose creating a cusp by duplicating the control point (i.e. a control point with the same coordinates is added to the list).
3. The new contour is created with a Catmull-Rom spline because it guarantees that each control point is hit exactly and the tangents of the generated curve are continuous over multiple segments.
Comics-Stylized Image. Comics-stylized images are produced by mixing the colorization and the line-drawing images. We are able to produce the styles presented above and others (see section 2):
– straight strokes are realized with a colorization, generally without atmosphere, and contours obtained by the edge detection or thick contours.
– strokes with black flats look like the previous ones, but we mix in the black flats obtained by binarization during the colorization process.
– realistic cartoons are obtained as strokes with black flats, but an atmosphere is always included.
– wash is realized using the "filt image", an atmosphere and one of the line-drawing styles.
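As a concrete illustration of the colorization rule above (Ps = DMv × Is, Ph = Ih or DMh, Pv = Iv), the following sketch shows the per-pixel computation in the HSV model. The structures and parameter names are illustrative assumptions rather than CSM's implementation.

struct HSV { float h, s, v; };     // hue, saturation, value in [0, 1]

// Per-pixel colorization: saturation is scaled by the depth-map value so
// that distant regions lose saturation; the hue comes from the depth map
// when a colored (atmosphere) depth map is used.
HSV colorize(const HSV& input, const HSV& depth, bool grayLevelDepthMap,
             bool useGlobalMaxSaturation, float maxSatOfDepthMap) {
    HSV out;
    out.s = useGlobalMaxSaturation ? depth.v * maxSatOfDepthMap
                                   : depth.v * input.s;   // Ps = DMv x Is
    out.h = grayLevelDepthMap ? input.h : depth.h;        // Ph = Ih or DMh
    out.v = input.v;                                      // Pv = Iv
    return out;
}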
4 Results
This section presents images produced with CSM. The CPU time needed to produce the comics-stylized image depends mainly on the image size (less than one second for an 800×600 image on a Pentium 3.6 GHz with 1 GB of memory). All depth maps have been generated automatically (i.e. the user never modifies the border planes). The top of figure 6 illustrates the depthless problem. Remark that for the leaves of the tree on the right, it is very difficult to see what is in the foreground and in the background when the depth map is not used (segmented). On the contrary, it is very easy to see it when using a depth map. At the top right an atmosphere is added. The last two use Catmull-Rom contours. At the bottom, from left to right, we present different line-drawing styles: thick line contours, with short straight-line colorization, short breaking-line contours and short straight-line colorization with black flats. As one can see in figure 7 (top), the depth map produced is quite good. A little problem occurs with the tree on the right. The resultant image has been
Fig. 6. Results on a bridge. Top: original; segmented; with depth map; with atmosphere in the depth map. Bottom: line drawing and post-treatment colorization examples.
Fig. 7. Results on a landscape (Top): original, depth map, filt with atmosphere. Results on a face (movie ”Slevin”) (Bottom): original; B & W; strokes with black flats.
produced by mixing the segmented image with the colored depth map. Note that the saturation increases far into the picture. The contours have been created with Catmull-Rom splines. The bottom shows different comics styles. The black-and-white picture illustrates the binarization with hand-like drawing contours (spline lines). The last image is a mix between a colorization and the binarization. As a last example, we produce stylized images with thick contours (right) from the photograph in figure 2. Even if the photograph is blurred, we preserve the depth in the picture (see the bush on the left) and the orientation (on the right).
5 Conclusion
We have proposed a model to transform a photograph into comics-stylized images. CSM provides images without the depthless color problem. Based on an image analysis and structure, it constructs a depth map used to perform the stylization process. The results prove that the depthless problem disappears when a depth map is integrated. Different styles are proposed and can be chosen interactively by the user.
Future work will be done to increase the number of comics styles available. For example, a blur could be applied to the image according to a particular depth, line drawing could be weighted according to the depth, details can be preserved for a given distance. . . Moreover, we will improve the depth-map generation: a given pixel for each border plane will be automatically detected; an automatic verification will be integrated; and styles will be added to the filling in part of the depth map generation. We will also try to apply this model to a movie in which temporal coherence remains a problem.
References 1. Lake, A., Marshall, C., Harris, M., Blackstein, M.: Stylized rendering techniques for scalable real-time 3d animation. In: NPAR, pp. 13–20 (2000) 2. Treavett, S.M.F., Chen, M.: Statistical techniques for the automatic generation of non-photorealistic images. In: Proc. 15th Eurographics UK Conference (1997) 3. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.: Image analogies. In: SIGGRAPH, pp. 327–340 (2001) 4. Hertzmann, A.: Paint by relaxation. In: Computer Graphics International, pp. 47– 54 (2001) 5. DeCarlo, D., Santella, A.: Stylization and abstraction of photographs. In: SIGGRAPH, pp. 769–776 (2002) 6. Wang, J., Xu, Y., Shum, H.Y., Cohen, M.F.: Video tooning. In: SIGGRAPH, pp. 574–583 (2004) 7. Winnem¨ oller, H., Olsen, S.C., Gooch, B.: Real-time video abstraction. ACM Trans. Graph. 25, 1221–1226 (2006) 8. McCloud, S.: Making Comics: Storytelling Secrets of Comics, Manga and Graphic Novels. Harper Paperbacks (2006) 9. Meer, P., Georgescu, B.: Edge detection with embedded confidence. IEEE Trans. Pattern Anal. Mach. Intell. 23, 1351–1365 (2001) 10. Comaniciu, D., Meer, P.: Robust analysis of feature spaces: color image segmentation. In: CVPR, pp. 750–757 (1997) 11. Battiato, S., Curti, S., La Cascia, M., Tortora, M., Scordato, E.: Depth map generation by image classification. In: Proceedings of the SPIE, vol. 5302, pp. 95–104 (2004) 12. Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. ACM Trans. Graph. 24, 577–584 (2005)
Leaking Fluids Kiwon Um1 and JungHyun Han1,2 1
Department of Computer and Radio Communications Engineering, Korea University 2 Institute of Information Technology Advancement, Ministry of Knowledge Economy, Korea {kiwon um,jhan}@korea.ac.kr
Abstract. This paper proposes a novel method to simulate the flow of the fluids passing through the boundary, which has not been studied in the previous works. The proposed method requires no significant modification of the existing simulation techniques. Instead, it extends the common fluid simulation techniques by adding two post-steps, adjustment and projection. Therefore, the proposed method can be easily integrated with existing techniques. Specifically, the method extends the staggered Marker-and-Cell scheme, semi-Lagrangian advection, level set method, fast marching method, etc. With the extensions, the method can successfully produce the realistic behavior of the leaking fluids.
1 Introduction
In the computer graphics area, physically based fluid simulation has been widely studied for the past decade, and we have seen various realistic fluid-related special effects. So far, research on fluid simulation has assumed that the fluid is sealed up by the boundary such that the fluid cannot flow in or out. This paper proposes a new method to simulate leaking fluid behaviors where the fluid flow passes through the boundary; i.e., instead of completely fluid-proofing boundaries, this paper tackles leaking boundaries such as cloth, cracked walls, sponge, etc. The major strength of the proposed method is that it does not require any significant modification of the existing fluid simulation techniques: it only requires two additional steps, adjustment and projection. In the additional steps, the flow of the fluid is adjusted as if the flow were disturbed by the boundary, and reprojected to guarantee the important property of fluid flow, incompressibility. In fluid simulation, there exists a tradeoff between the issues of reality and efficiency. When reality is stressed, more simulation cost is required. On the other hand, the simulation results often lose their rich visual effects when we pursue efficiency. The simulation cost is remarkably increased when the scene being simulated involves a complex configuration, where the fluid has to interact with various types of materials. Instead of following the physically correct flow models, this paper adopts a strategic method and produces plausible simulation results with flexibility and efficiency in implementation.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the foundations of the general simulation method used as a base model for the leaking fluid method proposed in this paper. Section 4 describes the main contribution of this paper, which is an extension of the base model to achieve the leaking fluid effect. The simulation results are presented in Section 5. Finally, Section 6 concludes the paper.
2 Related Work
Since the seminal work of Stam [1] that introduced the unconditionally stable semi-Lagrangian advection method and the Helmholtz-Hodge Decomposition, a lot of fluid simulation methods have been developed in the computer graphics field. Foster and Fedkiw [2] proposed a hybrid method of particle and level set in water simulation for increasing the surface’s visual details. The vorticity confinement method which adds some visual richness to the fluid flow was proposed in [3] as well. Losasso et al. [4] and Irving et al. [5] proposed the ways to simulate the fluids more effectively with octree and tall-cell structure, respectively. All of the above methods have focused on the fluid-proofing boundary, and no fluid can flow through the boundary. Carlson et al. [6] proposed the rigid fluid method extending the distributed Lagrange multiplier technique. The method handles the interaction between fluids and rigid objects. However, this method does not consider the interaction between fluids and deformable objects. Later, Guendelman et al. [7] proposed a novel method to simulate the two-way coupling between the fluids and the thin shell objects such as the water-proofing films. Although these methods have been successful in producing the realistic results in fluid-object interaction, they can handle the completely fluid-proofing objects only. Recently, Lenaerts et al. [8] proposed a particle-based fluid simulation method to simulate the porous flow. Unlike their work based on the Lagrangian view using smoothed particle hydrodynamics (SPH) technique [9], this paper’s method is based on the discretized grid (the Eulerian view) using the staggered Markerand-Cell (MAC) scheme.
3 Foundations of Fluids Simulation
This section provides a basis for the contribution of this paper presented in Section 4. Fluid behavior can be represented as various forms of partial differential equations based on the following fundamental physical principles of fluid dynamics [10]:
– Mass conservation
– Newton's second law
– Energy conservation.
The first two principles lead to continuity and momentum equations, respectively, and form the basis of the Navier-Stokes equations. The energy conservation principle leads to the energy equation, but has largely been ignored in
the computer graphics area. The following subsections describe the details about these equations.
3.1 Governing Equations
This paper assumes the fluids are inviscid and incompressible. To model this kind of fluid behavior, the Euler equations are adequate, in which the viscosity term of the Navier-Stokes equations is not present:

∇ · u = 0    (1)

∂u/∂t = −(u · ∇) u − ∇p/ρ + f/ρ    (2)
where u is the velocity field (u, v) in 2D and (u, v, w) in 3D, p is the pressure, ρ is the density of the fluid, and f is the external forces. Eq. (1) is the continuity equation corresponding to the mass conservation law, often called the divergence-free condition, and Eq. (2) is the momentum equation derived from Newton's second law. To solve these equations, this paper uses a discretized Eulerian view in a staggered Marker-and-Cell (MAC) grid scheme [11]. In this scheme, the velocity components are stored separately on the faces of the cell and the pressure is stored in the center of the cell. With this scheme the time steps proceed with the splitting method [12], which solves each term of the momentum equation at each sub-step: external forces, advection, and projection with the continuity equation.
External Forces. The external forces, f, mainly affect the flow of the fluids. Note that f can contain many kinds of forces such as the gravitational force with g = (0, −9.8, 0) m/s², the surface tension force [13], the vorticity confinement force [3], the Coriolis force, etc. After finishing this sub-step, the result field becomes ũ:

ũ = uⁿ + (Δt/ρ) f.    (3)
(4)
To perform this sub-step, this paper uses the semi-Lagrangian advection method [1], well-known as an unconditionally stable method. Projection. One significant property of fluid flow is the incompressibility. The projection step enforces this property by solving a Poisson equation derived from Eq. (5) with the boundary condition. u∗ = u ˆ−
Δt ∇p ρ
and ∇ · u∗ = 0.
(5)
This leads to a linear system involving a huge sparse matrix A which is symmetric positive semi-definite, i.e., qᵀAq ≥ 0 for any nonzero vector q. Many algorithms exist to solve such a linear system. One useful method for the sparse, positive semi-definite case is the iterative Conjugate Gradient method. This paper uses the Modified Incomplete Cholesky Conjugate Gradient, Level Zero algorithm. For further details, readers are referred to [14].
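For illustration, the following is a minimal Python sketch (not the authors' implementation) of a plain conjugate gradient solver; the Modified Incomplete Cholesky preconditioning used in the paper is omitted, and the system matrix is supplied only as a matrix-vector product so that the sparse Poisson matrix never has to be stored explicitly.

```python
import numpy as np

def conjugate_gradient(apply_A, b, tol=1e-6, max_iter=1000):
    """Plain CG for A x = b with A symmetric positive (semi-)definite.

    apply_A: callable returning A @ x for a given vector x.
    The MIC(0) preconditioner of the paper is omitted for brevity.
    """
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```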
3.2 Level Sets
To represent the surface of the liquid, this paper uses an implicit surface method based on a distance function φ. This function is known as the level set function, defining the distance from the iso-contour of the fluid: the inside of the fluid (liquid area) is negative, and the outside (air or empty area) is positive. With this method, the surface mesh to render can be constructed by tracking the iso-contour where φ = 0 using the marching cubes algorithm. Interested readers are referred to [15] for level set methods and [16] for the marching cubes algorithm.

Advection. With the staggered MAC scheme, the level set values are generally stored at the nodes of the cells and advected by velocity values interpolated at those positions. As with the velocity advection, the level set advection is performed by the semi-Lagrangian advection method. Notice that Eq. (6) for level set advection is similar to that for velocity advection:

φⁿ⁺¹ = φⁿ − Δt (u · ∇) φⁿ.   (6)
Redistancing. As the level set field is advanced over time, the distance property degrades gradually. It is therefore necessary to regularly reconstruct the distance function so that it satisfies the property expressed by the Eikonal equation, |∇φ| = 1. This process is called redistancing. This paper uses the hybrid method proposed in [17] to perform both redistancing and velocity extension efficiently using the Fast Marching Method. Further information can be found in [18].
4 Leaking Model
Previous studies of fluid simulation have focused on preventing fluids from flowing out of or into the boundary. In the real world, however, many kinds of boundaries are not perfectly fluid-proof. For example, cloth and sponges are porous, so water can permeate and pass through them. This section presents simulation techniques to support such a leaking boundary.
4.1 Leaking Boundary
If the boundary completely prevents the fluid from flowing in and out, the normal velocity of the fluid on the boundary must be zero, which is well known as the slip or zero boundary condition in common fluid simulation methods. These boundary models, however, do not properly handle many kinds of materials in the real world. The leaking boundary is a new boundary model for boundaries that are not completely fluid-proof. The leaking boundary disturbs the flow of the fluid, and the material property of the boundary determines how much the flow is damped:

u** · n = (kleak u* + ubnd) · n,   0 ≤ kleak ≤ 1   (7)
where kleak is the leaking coefficient depending on the material properties, ubnd is the velocity of the boundary, and u* is the intermediate velocity coming from the projection step, i.e., the undisturbed flow. The leaking coefficient kleak is adjusted according to the material properties of the boundary. Notice that if kleak is zero, for a perfectly solid material preventing the flow from leaking, Eq. (7) is the same as the ordinary solid boundary condition.
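A minimal sketch of the velocity adjustment of Eq. (7) is given below; it assumes that the face-normal velocity components of the MAC grid are available as an array together with a boolean mask of the leaking faces, which is a simplification of the actual data layout.

```python
import numpy as np

def adjust_leaking_faces(u_star, leak_faces, k_leak, u_bnd=0.0):
    """Apply Eq. (7) to the face-normal velocity components.

    u_star:     array of face-normal velocities after the ordinary projection
    leak_faces: boolean mask marking faces that lie on the leaking boundary
    k_leak:     leaking coefficient in [0, 1] (0 recovers the solid boundary)
    u_bnd:      normal velocity of the boundary itself (0 for a static boundary)
    """
    u_out = u_star.copy()
    u_out[leak_faces] = k_leak * u_star[leak_faces] + u_bnd
    return u_out
```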
4.2 Incompressibility
If the fluid permeates the boundary, the cells which contain this boundary are also considered as fluid cells. This implies that the incompressibility of the flow should also be guaranteed for these cells. Note, however, that this condition changes the discretized Poisson equation. For instance, if the cells (i, j), (i − 1, j), (i, j − 1), and (i, j + 1) are fluid cells and the cell (i + 1, j) is the leaking boundary in 2D, the discretized Poisson equation for the x direction is as follows:
Fig. 1. A sequence of images showing the simulation results of breaking-dam examples in 2D without the leaking boundary (top) and with the leaking boundary (bottom), where the leaking coefficient kleak is 0.2
Table 1. The full velocity update process

Common velocity update process to simulate fluids:
  External forces:  ũ = uⁿ + (Δt/ρ) f
  Advection:        û = ũ − Δt (ũ · ∇) ũ
  Projection:       u* = û − (Δt/ρ) ∇p  and  ∇ · u* = 0
Additional velocity update process to simulate the leaking fluids:
  Adjustment of the velocity field on the leaking boundary:  u** · n = (kleak u* + ubnd) · n
  Projection with the leaking boundary:  uⁿ⁺¹ = u** − (Δt/ρ) ∇p  and  ∇ · uⁿ⁺¹ = 0
Fig. 2. A sequence of images comparing the results with different leaking coefficients in the breaking-dam example, kleak = 0.1 (top) and kleak = 0.4 (bottom)
4p_{i,j} − p_{i+1,j} − p_{i−1,j} − p_{i,j+1} − p_{i,j−1} = −(ρΔs/Δt) [ (kleak u*_{i+1/2,j} + ubnd) − u**_{i−1/2,j} + v**_{i,j+1/2} − v**_{i,j−1/2} ]   (8)
where u** denotes the velocity damped by the leaking boundary according to Eq. (7). Solving this equation and updating the velocity using the resulting pressure field with

uⁿ⁺¹ = u** − (Δt/ρ) ∇p  and  ∇ · uⁿ⁺¹ = 0   (9)
result in the final velocity field uⁿ⁺¹.

4.3 Simulation Process
The main and distinct advantage of the proposed method is that it does not require restructuring existing fluid simulation methods. Only the addition of the adjustment
Fig. 3. A sequence of images showing the simulation results of breaking-dam example in 3D without the leaking boundary
Fig. 4. A sequence of images showing the simulation results of breaking-dam example in 3D with the leaking boundary, where kleak = 0.3
and projection steps is required. Therefore, users can exploit various existing fluid simulation methods without considerable modifications. The full process to simulate the fluid with the additional steps is summarized in Table 1.
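The order of the sub-steps in Table 1 can be summarized as follows; this is a schematic sketch in which the sub-step implementations are passed in as callables, not the authors' code.

```python
def step_leaking_fluid(u, dt, rho, f, advect, project, adjust_leak):
    """One time step following Table 1; the common sub-steps and the two
    additional leaking-boundary sub-steps are supplied as callables, so this
    sketch only fixes the order of operations."""
    u = u + dt / rho * f      # external forces, Eq. (3)
    u = advect(u, dt)         # semi-Lagrangian advection, Eq. (4)
    u = project(u, dt, rho)   # ordinary pressure projection, Eq. (5)
    u = adjust_leak(u)        # velocity adjustment on the leaking boundary, Eq. (7)
    u = project(u, dt, rho)   # second projection including leaking cells, Eq. (9)
    return u
```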
5 Experimental Results
The proposed algorithm has been implemented on a PC with an Intel Core2 Duo E6600 processor and 2 GB of memory. For all experiments, only the gravitational force with g = (0, −10.0, 0) m/s² is used as the external force driving the fluid flow. The simulation resolutions are 50 × 50 in 2D and 50 × 50 × 50 in 3D. The density is set to 1000, which is close to the value for water.
Figure 1 shows the simulation results with and without the leaking boundary. As shown in the first row, the flow is only blocked by the wall when there is no leaking boundary. In contrast, the flow is disturbed at the boundary when the leaking boundary is added. Figure 2 shows the different results obtained with different coefficients for the leaking boundary: kleak is 0.1 for the first row and 0.4 for the second row. By varying the leaking coefficient, we can control the amount of fluid that leaks through the boundary. Figures 3 and 4 show similar simulation results in 3D.
6 Conclusion and Future Work
This paper presented a new method to simulate the flow of fluids passing through a boundary, which has rarely been studied in previous work. The proposed method is flexible in that it can easily be built on top of existing fluid simulation methods; it requires only two additional steps and is easy to implement as well. In this paper, only a static leaking boundary is considered, i.e., only the leaking boundary influences the fluid while the influence of the fluid on the boundary is not considered; the interaction is one-way. Two-way coupling between fluids and objects is necessary for many kinds of special effects, and the current algorithm is being extended toward such two-way interaction. Further, the simulation of squeezing wet materials like sponges and towels is also being investigated.
Acknowledgement This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-314-D00366).
References 1. Stam, J.: Stable fluids. In: SIGGRAPH 1999: Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 121–128. ACM Press/Addison-Wesley Publishing Co., New York (1999) 2. Foster, N., Fedkiw, R.: Practical animation of liquids. In: SIGGRAPH 2001: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 23–30. ACM, New York (2001) 3. Fedkiw, R., Stam, J., Jensen, H.W.: Visual simulation of smoke. In: SIGGRAPH 2001: Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 15–22. ACM Press, New York (2001) 4. Losasso, F., Gibou, F., Fedkiw, R.: Simulating water and smoke with an octree data structure. In: SIGGRAPH 2004: ACM SIGGRAPH 2004 Papers, pp. 457– 462. ACM, New York (2004) 5. Irving, G., Guendelman, E., Losasso, F., Fedkiw, R.: Efficient simulation of large bodies of water by coupling two and three dimensional techniques. In: SIGGRAPH 2006: ACM SIGGRAPH 2006 Papers, pp. 805–811. ACM, New York (2006)
6. Carlson, M., Mucha, P.J., Turk, G.: Rigid fluid: animating the interplay between rigid bodies and fluid. In: SIGGRAPH 2004: ACM SIGGRAPH 2004 Papers, pp. 377–384. ACM, New York (2004)
7. Guendelman, E., Selle, A., Losasso, F., Fedkiw, R.: Coupling water and smoke to thin deformable and rigid shells. In: SIGGRAPH 2005: ACM SIGGRAPH 2005 Papers, pp. 973–981. ACM, New York (2005)
8. Lenaerts, T., Adams, B., Dutré, P.: Porous flow in particle-based fluid simulations. In: SIGGRAPH 2008: ACM SIGGRAPH 2008 Papers. ACM Press, New York (to appear, 2008)
9. Monaghan, J.J.: Smoothed particle hydrodynamics. Reports on Progress in Physics 68, 1703–1759 (2005)
10. Anderson, J.D.: Computational Fluid Dynamics: The Basics With Applications. McGraw-Hill, New York (1995)
11. Foster, N., Metaxas, D.: Realistic animation of liquids. Graph. Models Image Process. 58, 471–483 (1996)
12. Bridson, R., Müller-Fischer, M.: Fluid Simulation: SIGGRAPH 2007 Course Notes. In: SIGGRAPH 2007: ACM SIGGRAPH 2007 Courses, pp. 1–81. ACM, New York (2007)
13. Brackbill, J.U., Kothe, D.B., Zemach, C.: A continuum method for modeling surface tension. J. Comput. Phys. 100, 335–354 (1992)
14. Shewchuk, J.R.: An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA (1994)
15. Osher, S., Fedkiw, R.: Level Set Methods and Dynamic Implicit Surfaces. Applied Mathematical Sciences, vol. 153. Springer, New York (2003)
16. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3D surface construction algorithm. SIGGRAPH Comput. Graph. 21, 163–169 (1987)
17. Adalsteinsson, D., Sethian, J.: The fast construction of extension velocities in level set methods. J. Comput. Phys. 148, 2–22 (1999)
18. Sethian, J.: Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science. Cambridge University Press, New York (1999)
Automatic Structure-Aware Inpainting for Complex Image Content Patrick Ndjiki-Nya1, Martin Köppel1, Dimitar Doshkov1, and Thomas Wiegand2 1 Image Processing Department Fraunhofer Institute for Telecommunications - Heinrich-Hertz-Institut Berlin, Germany
[email protected] 2 Image Communication Faculty of Electrical Engineering and Computer Science Technical University of Berlin, Germany
Abstract. A fully automatic algorithm for the substitution of missing visual information is presented. The missing parts of a picture may have been caused by damage to or transmission loss of the physical picture. In the former case, the picture is scanned and the damage is considered as holes in the picture, while in the latter case the lost areas are identified. The task is to derive subjectively matching content to be filled into the missing parts using the available picture information. The proposed method arises from the observation that dominant structures, such as object contours, are important for human perception. Hence, they are accounted for in the filling process by using tensor voting, which is an approach based on the Gestalt laws of proximity and good continuation. Missing textures surrounding dominant structures are determined to maximize a new segmentation-based plausibility criterion. An efficient post-processing step based on a cloning method minimizes the annoyance probability of the inpainted textures given a boundary condition. The experiments presented in this paper show that the proposed method yields better results than the state of the art. Keywords: Inpainting, texture synthesis, structure, tensor voting, cloning.
1 Introduction

The regeneration of missing information is known as inpainting, image completion or image stuffing in the literature. Inpainting relates to the seamless insertion of missing samples into large unknown regions of a given picture. Typically, this is a very challenging task with applications to image restoration (scratch removal), wireless image transmission (lost block recovery), and special effects (object removal). Fully automated methods are required here to ensure time-efficient processing of images and videos. The inpainting algorithm by Bertalmio et al. [1] propagates image Laplacians in the isophote direction from the surrounding texture into the unknown region. Ballester et al. proposed a formulation of the inpainting problem in a variational framework [2], while Levin et al. [3] use an image-specific prior to perform image inpainting in the
gradient domain. Jia and Tang [4] propose an algorithm that constrains inpainting by using segmentation information and tensor voting [5] for dominant structure interpolation. Their approach, however, assumes relatively simple structure properties. Criminisi et al. [6] propose an exemplar- and patch-based method that steers the filling order of the missing texture via a confidence map. Their framework is, however, not generic, as only linear structures are properly handled. Huang et al. [7] propose a global algorithm that formulates inpainting as an energy minimization problem with Markov random fields. Belief propagation is used to solve this problem. Their approach, however, yields annoying artifacts for some typical test pictures. Some semi-automatic systems require the user to provide hints regarding the source region, depth information [8], or dominant structures [9] for successful inpainting. In this paper, images are split into structure and texture information. It is thereby assumed that dominant structures are of salient relevance to human perception [10]. Structures adjacent to the unknown regions are interpolated into them, and the interpolated shapes are used to constrain the filling process. A new post-processing method is applied to enhance the inpainting result. No interactive guidance is required. An overview of the approach is presented in Sec. 2, while a more detailed description of the proposed inpainting algorithm is given in Sec. 3 (segmentation), Sec. 4 (structure interpolation), Sec. 5 (filling process), and Sec. 6 (post-processing). Experimental results are presented in Sec. 7.
2 Overall Inpainting Method The overall approach presented in this paper can be illustrated using Fig. 1. It is assumed that the incoming picture P holds an unknown, closed region Ω shown as a black hole in the input picture. The proposed processing chain encompasses segmentation, structure detection and interpolation, filling, and post-processing.
Fig. 1. Block diagram of proposed structure-aware inpainting algorithm
Spatial segmentation provides a mask depicting homogeneous regions, where the homogeneity criteria used in this work are defined in Sec. 3. Based on the segmentation mask, dominant structures such as object or texture boundaries are determined. Using a new algorithm, structures adjacent to Ω are robustly propagated into the unknown region to constrain the filling process (cf. Sec. 4). Subsequently, a patch-based, Markovian filling algorithm is used in this framework. The underlying assumption is thereby that each texture sample is predictable from a small set of spatially neighboring samples and is independent of the rest of the texture. As a result of the
processing chain shown in Fig. 1, Ω is filled with valid samples copied from its neighborhood. Although the filling module is robust (cf. Sec. 5), noticeable artifacts may still occur. For this reason, a post-processing module is required to improve the visual perception by applying covariant derivatives [11]. The latter mathematical problem is adapted to the present framework in this paper (cf. Sec. 6). A more detailed description of the inpainting modules, as depicted in Fig. 1, is given in the following.
3 Segmentation

The spatial segmentation algorithm used in this paper is based on a multi-resolution histogram clustering method proposed by Spann and Wilson [12]. It has been selected because it has been shown to be a good compromise between segmentation efficiency and complexity [13]. The fundamental assumption of the method by Spann and Wilson is object property invariance across spatial scales. Their approach generates a multi-resolution image pyramid by applying a quad-tree smoothing operation to the original image. Homogeneous regions are extracted at the highest level of the quad-tree, i.e., at the lowest spatial resolution, via statistical classification, where segments are derived from peaks and valleys of luminance histograms. The classification step is followed by a coarse-to-fine boundary estimation based on the partition obtained at the top level of the pyramid. No a priori information, such as the number of segments, is required in this framework. We have extended the approach by Spann and Wilson [12] to account for color information [14]. For that, the components of the color space in which the given input image is represented are decorrelated using principal component analysis [15]. Among the resulting color components, the one with the most discriminative power is selected for histogram clustering. It is shown in [14] that this approach significantly improves the performance of the fundamental algorithm. An exemplary segmentation result is shown in Fig. 2.
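A minimal sketch of the color decorrelation step follows; it approximates "discriminative power" by the variance of the projected component, which is an assumption of this sketch rather than the exact criterion of [14].

```python
import numpy as np

def most_discriminative_component(image):
    """Decorrelate the color components of an (H, W, 3) image via PCA and
    return the single projected component with the largest variance."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    pixels -= pixels.mean(axis=0)
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    principal = eigvecs[:, -1]               # direction of largest variance
    component = pixels @ principal
    return component.reshape(image.shape[:2])
```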
4 Structure Interpolation

A new structure interpolation method is presented in this paper. It is based on the segmentation mask determined in the previous section. Structures adjacent to Ω are determined and propagated into Ω via tensor voting [5]. The major benefit of the proposed approach resides in its ability to handle complex structure constellations in the vicinity of Ω.

4.1 Structure Detection

In this paper, structure detection is based on the segmentation result. Hence, dominant structures are identified as segment boundaries (cf. Fig. 2). Structure occurrence in the vicinity of Ω is "denoised" by assigning small blobs, i.e., blobs with a size of less than 1% of the overall picture size, to the surrounding texture class. This threshold should, however, be revisited (reduced) for pictures larger than 512x512.
4.2 Tensor Voting

Tensor voting is a computational framework for the perceptual organization of generic tokens based on the Gestalt principles of proximity and good continuation [5]. In this framework, tokens are represented via tensors, which correspond to symmetric, positive semi-definite 2x2 matrices. The voting process yields interaction among tokens. Tensors and voting are connected by a dense tensor field, also known as the voting field, which postulates smooth connections among tokens. In 2D, the conceptual formulation of the tensor voting problem is as follows: given two points, assuming that there exists a smooth curve connecting them and that the normal to the curve at the first point is known, the most likely curve normal direction at the second point is to be determined.

The known tokens (e.g., dominant structure samples) are first encoded using tensors. The information they hold is propagated to their neighborhood via voting. The size of the considered neighborhood is thereby steered by a single scale parameter σ. A large σ yields a higher degree of smoothness and reduced sensitivity to noise, while a small σ yields a local voting process and an increased sensitivity to noise [5]. The votes cast by the known tokens to a given location are accumulated to give a predicted tensor at that location. The magnitude of each vote thereby indicates the confidence that the voter and the receiving location belong to the same perceptual structure [5]. Once all discrete locations within Ω have been assigned votes, the smooth curve (in 2D) that connects the tokens is determined. Only unknown tokens featuring a high likelihood of belonging to the same structure as the known tokens are considered [5]. The likelihood is thereby derived from the "likelihood map" (or curve saliency map) of the tensor voting algorithm [5].

4.3 New Framework for Processing Complex Structure Constellations

Structure interpolation is initialized by providing the tensor voting algorithm with the structure ends adjacent to Ω. In order to achieve valid and robust tensors at unknown locations, it is required to limit the observation data (known structure ends) to the vicinity of Ω, so that only local curvature information is included in the tensor voting process. For that, structure information is considered only within a margin of thickness ∆ around Ω (cf. Fig. 3, top left). In order to determine the most likely interpolated structure constellations, the segmentation constraint is relaxed to further
• allow complex structure constellations in Ω,
• weaken the impact of segmentation artifacts such as over-segmentation.
Based on the curve saliency feature derived from tensor voting, the likelihood of interpolated structure sets, as shown in Fig. 3 (top left), is determined. For that, each known structure pair is submitted to a voting process. As a result, the locations influenced by the known structure ends are each assigned a tensor from which the corresponding curve saliency is derived (cf. "likelihood map", Fig. 3, top right). The likelihood that a structure pair is connected is now determined by iteratively thresholding the "likelihood map" with decreasing saliency thresholds. The threshold is thereby decreased from
100 to 0.5 with a step width of 0.5. Hence, the "likelihood map" is increasingly populated by decreasingly salient points. The iterative process is stopped once the given structure pair is connected; the corresponding threshold gives the connection likelihood. The overall likelihood of a set of interpolated structures adjacent to Ω is now to be determined (cf. Fig. 3, top left). Hence, all combinations of the interpolated structures are evaluated, where it is required that every structure end is connected to at least one further structure end. This approach, however, typically yields increased computational complexity. For that, the process is accelerated by limiting the maximum number of connections leaving a single structure end. The overall likelihood is given by the mean connection likelihood of the interpolated structures. The best option is given by the structure constellation that maximizes the mean likelihood. Once the best option is determined, tensor voting is repeated to achieve one-pixel-thin structures. The "tensor cloud" depicted in Fig. 3 (top right, green samples) is thereby interpreted as a confidence interval to accelerate the voting process. Jia and Tang [4] have proposed a similar approach to structure interpolation. They use the segmentation mask both to extract dominant structures and to constrain structure interpolation. Furthermore, Jia and Tang arbitrarily restrict the interpolated structures to be non-intersecting, which does not meet the requirements encountered in complex pictures. Fig. 3 (top left) and Fig. 4 show a structure constellation that cannot be handled by the approach of Jia and Tang [4] but can be tackled by the proposed framework. Both drawbacks, i.e., the segmentation mask usage and the non-intersection constraint, yield an a priori restriction of the outcome of the structure interpolator to a limited sub-space of options.
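A minimal sketch of the iterative saliency thresholding used to estimate the connection likelihood of a structure pair is given below; the connectivity test (8-connected component labeling via SciPy) and the scaling of the saliency map to [0, 100] are assumptions of this sketch.

```python
import numpy as np
from scipy import ndimage

def connection_likelihood(saliency, end_a, end_b):
    """Estimate how likely two structure ends are connected (cf. Sec. 4.3).

    saliency: curve-saliency map from tensor voting, scaled to [0, 100]
    end_a, end_b: (row, col) pixel positions of the two structure ends
    The map is thresholded with decreasing thresholds (100 -> 0.5, step 0.5);
    the first threshold at which both ends fall into the same connected
    component is returned as the connection likelihood (0 if never connected).
    """
    for t in np.arange(100.0, 0.0, -0.5):
        mask = saliency >= t
        labels, _ = ndimage.label(mask)
        la, lb = labels[end_a], labels[end_b]
        if la != 0 and la == lb:
            return t
    return 0.0
```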
5 Filling Process

In this section, it is explained how the missing texture in Ω is reconstructed from the surrounding natural textures given the predicted structures in Ω. For that, a patch-based approach is presented. The mean squared error (MSE) is used as similarity criterion and is constrained by the segmentation mask updated with the interpolated structures. Structure interpolation by tensor voting as depicted in Fig. 3 (top left) yields a segmentation of Ω. Each resulting region in Ω is assigned the label of the corresponding adjacent region in P without Ω. If a region in Ω is adjacent to more than one known region, then the label of the largest region is assigned. The label assignments are stored in the segmentation mask. The filling process relies on the assumption that structure reconstruction is more important than texture reconstruction. Hence, structure filling is performed before texture filling. Given an interpolated structure, the filling process is initialized by sampling the structure equidistantly (cf. Fig. 3, bottom left) to obtain L samples within Ω. The distance between two samples is thereby given by half the patch size. The known structures outside Ω are also sampled to give M equidistant samples. The sampling procedure is repeated for all structures adjacent to Ω. Matching is now conducted based on the known boundary condition, i.e., from the boundary of Ω inwards (cf. Fig. 3, bottom right). The current structure sample within Ω is positioned at the center of the
considered patch within the unknown area Ω (cf. Fig. 3, bottom right). In order to fill in the unknown area of this patch, a continuation patch is determined by matching it against the set of patches centered around the samples on the known structures outside Ω (cf. Fig. 3, bottom left). Due to the selected distance between the samples, a 50% overlap of the patches is ensured. Note that the larger the overlap, the better the matching information but the slower the filling process. Once all structures are patched in Ω, texture filling is conducted. Filling is done in a helical manner starting from the boundary of Ω inwards. A 50% overlap is ensured here too. The search area outside Ω is restricted to locations that have the same segmentation label as the considered texture location within Ω. Both for structure and texture filling, the continuation patch is determined as the one minimizing an error criterion. The matched locations outside Ω encompass both structure and texture samples. The error criterion (1) is given as the weighted sum of a texture matching cost and a structure matching cost, where each cost term has a corresponding weight; both cost terms are normalized to the range [0, 1] (2). The texture matching cost
Fig. 2. Segmentation result for “Bungee”. Original picture (left), segmentation mask (right).
is MSE-based, where only the known samples are matched in the RGB color space (cf. Fig. 3, bottom right, hatched patch area). The structure matching cost is determined as the mean distance between the predicted structure in Ω and the given structure in the candidate patch. The distance of a sample to a set of samples is thereby defined as the shortest Euclidean distance between the former and the latter.
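A minimal sketch of the combined error criterion of Eqs. (1) and (2) follows; the function and parameter names are illustrative, since the paper's symbols are not reproduced in this copy, and the normalization of the structure term as well as the default weights are placeholders.

```python
import numpy as np

def matching_cost(patch_target, candidate, known_mask,
                  pred_structure_pts, cand_structure_pts,
                  w_tex=1.0, w_struct=1.0):
    """Weighted matching cost combining a texture and a structure term.

    The texture term is the MSE over the known RGB samples of the target patch;
    the structure term is the mean shortest distance from the predicted structure
    points to the candidate structure points.  Both terms are mapped to [0, 1].
    """
    diff = (patch_target[known_mask].astype(np.float64)
            - candidate[known_mask].astype(np.float64))
    c_tex = np.mean(diff ** 2) / (255.0 ** 2)          # MSE normalized to [0, 1]

    if len(pred_structure_pts) and len(cand_structure_pts):
        d = np.linalg.norm(pred_structure_pts[:, None, :]
                           - cand_structure_pts[None, :, :], axis=2)
        c_struct = d.min(axis=1).mean()                # mean shortest distance
        c_struct = c_struct / (c_struct + 1.0)         # crude [0, 1) normalization
    else:
        c_struct = 0.0

    return w_tex * c_tex + w_struct * c_struct
```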
Fig. 3. Structure interpolation and patching. Set of interpolated structures (top left); "likelihood map" (top right) with the most likely perceptual structure samples (green) and known structure samples (white); structure sampling for efficient matching (bottom left); matching process under consideration of boundary conditions (hatched patch area, bottom right).
Once a continuation patch has been determined, its content is to be inserted into the unknown area of the patch in Ω. The following rules thereby apply to achieve good patching results:
• structure patches should not be inserted into texture regions,
• source and destination texture locations should have the same segmentation label.
Fig. 4. Natural picture “Fence” with crossing structures. Original (left), interpolated structures (middle), filled (right).
The structure-aware inpainting method proposed in this paper is robust due to the above-mentioned constraints. However, depending on the properties of Ω, noticeable artifacts may still occur. Hence, a post-processing operation is required to ensure a seamless transition between the boundary condition (available, surrounding, natural texture) and Ω.
6 Post-processing Based on Cloning Methods

6.1 Theory

Seamless cloning consists in replacing a region Ω of a given picture by other content that typically stems from a different picture, such that subjective impairments are minimized. The cloning principle is depicted in Fig. 5. ∂Ω represents the boundary of Ω. f* is a known scalar function defined over P without Ω. g is the texture source function to be (partially) mapped onto Ω. Let f represent an unknown scalar function defined over Ω. In this paper, a seamless cloning method based on covariant derivatives [11] is used.
Fig. 5. Seamless cloning principle
Covariant derivatives correspond to a perceptual approach based on the observation that the perceived image content does not necessarily match the objective physical properties of the image. Georgiev [11] shows that a variable background can induce a subjective, or covariant, gradient in a constant foreground region. With covariant derivatives, the interpolated version of g can be determined by minimizing the following cost function

min_f ∫∫_Ω ‖ D₁ f − D₂ g ‖²   (3)

where D₁ f and D₂ g are called covariant derivatives [11]. D₁ and D₂ represent matrices that model adaptation properties of the human visual system: a difference in lightness differs from a difference in sample values [11]. Solutions of (3) also satisfy the Euler-Lagrange equation

Δ f = Δ g   (4)

with the Dirichlet boundary condition

f |_∂Ω = f* |_∂Ω   (5)

where

Δ(·) = ∇ · ∇(·)   (6)
represents the Laplacian operator [11].

6.2 Application of Seamless Cloning to Inpainting

In the inpainting framework, the example texture surrounding Ω is finite (cf. Fig. 2 and Fig. 6). Hence, the best match may still be a bad compromise. In fact, it may happen that no good fit is found for the current neighborhood. If such a bad match is inserted into Ω, then error propagation, also called garbage growing, will be the consequence. Hence, online cloning of each match to fit its boundary condition is proposed in this paper. Ideally, bad matches will be assigned the perceptually salient statistical properties of the boundary condition on ∂Ω. These properties will recursively be propagated towards the inner area of Ω. Good matches should not be perceptually deteriorated by the post-processing operation. In this work, covariant derivatives are used for cloning matches. The boundary condition must, however, be revisited. Fig. 6 depicts the modified boundary condition. The left image represents the first patch placement, while the right image corresponds to the nth placement. Note that ∂Ω depends on the iteration step here. The boundary condition used in the present framework is depicted by the red dotted line in both images. As can be seen in Fig. 6, two of the four sides of the continuation patch are always adjacent to known samples.
Fig. 6. Boundary condition definition for structure-aware inpainting
For that, all terms required in (5) are known. The two other sides always lie in an undefined area; hence f* is not given there. For a controlled and predictable filling process, it has been shown that modifying the Dirichlet boundary condition to

f |_∂Ω = g |_∂Ω   (7)

at these locations typically yields satisfactory results.
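A minimal sketch of the per-patch cloning step is given below, simplified to standard Poisson cloning (the covariant matrices of [11] replaced by the identity) and solved with Gauss-Seidel iterations; the boundary array is assumed to already hold f* or g according to Eqs. (5) and (7).

```python
import numpy as np

def clone_patch(boundary, g, omega, iters=500):
    """Solve Laplace(f) = Laplace(g) inside omega, Eq. (4), with Dirichlet values
    given by `boundary` on the pixels surrounding omega (f* where the destination
    picture is known, g where it is not, cf. Eqs. (5) and (7)).

    boundary, g: 2D float arrays of the same shape
    omega: boolean mask of the region to be filled (must not touch the image border)
    """
    f = np.where(omega, g, boundary).astype(np.float64)  # initialize with the source
    lap_g = (np.roll(g, 1, 0) + np.roll(g, -1, 0)
             + np.roll(g, 1, 1) + np.roll(g, -1, 1) - 4.0 * g)
    ys, xs = np.nonzero(omega)
    for _ in range(iters):
        for y, x in zip(ys, xs):                          # Gauss-Seidel sweep
            nb = f[y - 1, x] + f[y + 1, x] + f[y, x - 1] + f[y, x + 1]
            f[y, x] = (nb - lap_g[y, x]) / 4.0
    return f
```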
7 Experimental Results

Some experimental evaluations have been conducted to validate the proposed fully automatic inpainting method. For that, six natural pictures of different complexities have been used. Two pictures of the test set, "Bungee" and "Cricket", have already been used in previous publications (e.g., [6],[7]) and are meant to evaluate the performance of the proposed algorithm compared to the state of the art (cf. Fig. 7 and Fig. 8). The remaining pictures are shown to illustrate that the new algorithm can be successfully applied to various other data, including complex images that feature crossing structures (Fig. 9). Huang et al. [7] have shown that their approach yields better results than recent fully automated algorithms. Hence, in the following, the new method is compared solely to the approach by Huang and colleagues.

Table 1. Configuration of proposed inpainting algorithm

Parameter                             Setting        Semantics
Denoising                             1%             cf. Sec. 4.1
σ²                                    3 · ∆x∆y       cf. Sec. 4.2
Max. connections per structure end    2              cf. Sec. 4.3
∆                                     6-20 pixels    cf. Sec. 4.3
Search area (matching)                2              cf. Sec. 5
MSE weight                            1              cf. Sec. 5
Structure measure weight              2              cf. Sec. 5
The most important parameters of the proposed method are given in Table 1. Note that the term ∆x∆y corresponds to the area of the bounding box of the hole to be filled in; hence, σ² adapts to the size of the hole. A further parameter is the patch size, i.e., the number of columns in a patch. The results obtained for the "Bungee" test picture are depicted in Fig. 7. It can be seen that the proposed algorithm yields less blur, which is due to the optimized post-processing method employed here. Better structure fidelity can also be observed at the peak of the roof of the house. Similar observations can be made for the "Cricket" picture shown in Fig. 8 (removed cricket player at the left border of the picture). However, although less blur can also be seen at the location of the second cricket player (right part of the picture, cf. Fig. 8), visible artifacts can be noticed on the white object in the background. This is due to the fact that the interpolated contour generated by tensor voting is too smooth, such that the corner, which is accurately represented by Huang et al., cannot be represented faithfully. In Fig. 9 (right), it is shown that complex scenarios with crossing structures can be successfully managed by the new approach (cf. also Fig. 4). The recovered structures are perceptually plausible and fit seamlessly into the surrounding natural visual information. A further good result is shown in Fig. 10.
Fig. 7. Inpainting results for the “Bungee” test picture. Result by Huang et al. [7] (left), result by proposed algorithm (middle), structure interpolation (right).
Fig. 8. Inpainting results for the “Cricket” test picture. Result by Huang et al. [7] (left), result by proposed algorithm (middle), structure interpolation (right).
Fig. 9. Inpainting results for the “Skyscraper” test picture. Original picture (left), result by proposed algorithm (middle), structure interpolation (right).
Fig. 10. Inpainting results for the “Springbok” test picture. Original picture (left), result by proposed algorithm (middle), structure interpolation (right).
8 Conclusions

In this paper, we have described a research approach showing that, through the decomposition of natural images into structure and texture, fully automatic, structure-preserving inpainting can be robustly conducted. A new combination of structure interpolation and integrated (texture and structure) matching has been presented. It has further been shown that the adaptation of covariant derivatives to the inpainting framework yields a significant improvement of the perceived quality. In future implementations of the proposed method, further post-processing algorithms will be explored. Structure interpolation will be further enhanced to automatically modulate the smoothness of the predicted contour.
References
[1] Bertalmio, M., et al.: Image Inpainting. In: Proc. ACM SIGGRAPH, pp. 417–424 (2000)
[2] Ballester, C., et al.: Filling in by Joint Interpolation of Vector Fields and Gray Levels. IEEE Trans. Image Processing 10(8), 1200–1211 (2001)
[3] Levin, A., et al.: Learning How to Inpaint from Global Image Statistics. In: Proc. ICCV, vol. II, pp. 305–313 (2003)
[4] Jia, J., Tang, C.: Image Repairing: Robust Image Synthesis by Adaptive ND Tensor Voting. In: Proc. IEEE CVPR (2003)
[5] Medioni, G., Lee, M., Tang, C.: A Computational Framework for Segmentation and Grouping. Elsevier, New York (2000)
[6] Criminisi, A., Perez, P., Toyama, K.: Region Filling and Object Removal by Exemplar-based Image Inpainting. IEEE Trans. Image Processing 13(9), 1200–1212 (2004)
[7] Huang, T., et al.: Image Inpainting by Global Structure and Texture Propagation. In: Proc. ACM ICM, pp. 517–520 (2007)
[8] Perez, P.: Patchworks: Example-based Region Tiling for Image Editing. Technical Report MSR-TR-2004-04, Microsoft Research (2004)
[9] Sun, J., et al.: Image Completion with Structure Propagation. In: Proc. ACM SIGGRAPH, pp. 861–868 (2005)
[10] Marr, D.: Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co. (1982) ISBN 0-716-71567-8
[11] Georgiev, T.: Covariant Derivatives and Vision. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 56–69. Springer, Heidelberg (2006)
[12] Spann, M., Wilson, R.: A Quad-Tree Approach to Image Segmentation which Combines Statistical and Spatial Information. Pattern Recognition 18(3/4), 257–269 (1985)
[13] Freixenet, J., et al.: Yet Another Survey on Image Segmentation: Region and Boundary Information Integration. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 408–422. Springer, Heidelberg (2002)
[14] Ndjiki-Nya, P., Simo, G., Wiegand, T.: Evaluation of Color Image Segmentation Algorithms Based on Histogram Thresholding. In: Atzori, L., Giusto, D.D., Leonardi, R., Pereira, F. (eds.) VLBV 2005. LNCS, vol. 3893, pp. 214–222. Springer, Heidelberg (2006)
[15] Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
Multiple Aligned Characteristic Curves for Surface Fairing
Janick Martinez Esturo, Christian Rössl, and Holger Theisel
Visual Computing Group, University of Magdeburg, Germany

Abstract. Characteristic curves like isophotes, reflection lines and reflection circles are well-established concepts which have been used for automatic fairing of both parametric and piecewise linear surfaces. However, the result of the fairing strongly depends on the choice of a particular family of characteristic curves: isophotes or reflection lines may look perfect for a certain orientation of viewing and projection direction, but still have imperfections for other directions. Therefore, fairing methods are necessary which consider multiple families of characteristic curves. To achieve this, we first introduce a new way of controlling characteristic curves directly on the surface. Based on this, we introduce a fairing scheme which incorporates several families of characteristic curves simultaneously. We confirm the effectiveness of our method for a number of test data sets.
1 Introduction

Visualization of characteristic curves provides a valuable and important tool for first-order surface interrogation (see [1] for a recent survey). Inspection of characteristic surface curves allows for rating and improving surface design as well as for intuitive detection of surface defects: on the one hand, they simulate aesthetic appearance under certain lighting conditions and environments, while on the other hand the continuity and smoothness of these curves visualize the respective differential properties of the surface derivatives. Characteristic surface curves like reflection lines were originally used (and still are used) for the interrogation and design of physical models, and the concept is simulated for CAGD models in a virtual environment. Surprisingly, these curves are mainly used for interrogation, and only few approaches exist which apply them to surface fairing and design [2,3,4]. Yet, the proposed methods that take advantage of characteristic curves in this setting all have in common that they only consider a single curve family, i.e., a main direction represented by these curves. This results in an optimized behavior of the curves for this single direction, but, as we will show in this paper, single-direction fairing does in general not also yield optimized behavior of all other curve directions at the same location. In fact, our experiments indicate that the reverse is true. Section 4 presents a new fairing scheme for triangulated surfaces that is capable of incorporating an arbitrary number of families simultaneously. Prior to that, for efficient use of this scheme in practice, in Section 3 we develop intuitive methods for real-time curve control, i.e., determining parameters such that specific interpolation or alignment constraints on the surface are fulfilled. These methods allow for interactive and automatic curve specification.
Fig. 1. Definitions of characteristic curves and example for a family of isophotes on a wavy cylinder: (a) isophotes, (b) reflection lines, (c) reflection circles, (d) isophotes example.
1.1 Related Work

In this paper, we consider isophotes, reflection lines, and reflection circles. All these classes of characteristic curves are illumination curves, since every curve originates from light-surface interaction [5]. Isophotes can be regarded as surface curves of constant incident light intensity and have been used extensively to detect surface imperfections [6,5,1]. The reflection of a straight line on a surface is called a reflection line. Just as isophotes, reflection lines possess special properties making them valuable for surface interrogation and surface fairing applications on parametric [2,3,7] and piecewise linear surfaces [4]. Recently, [4] applied reflection lines to fairing triangular meshes employing a screen-space surface parametrization. This work provides a profound analysis of the arising numerical minimization and a careful discretization of the emerging differential operators [8]. It is the prior work most similar to ours, yet differs in focus. Reflection circles arise from the reflections of concentric circles on a surface, similar to reflection lines. Although reflection circles are the more general class of surface curves [9], they have not been used as thoroughly as the other, more specialized classes in surface fairing applications. Still, recently [10] argued that a simplified version of reflection circles called circular highlight lines also performs well in surface fairing applications. There is vast literature on general surface denoising and fairing methods as well as fair surface design based on polygonal meshes, which we do not consider here but instead refer to a recent survey [11]. Similarly, we do not discuss alternative uses of light lines such as surface reconstruction applications (see, e.g., [12]).
2 Characteristic Curves

We use definitions of characteristic curves (isophotes, reflection lines, and reflection circles; see Fig. 1) which depend only on the normal directions of the surface and not on its position [9]. This means we assume that both the viewer and the light sources (which are lines and circles) are located at infinity. This is a common simplification for various kinds of environment mapping. In the following, e denotes the normalized eye vector (viewing direction), and n(u, v) is the unit normal to the surface x(u, v).
Isophotes are surface curves of constant incident light intensity, essentially taking into account Lambert's cosine law or diffuse lighting. Given an eye direction e and an angle α, an isophote consists of all surface points x(u, v) satisfying e · n(u, v) = cos α. Variation of the angle α yields a family of isophotes. Reflection lines are surface curves showing the mirror image of a line-shaped light source. Given an eye direction e and a line at infinity defined by its unit normal p, a reflection line consists of all surface points satisfying a · p = 0 with a = 2 (e · n) n − e. Variation of p along a line at infinity yields a family of reflection lines. Reflection circles [9] provide a generalization of isophotes and reflection lines. They can be considered as mirror images of a family of concentric circles on the surface. Given e and a circle at infinity defined by a normalized center direction r and an angle α, a reflection circle consists of all surface points satisfying a · r = cos α with a = 2 (e · n) n − e. This can easily be transformed to the condition (e · n) (r · n) = v, where v = ½ (cos α + e · r). Reflection circles provide generalizations of the other classes of characteristic curves in the sense that for r = e or r = −e they are equivalent to isophotes, whereas for r · e = 2v they are equivalent to reflection lines, respectively. Families of reflection circles are obtained by either variation of v within the range [−1, 1], or variation of a, or simultaneous variation of both parameters, respectively. In the following we will consider only the first option of varying the scalar parameter v.
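As an illustration of the definitions above, the following minimal sketch evaluates the corresponding per-vertex scalar fields; contouring these fields at a set of values yields the curve families. The (N, 3) array of unit vertex normals is an assumption of the sketch.

```python
import numpy as np

def isophote_field(normals, e):
    """Per-vertex scalar field whose iso-contours (value cos(alpha)) are isophotes."""
    return normals @ e

def reflection_circle_field(normals, e, r):
    """Per-vertex scalar field whose iso-contours (value v) are reflection circles,
    using the condition (e . n)(r . n) = v derived above."""
    return (normals @ e) * (normals @ r)

# Example: a family of isophotes is obtained by contouring isophote_field(normals, e)
# at several values of cos(alpha); reflection circles by contouring the second field
# at several values of v.
```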
3 Characteristic Curve Control

For virtual surface interrogation, e.g., using reflection lines, a simple environment map is sufficient to show families of reflection lines while the user moves the geometric object under inspection. This is simple and intuitive. However, in our setting of surface fairing, we require the specification of certain characteristic curves: for a region in focus the user wants to specify curves quickly and intuitively such that they are roughly aligned with a prescribed direction. This setting provides a problem of its own because the defining parameters of the curves do not directly relate to the resulting pathway on the surface. Moreover, plain parameter variation often yields counterintuitive and unexpected results. In this section we show how to facilitate control of characteristic curves on surfaces in order to enable their intuitive use in practice. The basic idea of every presented alignment method is to let a user or a (semi-)automatic operation specify a small number of points on the surface which a curve or family of curves shall pass through. Such points will be called selections. Then the parameters of the curves are calculated by different alignment methods from the surface normals in a way that the respective defining conditions are satisfied. Alignment methods differ in the number of required point selections and in their semantics in relation to the curve class. As every alignment method only depends on a small, constant number of points and specifically on the surface normals at these points, they are independent of the complexity of the surface the controlled curves are embedded in. In practice, the user selects points by ray intersection with the surface, and selections can be dragged on the surface to fine-tune a curve alignment in real time. In a similar way, a stencil of selection prototypes can be projected onto the surface for automatic curve control.
Fig. 2. Isophote alignment methods: (a) (ISOP, 3x), (b) (ISOP, 2x + 2x).
Fig. 3. Alignment examples using (ISOP, 3x) and (RECI, 2x + constr)
3.1 Alignment of Isophotes

It turns out that three selections on the surface are sufficient to define a general isophote passing through these points. We give a closed-form expression which yields the parameters e and cos α. Let x1, x2 and x3 be three selections on a smooth surface and n1, n2, and n3 the respective unit surface normals. Then

e = ((n1 × n2) + (n2 × n3) + (n3 × n1)) / ‖(n1 × n2) + (n2 × n3) + (n3 × n1)‖  and  cos α = e · n1 = e · n2 = e · n3   (1)

are the parameters defining an isophote (e, cos α) interpolating x1, x2 and x3. In the remainder we refer to this alignment method as (ISOP, 3x), indicating that an isophote is aligned using three selections. Figure 2a illustrates the configuration.

The alignment of a family of isophotes is achieved using two pairs of selections: let (x1, x2) and (x3, x4) be two pairs of selections on a smooth surface and (n1, n2) and (n3, n4) the respective unit surface normals. Then

e = ((n1 − n2) × (n3 − n4)) / ‖(n1 − n2) × (n3 − n4)‖,  cos α1 = e · n1 = e · n2,  cos α2 = e · n3 = e · n4   (2)

are the parameters defining two isophotes (e, cos α1) and (e, cos α2) of the same family passing through the points x1, x2 and x3, x4, respectively. We call this method (ISOP, 2x + 2x) because two isophotes of the same family are aligned requiring two selections for each curve. Two isophotes of the same family aligned using this method are depicted in Figure 2b.

3.2 Constraint Alignment of Reflection Lines and Circles

Due to their relative simplicity, isophotes constitute a special case for which closed-form solutions to the general alignment can be given. In contrast, the general alignment of reflection lines and circles requires root finding of higher-order polynomials to determine the parameters. Hence, no general closed-form expressions can be given. Instead, we present a constraint approach.
The constraint approach requires only two selections in order to align both reflection lines and reflection circles on smooth surfaces. Reflection circles are a generalization of reflection lines: setting the cos α parameter to zero in fact specifies a reflection line, with one lost degree of freedom which can be taken advantage of afterwards. We restrict the derivation of the alignment method to reflection lines in the first place and make it applicable to both curve classes by variation of the extra parameter. Let x1, x2 be two selections on a smooth surface and n1, n2 their linearly independent unit normals. Then

e = (n1 + n2) / ‖n1 + n2‖  and  r = p = (n1 × n2) / ‖n1 × n2‖   (3)

are the parameters defining a reflection line (e, p) as well as a reflection circle (e, r, cos α = 0) passing through the points x1, x2. These alignment methods are referred to as (REFL, 2x + constr) and (RECI, 2x + constr), respectively. We call the approach constraint as the parameter vectors, the eye vector e and the normal p of the line at infinity, are restricted to be perpendicular, so e · p = 0. Geometrically this means that the eye point at infinity is constrained to the respective line at infinity.
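A minimal sketch of the two closed-form alignments, Eq. (1) and Eq. (3), is given below; degenerate selections (e.g., parallel normals) are not handled.

```python
import numpy as np

def align_isophote_3x(n1, n2, n3):
    """(ISOP, 3x): eye direction e and cos(alpha) from three unit normals, Eq. (1)."""
    e = np.cross(n1, n2) + np.cross(n2, n3) + np.cross(n3, n1)
    e = e / np.linalg.norm(e)
    return e, float(e @ n1)

def align_constraint_2x(n1, n2):
    """(REFL/RECI, 2x + constr): parameters from two linearly independent unit
    normals, Eq. (3); returns (e, p) with e . p = 0."""
    e = (n1 + n2) / np.linalg.norm(n1 + n2)
    p = np.cross(n1, n2)
    p = p / np.linalg.norm(p)
    return e, p
```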
4 Surface Fairing

We define the goal of our surface fairing method as follows: a smooth surface should be altered by minimal local displacements such that the pathways of characteristic curves are straightened and homogenized. Therefore, our aim is to penalize the curvature of characteristic curves. Let C define a set of discrete families of curves f, e.g., specified by a finite set of angles. We consider piecewise linear surfaces M, i.e., triangle meshes defined by (V, E, F), the sets of vertices, oriented edges, and faces, respectively. Then we define discrete error functionals as

E(V) = Σ_{f ∈ C} E_f(V)  with  E_f(V) = Σ_{v ∈ V} κ_f²(v),   (4)

where v denotes the position of vertex v ∈ V and κ_f(v) is the curve curvature of the member of family f at v. We call E_f the family error and E the accumulated error, respectively. Minimizing E(V) by altering vertex positions yields an optimized, fair surface M′.

Curvature of Characteristic Curves. Each family of characteristic curves defines a piecewise linear scalar field over the surface, i.e., the defining equations are evaluated at every vertex. Then the members of the family are given implicitly as iso-curves w.r.t. a certain isovalue. We approximate the curvature of such characteristic iso-curves per vertex v ∈ V as follows. We find the intersections of the iso-contour with value c(v) = c_v with the edges (i, j) ∈ E_v¹ bounding the 1-ring neighborhood N_v of v by linear interpolation between c(v_i) = c_{v_i} and c(v_j) = c_{v_j}. From v and the positions of the two intersections, the curvature κ_f(v) is given as the inverse radius of the interpolating circle. If the intersections are approximately collinear, i.e., the circle degenerates to a line, we assume zero curvature.
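A minimal sketch of this curvature estimate follows: the inverse circumradius of the circle through the vertex and the two edge intersections, with a fallback to zero curvature for (nearly) collinear points.

```python
import numpy as np

def iso_curve_curvature(p, q1, q2, eps=1e-12):
    """Discrete curvature of a characteristic iso-curve at vertex position p,
    given the two intersection points q1, q2 of the iso-contour with the
    1-ring edges (all 3D points)."""
    a = np.linalg.norm(q1 - p)
    b = np.linalg.norm(q2 - p)
    c = np.linalg.norm(q2 - q1)
    area = 0.5 * np.linalg.norm(np.cross(q1 - p, q2 - p))
    if a * b * c < eps or 2.0 * area < eps * a * b:
        return 0.0                        # degenerate or nearly collinear: straight line
    return 4.0 * area / (a * b * c)       # kappa = 1 / circumradius
```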
Local Optimization by Vertex Displacement. In order to minimize the accumulated error E, we iteratively displace vertices along their normal directions. We analyze the local setting for a vertex v and its neighborhood. The surface normal n(v) := n_v is approximated as the average of area-weighted triangle normals, i.e.,

ñ_v = Σ_{(i,j) ∈ E_v¹} (v_i − v) × (v_j − v)  and  n_v = ñ_v / ‖ñ_v‖.

Note that E_v¹ includes all directed (counter-clockwise oriented) edges bounding the 1-ring of v. For simplicity we use an area weighting scheme here; however, applying more sophisticated normal approximation methods (see, e.g., the survey [1]) yields similar formulas. Displacing a vertex as v′ = v + ε n_v for a small scalar ε entails recomputation of vertex normals only within the 1-ring of v. It is easy to see that curvatures, however, are affected within the 2-neighborhood N_v², and the curvature variation is therefore locally bounded. Consequently, scalar values within the 3-neighborhood N_v³ of v have to be considered for computing the global variation of the error induced by the displacement. We derive the following expression

ñ_v(ε) = Σ_{(i,j) ∈ E_v¹} [ v_i × v_j + ε ( δ_{ik} (n_{v_k} × v_j) − δ_{jk} (n_{v_k} × v_i) ) ]   (5)
as the updated unnormalized normal direction of vertex v after displacement of vertex v_k by ε n_{v_k}. The normal of the displaced vertex itself remains constant. For fairing, a vertex is iteratively translated in several ε-steps as long as a single displacement reduces the global error.

Mesh Fairing. We use the analysis of the local setting to globally minimize the accumulated error E(V) for all vertices (or for those within a region of interest, respectively). We take a randomized and serialized approach which iterates the following steps:
1. Randomly pick a vertex v ∈ V.
2. Take a binary decision whether a translation of v in direction n_v or −n_v makes E decrease; otherwise restart at step 1.
3. Find the most effective displacement of v by integrating ε-steps as long as the global error reduction is of significant magnitude. The step size is adapted during the integration by logarithmic attenuation depending on the error reduction rate.
We terminate the global iteration if no more enhancement can be achieved over a specific number of iterations. Our experiments show that E(V) is effectively reduced at reasonable computational cost (see Section 5). We remark that in every local optimization step the curvature of the characteristic curves is reduced not only at vertex v but also in its 2-neighborhood due to the overlap of the respective curvature stencils. Hence, for optimization within a region of interest, the boundary region is automatically processed such that a smooth transition of the optimized curves across the boundary is ensured.
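A schematic sketch of the randomized greedy loop follows; the normal and error evaluations are passed in as callables, and the logarithmic step-size attenuation as well as the termination criterion are simplified.

```python
import numpy as np

def fair_mesh(V, vertex_normal, accumulated_error, eps=1e-4, max_rounds=2000):
    """Randomized greedy fairing of vertex positions V (an (N, 3) array).

    vertex_normal(V, v) and accumulated_error(V) are assumed helpers that
    evaluate n_v and E(V) for the current positions.
    """
    rng = np.random.default_rng(0)
    for _ in range(max_rounds):
        v = rng.integers(len(V))                 # 1. random vertex
        n = vertex_normal(V, v)
        best = accumulated_error(V)
        for sign in (+1.0, -1.0):                # 2. try both normal directions
            step = sign * eps
            improved = True
            while improved:                      # 3. integrate eps-steps greedily
                V[v] += step * n
                e = accumulated_error(V)
                if e < best:
                    best = e
                else:
                    V[v] -= step * n             # undo the last, non-improving step
                    improved = False
    return V
```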
5 Results Curve control. The proposed alignment methods highly facilitate the interrogation of surfaces by characteristic surface curves. Figure 3 gives two examples (see also supplemental video). We found (ISOP, 3x) and (RECI, 2x + constr) especially useful for
Fig. 4. Reflection circles on a Chevrolet Corvette C4 engine hood before (left) and after (right column) fairing
Fig. 5. Fairing of BMW Z3 model hood: Initial three families (first row), only first family faired (second row) and all families faired (third row)
automatic surface fairing applications as they require a low number of selections, are computable in constant real-time and yield stable alignment results. Single family fairing. To begin with, a single family fairing of a Chevrolet Corvette C41 engine hood by reflection circles, which were aligned by (RECI, 2x+constr), is shown in figure 4. Within 2000 iterations the accumulated error dropped by 75.91% for about 500 optimized vertices; the processing time was 40s. All timings were measured on a 2.2GHz AMD Opteron processor. Multiple family fairing. One of our main goals is to show that multiple families of characteristic curves should be considered simultaneously. So far, only a single family had been used in prior work. We demonstrate that the latter generally yields improvement of this single family only, while other families may improve or not — or may even get worse in appearance. Figure 5 shows an example, where three differently aligned families (using (ISOP, 3x)) of isophotes are shown on the initial model of a BMW Z3 engine hood in the top row. The second row shows families resulting from solely fairing the left family: the two other families did not improve in the same way. This is because fairing of a surface by a single curve family does not necessarily improve the overall reflective properties of a surface. A subsequent example shows that the contrary can also be the case. Incorporating all three families into the fairing process gives better overall results, see bottom row. The total processing time for multiple families depends linearly on their number. The accumulated error of all families dropped by 32.5% after fairing the single family using 2000 iterations, however, it dropped by 63.7% fairing all 1
¹ Corvette and BMW models from www.dmi3d.com
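The core change when moving from single-family to multiple-family fairing is the objective: instead of the error of one curve family, the accumulated error of all selected families is minimized, and its evaluation cost grows linearly with the number of families, consistent with the timings reported above. A minimal sketch of such a combined objective, with hypothetical names and a placeholder per-family error, is:

def family_error(curves):
    # Placeholder per-family error: sum of squared curvature samples along the
    # family's characteristic curves (the actual curvature stencil is defined
    # earlier in the paper).
    return sum(k * k for curve in curves for k in curve)

def total_error(families):
    # Combined objective for simultaneous multi-family fairing: the sum of the
    # per-family errors.  Evaluation cost is linear in the number of families.
    return sum(family_error(curves) for curves in families)

Plugging total_error into the greedy loop sketched in Section 4 corresponds to fairing all families simultaneously, while passing a single family's error reproduces the single-family setting.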
[Figure 6: plots of the family error E_f (vertical axis, up to 3 × 10^4) versus iterations (horizontal axis, 0–4000) for the Horizontal, Vertical, Skew 1, and Skew 2 families, together with the corresponding curve renderings per family.]
Fig. 6. Optimization of the Volkswagen Beetle car roof. Bold family names in the error plots correspond to optimized families. The top picture row shows the initial curve families with a plot of their respective curve curvatures. Fairing only the horizontally oriented family corrupts the vertically aligned family (left plot, second row). Simultaneous fairing of both families enhances the appearance of the horizontal family while preserving the quality of the vertical family (center plot, third row). Even better results are achieved by also considering the two skew families (right plot, last row).
In figure 6 we analyze the fairing of the car roof of a Volkswagen Beetle using multiple families of reflection circles, which were automatically aligned by (RECI, 2x + constr) to be uniformly distributed. This example involves several different families and illustrates two facts: first, the benefit of simultaneous optimization of multiple families, and second, the potential corruption of families if only a single other family is considered. In this example the vertical family already shows good quality, which degrades when only the horizontal family is optimized. In addition, the behavior of two skew families is shown; in the final experiment they are also included in the optimization. The error is plotted versus the number of iterations for all settings; the accumulated error dropped by 32.26%, 41.32%, and 58.28%, respectively. Moreover, we found that no other intermediate family direction showed imperfect behavior on the surface faired this way.

Curve class comparison. Both isophotes and reflection circles can be faired by our generic approach. In our experiments, cross validation showed comparable performance for both curve classes. We could not confirm the proposition in [10] that circular highlight lines are better suited for surface fairing than highlight lines, as all directions are captured.
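The approximation-error check reported for the hood example above (the average per-vertex displacement after scaling the model to the unit sphere) is straightforward to reproduce. A minimal sketch, assuming both meshes share vertex indexing and using hypothetical NumPy arrays, is:

import numpy as np

def average_displacement(original, faired):
    # Mean per-vertex displacement between two meshes with identical
    # connectivity, measured after normalizing the original model to the
    # unit sphere.
    original = np.asarray(original, dtype=float)
    faired = np.asarray(faired, dtype=float)
    center = original.mean(axis=0)
    radius = np.linalg.norm(original - center, axis=1).max()
    return (np.linalg.norm(faired - original, axis=1) / radius).mean()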
6 Discussion

Our results support our claim that the simultaneous consideration of multiple curve families is advantageous for surface fairing. Furthermore, we provide new methods for controlling characteristic curves, which have not been applied in any previous approach. The prior work most similar to our method is [4], which considers reflection lines for shape optimization on triangle meshes. It concentrates on optimizing a single family of reflection lines; the family is provided by the user, and control of curve parameters is not discussed. Its emphasis on discretization and efficient numerical minimization, using a screen-space parametrization and other approximations, yields a real-time algorithm with some view-dependent limitations. In contrast, our focus was on the new aspects summarized above. Our optimization method uses a far simpler randomized greedy surface optimization, which converges to local minima and is far from real-time performance. It would be interesting future work to see whether our method could be combined with the minimization framework of [4].
7 Conclusions

In this paper we make the following contributions:

• We showed that fairing a particular family of characteristic surface curves (such as isophotes, reflection lines, or reflection circles) does not necessarily yield a fairer surface in the sense that other families of surface curves become fairer as well.
• We introduced a number of techniques to align characteristic curves on surfaces by directly placing and interactively moving points on the surface instead of specifying viewing and projection parameters.
• Based on this, we presented an approach for the simultaneous fairing of multiple families of characteristic surface curves, which gives better results than single-family fairing.

The following issues remain open for future research:

• Although the whole fairing process can be considered a preprocess that is carried out only once, the performance of the algorithm could be improved.
• We have no general answer to the question of how many families should be faired simultaneously to obtain optimal results. Clearly, increasing the number of families improves the results but also increases the computing time linearly. In all our examples, four families were sufficient to ensure the fairness of all families; however, we do not yet have a theoretical confirmation of this observation.
References

1. Hahmann, S., Belyaev, A., Busé, L., Elber, G., Mourrain, B., Rössl, C.: Shape interrogation. In: Mathematics and Visualization, vol. 1, pp. 1–52. Springer, Berlin (2008)
2. Klass, R.: Correction of local surface irregularities using reflection lines. Computer-Aided Design 12, 73–77 (1980)
3. Kaufmann, E., Klass, R.: Smoothing surfaces using reflection lines for families of splines. Computer-Aided Design 20, 312–316 (1988)
4. Tosun, E., Gingold, Y., Reisman, J., Zorin, D.: Shape optimization using reflection lines. In: Symposium on Geometry Processing, pp. 193–202 (2007)
5. Hagen, H., Schreiber, T., Gschwind, E.: Methods for surface interrogation. In: IEEE Visualization, pp. 187–193 (1990)
6. Poeschl, T.: Detecting surface irregularities using isophotes. Computer Aided Geometric Design 1, 163–168 (1984)
7. Theisel, H., Farin, G.: The curvature of characteristic curves on surfaces. IEEE Computer Graphics and Applications 17, 88–96 (1997)
8. Grinspun, E., Gingold, Y., Reisman, J., Zorin, D.: Computing discrete shape operators on general meshes. Computer Graphics Forum 25, 547–556 (2006)
9. Theisel, H.: Are isophotes and reflection lines the same? Computer Aided Geometric Design 18, 711–722 (2001)
10. Nishiyama, Y., Nishimura, Y., Sasaki, T., Maekawa, T.: Surface fairing using circular highlight lines. Computer-Aided Design and Applications 7, 405–414 (2007)
11. Botsch, M., Pauly, M., Kobbelt, L., Alliez, P., Lévy, B., Bischoff, S., Rössl, C.: Geometric modeling based on polygonal meshes. In: SIGGRAPH Course Notes (2007)
12. Halstead, M.A., Barsky, B.A., Klein, S.A., Mandell, R.B.: Reconstructing curved surfaces from specular reflection patterns using spline surface fitting of normals. In: SIGGRAPH, pp. 335–342 (1996)
Author Index
Aarle, W. van I-700 Ababsa, Fakhreddine I-498 Adluru, Nagesh II-561 Agarwal, Ankit II-430 Agu, Emmanuel I-624 Aguilar, Wendy I-195 Ahmadov, Farid II-307 Ahmed, Abdelrehim H. I-793 Ahmed, Sohail I-540 Ahn, Seongjin II-813 Alba, Alfonso I-1040 Ali, Asem M. I-258 Allusse, Yannick II-430 Alonso, Mar´ıa C. II-85 Alvarado-Gonz´ alez, Alicia Montserrat I-195 Amayeh, Gholamreza II-541 Amayeh, Soheil II-541 Ambrosch, Kristian I-216 Amburn, P. I-867 Anderson, Dustin I-604 ´ Andres, Eric I-925 Anees, Asim II-879 Anton, L. II-602 Ara´ ujo, H. II-460 Arce-Santana, Edgar I-1040 Archibald, James I-460, II-934 Asai, Toshihiro II-897 Ayed, Ismail Ben I-268, II-181 Babbs, Justin I-418 Bahrami, Shahram II-267 Bai, Xiang II-561 Bais, Abdul II-879 Bal´ azs, P´eter II-1147 Barbosa, Jorge G. I-572 Barneva, Reneta P. I-669, I-678 Batenburg, K.J. I-700 Bawden, Gerald W. I-846 Bebis, George I-450 Beck, Fabian I-151 Begley, Se´ an II-692 Belaroussi, Rachid II-703, II-843 Bennamoun, Mohammed II-336
Bergeron, Vincent I-45 Bergevin, Robert I-45 Bergsneider, Marvin I-370 Bernardis, Elena I-393 Bernshausen, Jens II-307 Bertolini, Marina II-1125 Besana, GianMario II-1125 Billet, Eric I-594 Bilodeau, Guillaume-Alexandre I-1081 Bogen, Manfred I-478 Bohrer, Gil I-856 Boluda, Jose A. I-205 Bon´e, Romuald I-288 Bott, Felix I-151 Boulanger, P. II-1 Bouwmans, Thierry I-772 Boyer, Vincent I-1125 Brady, Rachael I-856 Brakke, Kenneth A. II-959 Breen, David E. II-959 Brimkov, Boris I-678 Brimkov, Valentin E. I-669, I-678 Bu, Wenyu I-540 Buckles, Bill P. II-246 Bueno, G. II-602 Bulka, Andy I-129 Burch, Michael I-151 Butt, M. Asif A. II-521 Cabestaing, Fran¸cois II-267 Cao, Jie II-450 Cardot, Hubert I-288 Carmant, Lionel I-1081 Carvalho, Paulo Cezar I-953 Castrillon, M. II-602 Cavalcante-Neto, Joaquim B. I-488 Chan, Ming-Yuen I-161, II-12 Chang, I-Cheng II-833 Chang, Yuchou I-460, II-934 Chantler, Mike I-743 Chapuis, Roland I-468 Chausse, Frederic I-468 Chen, Chao-I I-836 Chen, Chi-Fan II-802
Chen, Jing II-782 Chen, Jingying II-236 Chen, Mei-Juan II-551 Chen, Tsuhan II-317 Chen, Xiaolin II-420 Chen, Yisong I-815 Cheng, Fuhua (Frank) II-1034 Cheng, Irene I-1051 Cheng, Samuel II-905 Cheung, Benny II-915 Chiang, Yu-Chong I-171 Chiu, Alex II-356 Cho, Sang-Hyun II-980 Choi, Jongmoo II-924 Choi, Jung-Gil II-980 Chong, Lance I-891 Chrisochoides, Nikos I-594 Chu, Chee-Hung Henry II-644 Chung, Jin Wook II-813 Chung, Ka-Kei I-161 Chung, Kai Lun II-990 Chung, Yun-Su II-823 Cifuentes, Patricia II-1186 Coeurjolly, David II-792 Czech, Wojciech W. II-1011 Czedik-Eysenberg, H. II-368 d’Angelo, David I-478 Damarla, Thyagaraju II-246 Damiand, Guillaume II-792 Das, Dipankar II-133 Davis, Larry S. I-23 Debled-Rennesson, Isabelle I-688 De Floriani, Leila II-1000 Delaunoy, O. II-257 Deniz, O. II-602 Desbarats, P. II-1176 Didier, Jean-Yves I-498 Diehl, Stephan I-151 Djeraba, Chabane II-470 Doermann, David I-248 Donner, Ren´e I-338 Doshkov, Dimitar I-1144 Du, Shengzhi II-624 Duan, Ye I-552, II-95, II-743 Dubrofsky, Elan II-202 Ducrocq, Yann II-267 Duizer, P.T. I-428 Duvieubourg, Luc II-267 Dwyer, Tim II-22
Dyer, Charles I-614 Dyer, J. I-867 Ehlers, Manfred II-75 El Baf, Fida I-772 Elhabian, Shireen Y. I-793 Elibol, A. II-257 Eliuk, S. II-1 Eng, How-Lung I-406 Ernst, Ines I-228 Escolano, F. II-170 Fahmi, Rachid II-287 Falc˜ ao, Alexandre X. I-935 Fan, Gen-Hau II-942 Fan, Jianhua I-87 Fan, Jianping II-380 Fan, Jing II-1076 Fang, Rui I-381 Farag, Aly A. I-258, I-793, II-287 Farag, Amal A. I-793 Farooq, Muhammad II-879 Fedorov, Andriy I-594 Fernandes, Armando Manuel I-1, II-65 Fernandes, Jo˜ ao L. II-501 Ferri, Francesc J. II-592 Figueroa, Pablo A. II-1106 Fitzpatrick, P.J. I-867 Foursa, Maxim I-478 Franchi, Danilo II-612, II-1115 Frank-Bolton, Pablo I-195 Frauel, Yann I-195 Galante, Angelo II-1115 Galasso, Fabio I-803 Gallo, Pasquale II-612 Gao, Shan-shan II-672, II-1096 Garcia, R. II-257 Gattass, Marcelo I-520, I-953 Gaur, Utkarsh II-949 Gelder, Tim van I-129 Gennert, Michael A. I-634 Gerig, Guido I-562, I-594 Ghobadi, Seyed Eghbal II-307 Gilles, J´erˆ ome II-661 Godil, Afzal I-349, I-381, II-915 Goel, Sanjay II-949 Goldberg, Yair II-43 G´ omez-Pulido, Juan A. I-743 Gong, Minglun II-390
Author Index Gonzalez Castro, Gabriela II-723 Gonz´ alez-Matesanz, Francisco J. II-1186 Govindan, V.K. II-571 Gracias, N. II-257 Graf, Holger II-860 Grahn, H˚ akan I-1102 Greensky, James B.S.G. II-1011 Griesser, Rita I-1008 Gu, Songxiang I-634 Gu, Xianfeng I-965, II-743 Gu, Xianfeng David I-720, I-891 Guo, Li II-420 Guturu, Parthasarathy II-246 Hall, Peter M. I-318 Han, Dong I-913 Han, JungHyun I-1135 Hanbury, Allan I-338 Hansen, D.M. I-428 Hariharan, Srivats I-540 Hartmann, Klaus II-307 He, Qiang II-644 He, Qing I-552, II-743 He, Zhoucan I-328 Heckman, John I-998 Hempe, Nico II-1086 Hern´ andez, Jos´e Tiberio I-1018 Hincapi´e Ossa, Diego A. I-1018 Hiromoto, Robert E. II-1044 Hirschm¨ uller, Heiko I-228 Hofhauser, Andreas I-35 Hong, Yi I-460, II-934 Horain, Patrick II-430 Hotta, Kazuhiro II-278 Hsu, Ching-Ting II-551 Hsu, Gee-Sern II-317 Hu, Po II-450 Hu, Xiao I-370 Huang, Zhangjin I-815 Huber-M¨ ork, R. II-368 Huey, Ang Miin II-1137 Humenberger, Martin I-216 Hung, Yi-Ping II-317 Hurst, Nathan II-22 Hussain, Muhammad I-119 Hussain, Sajid I-1102 Ihaddadene, Nacim Ip, Horace I-815
II-470
Jagadish, Krishnaprasad I-359 Jain, Amrita II-949 Jiang, Xudong I-406 Jin, Miao I-720, I-965 Johan, Henry II-440 Johansson, Carina B. I-1071 Jung, Sung-Uk II-823 Kabin, K. II-1 Kalyani, T. I-945, II-1167 Kameyama, M. Charley II-1011 Kampel, Martin I-11 Karsch, Kevin I-552, II-743 Kaufman, Arie I-891 Kawamura, Takuma II-970 Keim, Daniel A. II-380 Keitler, Peter II-224 Kelley, Richard I-450 Kellogg, Louise H. I-846 Kerautret, B. I-710, II-1176 Khan, Muhammad U.K. II-879 Khan, Shoab A. II-521 Khattak, Tahir Jamil II-879 Khawaja, Yahya M. II-879 Kikinis, Ron I-594 Kim, Hakran II-1066 Kim, Jong Myon I-1092 Kim, Junho I-965 Kim, Myoung-Hee II-980 King, Christopher I-450 King, Michael A. I-634 Klinker, Gudrun II-55, II-224 Knox, Michael R. II-1011 Kobayashi, Yoshinori II-133 Kohlhammer, Joern II-31 Kojima, Yoshiko II-897 Konieczny, Jonathan I-998 Koppel, Dan I-836 K¨ oppel, Martin I-1144 Kostliv´ a, Jana I-107 Koyamada, Koji II-970 Kraftchick, Jerrod P. I-1061 Kreylos, Oliver I-846, I-901 Krish, Karthik II-157 Krithivasan, Kamala II-1137 Kubinger, Wilfried I-216 Kulikowski, Casimir A. I-278 Kuno, Yoshinori II-133, II-851 Kuo, C.-C. Jay I-65 Kuo, Paul II-214
Lachaud, J.-O. I-710, II-1176 Lai, Shuhua II-1034 Lai, Yu-Chi I-614 Lakaemper, Rolf II-145, II-682 Langlois, J.M. Pierre I-1081 Largeteau-Skapin, Ga¨elle I-925 Lasenby, Joan I-75, I-803 Latecki, Longin Jan II-192, II-561 Le, Thang V. I-278 Lee, Chao-Hua I-75 Lee, Dah-Jye I-240, I-460, II-400, II-934 Lee, Hwee Kuan I-540, I-582 Lee, Jae-Kyu II-813 Lee, Jimmy Addison II-346 Lee, K.C. I-248 Lee, Ping-Han II-317 Lema, Pablo I-1081 Lensing, Paul Hermann I-658 Levesque, Maxime I-1081 Levy, Bruno I-139 Li, Baoxin I-825, II-634 Li, Bo II-440 Li, Chao I-782 Li, Hua II-733, II-762 Li, Huan I-782 Li, Huiping I-248 Li, Xiaolan I-349, I-381, II-915 Li, Xin II-682 Li, Zhidong II-782 Liang, Chia-Hsin II-802 Lietsch, Stefan I-658 Lillywhite, Kirt D. I-240 Lin, Shih-Yao II-833 Lin, Zhe I-23 Lindblad, Joakim I-1071 Lindsay, Clifford I-624, I-634 Liu, Feng I-614 Liu, Lipin I-248 Liu, Xu I-248 Loaiza, Manuel I-520 Loepprich, Omar Edmond II-307 Loffeld, Otmar II-307 Longo, Marcos I-856 L´ opez, Adrian A. II-105 Lorenzo, J. II-602 Lu, ChengEn II-192 Lu, Yue II-95 Luo, Feng I-720, I-965 Luo, Hangzai II-380 Luong, Huynh Van I-1092
Ma, Yunqian II-581 Majumdar, Angshul II-297 Mak, Wai-Ho I-161, II-12 Makris, Dimitrios II-214 Mallem, Malik I-498 Mallon, John II-692 Malpica, Jos´e A. II-85, II-105, II-1186 Mansoor, Atif Bin II-521 Mansur, Al II-133, II-851 Manyen, Mark I-998 Manzuri, Mohammad Taghi II-541 Marriott, Kim I-129 Martinez Esturo, Janick I-1157 Martins, N. II-460 Masilamani, V. II-1137 Masood, H. II-521 Mat´ yskov´ a, Martina I-107 Mathew, Abraham T. II-571 May, Thorsten II-31 McConville, David I-975 Mc Lane, Jonathan C. II-1011 McGraw, Tim I-1115 Medina, Jose I-97 Megherbi, Najla II-214 Merrick, Damian II-22 Meyer, Gary I-998 Mezghani, Neila II-493 Mian, Ajmal II-482 Miao, Junwei II-450 Miles, Judith II-743 Milgram, Maurice II-703 Minetto, Rodrigo II-113 Mitiche, Amar I-268, II-181, II-493 Miyatake, Mariko Nakano II-278 Moeslund, T.B. I-428 Monekosso, N. I-440 Moon, Ki-Young II-823 Moorhead, R.J. I-867 Morel, Guillaume II-843 Mori, Masahiko II-772, II-869 Morita, Satoru II-531 Morizet, Nicolas II-661 Moura, Daniel C. I-572 Muchnik, Ilya B. I-278 Mueller, Klaus I-891 Mumtaz, M. II-521 Naegel, B. I-710 Nagao, Tomoharu
II-752
Author Index Nagy, Benedek II-1157 Naito, Takashi II-897 Nakai, Hiroyuki II-713 Nakashima, Tomoharu I-753 Naseem, Imran II-336, II-482 Nasiopoulos, Panos II-297 Navab, Nassir I-35 Ndjiki-Nya, Patrick I-1144 Nebel, Jean-Christophe II-214 Nelson, Brent E. I-240 Neumann, Ulrich II-924 Newsam, Shawn II-356 Neylan, Christopher A. II-329 Nguyen, H.G. II-1176 Nguyen, Thanh Phuong I-688 Ni, Jiangqun II-400 Nicolescu, Mircea I-450 Nicolescu, Monica I-450 Niebling, Florian I-1008 Nielson, Gregory M. I-183 Ninomiya, Yoshiki II-897 Nishihara, Hiroaki II-752 Noh, Junyong I-646 Nunes, Rubens F. I-488 Olivares-Mercado, Jesus II-278 Olivier, Julien I-288 Ord´ on ˜ez Medina, Sergio A. I-1018 Osawa, Noritaka I-987 Ozdemir, Hasan I-248 Pan, Zhenkuan II-733 Panday, Rahul II-1011 Paniagua, Beatriz I-743 Papa, Jo˜ ao P. I-935 Papadakos, Panagiotis I-879 Pardo, Fernando I-205 Park, Hwajin II-1066 Park, Jiyoung II-980 Park, S. I-428 Pedrini, Helio II-113 Peng, Bo II-581 Peng, Jingliang I-65 Perez, Camilo A. II-1106 Perez-Meana, Hector II-278 Perez-Suay, Adrian II-592 Peter, J. Dinesh II-571 Peters, J¨ org I-87 Petkov, Kaloian I-891 Petpon, Amnart II-511
Pizlo, Zygmunt II-561 Placidi, Giuseppe II-612, II-1115 Pless, Robert II-123 Prager, Daniel I-129 Prastawa, Marcel I-562, I-594 Proen¸ca, Hugo I-731 Qi, Guoyuan II-624 Qian, Gang II-581 Qiu, Feng I-891 Qu, Huamin I-161, II-12 Rabens, Marty I-998 Radmanesh, Alireza I-594 Raposo, Alberto I-520 Reis, Ana M. I-572 Remagnino, P. I-440 Rhee, Seon-Min II-924 Ritov, Ya’acov II-43 Rodr´ıguez, Carlos Francisco I-1018 Rodr´ıguez, Marc I-925 Roman, Nathaniel II-123 Rosenhahn, Bodo I-913 R¨ ossl, Christian I-1157 Rossmann, J¨ urgen II-1086 Rousselle, Jean-Jacques I-288 S´ a, Asla I-953 Saipriyadarshan, Cindula II-430 Sakamoto, Naohisa II-970 Salah, Mohamed Ben I-268 Salgian, Andrea II-329, II-889 Sanchez-Perez, Gabriel II-278 S´ anchez-P´erez, Juan M. I-743 Sanyal, J. I-867 ˇ ara, Radim I-107 S´ Sargent, Dusty I-836 Sarve, Hamid I-1071 Satpathy, Amit I-406 Sauvaget, Catherine I-1125 Savidis, Anthony I-879 Sbarski, Peter I-129 Scalzo, Fabien I-370 Schaefer, Gerald I-753 Schlegel, Michael II-224 Schwartz, William Robson II-113 Seidel, Hans-Peter I-913 Sekmen, Ali II-651 Senshu, Hiroki II-1011 Seol, Yeongho I-646
Shaffer, Eric II-1022 Shah, Syed Aamir Ali II-879 Sharif, Md. Haidar II-470 Shen, Yao II-246 Sheng, Yun II-723 Shi, Jianbo I-393 Shigeyama, Yoshihide II-713 Shimizu, Clement I-975, I-998 Shin, Min C. I-1061 Sijbers, J. I-700 Silva, Ricardo Dutra da II-113 Sim, Kang I-308 Sinclair, Christopher Walton I-418 Sinzinger, Eric I-359 Skurikhin, Alexei N. I-298 Sluzek, Andrzej II-346 Snyder, Wesley II-157 Soh, Yeng Chai I-308 Song, Qing I-308 Song, Yi-Zhe I-318 Sotgiu, Antonello II-1115 Souvenir, Richard I-418, I-1061 Sowers, Brian I-1115 Sowmya, Arcot I-508 Sridhar, Anuraag I-508 Srisuk, Sanun II-511 St¨ ottinger, Julian I-338 Steger, Carsten I-35 Strand, Robin II-1157 Stricker, Didier I-530 Strong, Grant II-390 Su, Chung-Yen II-942 Suau, P. II-170 Subramanian, K.G. II-1137 Sugiman, Yasutoshi II-531 Suma, Evan A. I-418 Sun, Chang II-762 Swanson, Kurt W. II-959 Sweety, F. I-945, II-1167 Szumilas, Lech I-338 Tai, Zhenfei II-905 Takahashi, Haruhisa II-278 Tang, Ying II-1076 Tavakkoli, Alireza I-450 Tavares, Jo˜ ao Manuel R.S. I-572 Teixeira, Lucas I-520 Teoh, Soon Tee II-1056 Terhorst, Jim I-975 Theisel, Holger I-1157
Thomas, D.G. I-945, II-1167 Tian, Xiaodong II-772, II-869 Tiddeman, Bernard II-236 Togneri, Roberto II-336 Toledo, Rodrigo de I-139 Tomono, Masahiro I-55 Torabi, Atousa I-1081 Torre˜ ao, Jos´e R.A. II-501 Trivedi, M.M. I-428 Trujillo, Noel I-468 Tsai, Chang-Ming I-836 Tsang, Wai Ming II-869 Tsang, Waiming II-772 Tu, Chunling II-624 Turrini, Cristina II-1125 Ueda, Yasuhiro II-713 Ueng, Shyh-Kuang I-171 Ugail, Hassan II-723 Um, Kiwon I-1135 Vachon, Bertrand I-772 Vasconcelos, Cristina N. I-953 Vega-Rodr´ıguez, Miguel A. I-743 Vegara, Francisco I-205 Velastin, S.A. I-440 Verma, Pramode II-905 Vidal, Creto A. I-488 Vieilleville, Fran¸cois de I-678 Wagan, Asim I-349, I-381, II-915 Wang, Bei I-65 Wang, Guoping I-815 Wang, Hong II-450 Wang, Hong-jun II-762 Wang, Qing I-328, II-410 Wang, Qiong I-763 Wang, Tao I-1051 Wang, Yuan-Fang I-836 Wang, Zhi Min I-308 Warfield, Simon K. I-594 Wei, Weibo II-733 Wei, Zhaoyi I-240 Weickert, Joachim I-913 Weiss, Kenneth II-1000 Wesche, Gerold I-478 Wheeler, Vincent M. II-1011 Whelan, Paul F. II-692 Wiegand, Thomas I-1144 Wientapper, Folker I-530
Author Index Williams, Jorge Luis II-1044 Willis, Phil II-723 Wilson, Robert W. II-889 Wischgoll, Thomas I-1028 Woessner, Uwe I-1008 Wood, Zo¨e I-604 Woodham, Robert J. II-202 Wu, Chun-Chih I-97 Wu, Fan I-624 Wu, Yingcai I-161 Wuest, Harald I-530 Wyk, Barend Jacobus van II-624 Wyk, M. Antonie van II-624 Xia, Tian II-1022 Xie, Ling II-356 Xiong, Zhang I-782 Xu, Peng I-370 Xu, Shuhua II-733 Yamaguchi, Koichiro II-897 Yamamoto, Shuhei II-713 Yamazaki, Kazuo II-772, II-869 Yang, Fenglei II-95 Yang, Heng II-410 Yang, Jingyu I-763 Yang, Xingwei II-561, II-682 Yao, Fenghui II-651 Yap, Choon Kong I-582 Yeh, Chia-Hung II-551 Yin, Xiaotian I-720, II-743 Yoo, Jang-Hee II-823 Yoon, Sang Min II-860 Yow, Kin-Choong II-346
Yu, Weimiao I-540 Yuen, David A. II-1011 Zaharieva, Maia I-11, II-368 Zargianakis, George I-879 Zavisek, Michal I-753 Zendjebil, Imane I-498 Zeng, JingTing II-145, II-682 Zeng, Xiuyuan II-410 Zhan, B. I-440 Zhang, Bin I-891 Zhang, Cai-ming II-672, II-1096 Zhang, Dong II-400 Zhang, Li I-614 Zhang, S. I-867 Zhang, Xi II-772, II-869 Zhang, Xiaolong I-825 Zhang, Xinghui II-624 Zhang, Yunfeng II-672 Zhao, Chunxia I-763 Zhao, Gang II-236 Zheng, Mai II-420 Zhong, Lin I-782 Zhongming, Ding II-970 Zhou, Aoying II-380 Zhou, Jin I-825, II-634 Zhou, Qian-Yi I-965 Zhou, Yuan-feng II-672, II-1096 Zhu, Guangxi II-192 Zhuo, Wei II-990 Zielinski, David J. I-856 Zinner, Christian I-216 Zordan, Victor B. I-97, I-488 Zuccarello, Pedro I-205