Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
6938
George Bebis, Richard Boyle, Bahram Parvin, Darko Koracin, Song Wang, Kim Kyungnam, Bedrich Benes, Kenneth Moreland, Christoph Borst, Stephen DiVerdi, Yi-Jen Chiang, Ming Jiang (Eds.)
Advances in Visual Computing
7th International Symposium, ISVC 2011
Las Vegas, NV, USA, September 26-28, 2011
Proceedings, Part I
Volume Editors
George Bebis, E-mail: [email protected]
Richard Boyle, E-mail: [email protected]
Bahram Parvin, E-mail: [email protected]
Darko Koracin, E-mail: [email protected]
Song Wang, E-mail: [email protected]
Kim Kyungnam, E-mail: [email protected]
Bedrich Benes, E-mail: [email protected]
Kenneth Moreland, E-mail: [email protected]
Christoph Borst, E-mail: [email protected]
Stephen DiVerdi, E-mail: [email protected]
Yi-Jen Chiang, E-mail: [email protected]
Ming Jiang, E-mail: [email protected]

ISSN 0302-9743  e-ISSN 1611-3349
ISBN 978-3-642-24027-0  e-ISBN 978-3-642-24028-7
DOI 10.1007/978-3-642-24028-7
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011935942
CR Subject Classification (1998): I.3-5, H.5.2, I.2.10, J.3, F.2.2, I.3.5
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
It is with great pleasure that we welcome you to the proceedings of the 7th International Symposium on Visual Computing (ISVC 2011), which was held in Las Vegas, Nevada. ISVC provides a common umbrella for the four main areas of visual computing: vision, graphics, visualization, and virtual reality. The goal is to provide a forum for researchers, scientists, engineers, and practitioners throughout the world to present their latest research findings, ideas, developments, and applications in the broader area of visual computing.

This year, the program consisted of 12 oral sessions, 1 poster session, 5 special tracks, and 6 keynote presentations. The response to the call for papers was very good; we received over 240 submissions for the main symposium, from which we accepted 68 papers for oral presentation and 46 papers for poster presentation. Special track papers were solicited separately through the Organizing and Program Committees of each track. A total of 30 papers were accepted for oral presentation in the special tracks.

All papers were reviewed with an emphasis on their potential to contribute to the state of the art in the field. Selection criteria included accuracy and originality of ideas, clarity and significance of results, and presentation quality. The review process was quite rigorous, involving two to three independent blind reviews followed by several days of discussion. During the discussion period we tried to correct anomalies and errors that might have existed in the initial reviews. Despite our efforts, we recognize that some papers worthy of inclusion may not have been included in the program. We offer our sincere apologies to authors whose contributions might have been overlooked.

We wish to thank everybody who submitted their work to ISVC 2011 for review. It was because of their contributions that we succeeded in having a technical program of high scientific quality. In particular, we would like to thank the ISVC 2011 Area Chairs, the organizing institutions (UNR, DRI, LBNL, and NASA Ames), the government and industrial sponsors (Intel, DigitalPersona, Ford, Hewlett Packard, Mitsubishi Electric Research Labs, Toyota, Delphi, General Electric, Microsoft MSDN, and Volt), the international Program Committee, the special track organizers and their Program Committees, the keynote speakers, the reviewers, and especially the authors who contributed their work to the symposium. We would also like to thank Mitsubishi Electric Research Labs for kindly sponsoring a “best paper award” this year.

We sincerely hope that the proceedings of ISVC 2011 will offer opportunities for professional growth.

July 2011
ISVC’11 Steering Committee and Area Chairs
Organization
ISVC 2011 Steering Committee
Bebis George, University of Nevada, Reno, USA and King Saud University, Saudi Arabia
Boyle Richard, NASA Ames Research Center, USA
Parvin Bahram, Lawrence Berkeley National Laboratory, USA
Koracin Darko, Desert Research Institute, USA

ISVC 2011 Area Chairs

Computer Vision
Wang Song, University of South Carolina, USA
Kim Kyungnam (Ken), HRL Laboratories, USA

Computer Graphics
Benes Bedrich, Purdue University, USA
Moreland Kenneth, Sandia National Laboratory, USA

Virtual Reality
Borst Christoph, University of Louisiana at Lafayette, USA
DiVerdi Stephen, Adobe, USA

Visualization
Chiang Yi-Jen, Polytechnic Institute of New York University, USA
Jiang Ming, Lawrence Livermore National Lab, USA
Publicity
Albu Branzan Alexandra, University of Victoria, Canada
Pati Peeta Basa, CoreLogic, India

Local Arrangements
Regentova Emma, University of Nevada, Las Vegas, USA

Special Tracks
Sun Zehang, Apple, USA
ISVC 2011 Keynote Speakers
Comaniciu Dorin, Siemens Corporate Research, USA
Geist Robert, Clemson University, USA
Mueller Klaus, Stony Brook University, USA
Huang Thomas, University of Illinois at Urbana-Champaign, USA
Li Fei-Fei, Stanford University, USA
Lok Benjamin, University of Florida, USA
ISVC 2011 International Program Committee (Area 1) Computer Vision Abidi Besma Abou-Nasr Mahmoud Agaian Sos Aggarwal J.K. Albu Branzan Alexandra Amayeh Gholamreza Agouris Peggy Argyros Antonis Asari Vijayan Athitsos Vassilis Basu Anup Bekris Kostas Belyaev Alexander Bensrhair Abdelaziz Bhatia Sanjiv Bimber Oliver Bioucas Jose Birchfield Stan Bourbakis Nikolaos Brimkov Valentin Campadelli Paola Cavallaro Andrea Charalampidis Dimitrios Chellappa Rama Chen Yang Cheng Hui Chowdhury Amit K. Roy Cochran Steven Douglas Chung Cremers Daniel
University of Tennessee at Knoxville, USA Ford Motor Company, USA University of Texas at San Antonio, USA University of Texas, Austin, USA University of Victoria, Canada Eyecom, USA George Mason University, USA University of Crete, Greece University of Dayton, USA University of Texas at Arlington, USA University of Alberta, Canada University of Nevada at Reno, USA Max-Planck-Institut f¨ ur Informatik, Germany INSA-Rouen, France University of Missouri-St. Louis, USA Johannes Kepler University Linz, Austria Instituto Superior Tecnico, Lisbon, Portugal Clemson University, USA Wright State University, USA State University of New York, USA Universit` a degli Studi di Milano, Italy Queen Mary, University of London, UK University of New Orleans, USA University of Maryland, USA HRL Laboratories, USA Sarnoff Corporation, USA University of California at Riverside, USA University of Pittsburgh, USA Chi-Kit Ronald, The Chinese University of Hong Kong, Hong Kong University of Bonn, Germany
Cui Jinshi Darbon Jerome Davis James W. Debrunner Christian Demirdjian David Duan Ye Doulamis Anastasios Dowdall Jonathan El-Ansari Mohamed El-Gammal Ahmed Eng How Lung Erol Ali Fan Guoliang Ferri Francesc Ferryman James Foresti GianLuca Fowlkes Charless Fukui Kazuhiro Galata Aphrodite Georgescu Bogdan Gleason Goh Wooi-Boon Guerra-Filho Gutemberg Guevara Gustafson David Hammoud Riad Harville Michael He Xiangjian Heikkil Janne Heyden Anders Hongbin Zha Hou Zujun Hua Gang Imiya Atsushi Jia Kevin Kamberov George Kampel Martin Kamberova Gerda Kakadiaris Ioannis Kettebekov Sanzhar Khan Hameed Ullah Kim Tae-Kyun Kimia Benjamin Kisacanin Branislav
Peking University, China CNRS-Ecole Normale Superieure de Cachan, France Ohio State University, USA Colorado School of Mines, USA Vecna Robotics, USA University of Missouri-Columbia, USA National Technical University of Athens, Greece 510 Systems, USA Ibn Zohr University, Morocco University of New Jersey, USA Institute for Infocomm Research, Singapore Ocali Information Technology, Turkey Oklahoma State University, USA Universitat de Valencia, Spain University of Reading, UK University of Udine, Italy University of California, Irvine, USA The University of Tsukuba, Japan The University of Manchester, UK Siemens, USA Shaun, Oak Ridge National Laboratory, USA Nanyang Technological University, Singapore University of Texas Arlington, USA Angel Miguel, University of Porto, Portugal Kansas State University, USA DynaVox Systems, USA Hewlett Packard Labs, USA University of Technology, Sydney, Australia University of Oulu, Finland Lund University, Sweden Peking University, China Institute for Infocomm Research, Singapore IBM T.J. Watson Research Center, USA Chiba University, Japan IGT, USA Stevens Institute of Technology, USA Vienna University of Technology, Austria Hofstra University, USA University of Houston, USA Keane Inc., USA King Saud University, Saudi Arabia Imperial College London, UK Brown University, USA Texas Instruments, USA
Klette Reinhard Kokkinos Iasonas Kollias Stefanos Komodakis Nikos Kozintsev Kuno Latecki Longin Jan Lee D.J. Li Chunming Li Fei-Fei Li Xiaowei Lim Ser N Lin Zhe Lisin Dima Lee Seong-Whan Leung Valerie Leykin Alex Li Shuo Li Wenjing Liu Jianzhuang Loss Leandro Luo Gang Ma Yunqian Maeder Anthony Maltoni Davide Mauer Georg Maybank Steve McGraw Tim Medioni Gerard Melenchn Javier Metaxas Dimitris Miller Ron Ming Wei Mirmehdi Majid Monekosso Dorothy Mueller Klaus Mulligan Jeff Murray Don Nait-Charif Hammadi Nefian Ara Nicolescu Mircea Nixon Mark Nolle Lars
Auckland University, New Zeland Ecole Centrale Paris, France National Technical University of Athens, Greece Ecole Centrale de Paris, France Igor, Intel, USA Yoshinori, Saitama University, Japan Temple University, USA Brigham Young University, USA Vanderbilt University, USA Stanford University, USA Google Inc., USA GE Research, USA Adobe, USA VidoeIQ, USA Korea University, Korea ONERA, France Indiana University, USA GE Healthecare, Canada STI Medical Systems, USA The Chinese University of Hong Kong, Hong Kong Lawrence Berkeley National Lab, USA Harvard University, USA Honyewell Labs, USA University of Western Sydney, Australia University of Bologna, Italy University of Nevada, Las Vegas, USA Birkbeck College, UK West Virginia University, USA University of Southern California, USA Universitat Oberta de Catalunya, Spain Rutgers University, USA Wright Patterson Air Force Base, USA Konica Minolta Laboratory U.S.A., Inc., USA Bristol University, UK University of Ulster, UK Stony Brook University, USA NASA Ames Research Center, USA Point Grey Research, Canada Bournemouth University, UK NASA Ames Research Center, USA University of Nevada, Reno, USA University of Southampton, UK The Nottingham Trent University, UK
Ntalianis Klimis Or Siu Hang Papadourakis George Papanikolopoulos Nikolaos Pati Peeta Basa Patras Ioannis Petrakis Euripides Peyronnet Sylvain Pinhanez Claudio Piccardi Massimo Pietikinen Matti Porikli Fatih Prabhakar Salil Prati Andrea Prokhorov Danil Pylvanainen Timo Qi Hairong Qian Gang Raftopoulos Kostas Regazzoni Carlo Regentova Emma Remagnino Paolo Ribeiro Eraldo Robles-Kelly Antonio Ross Arun Samal Ashok Samir Tamer Sandberg Kristian Sarti Augusto Savakis Andreas Schaefer Gerald Scalzo Fabien Scharcanski Jacob Shah Mubarak Shi Pengcheng Shimada Nobutaka Singh Meghna Singh Rahul Skurikhin Alexei Souvenir Su Chung-Yen
National Technical University of Athens, Greece The Chinese University of Hong Kong, Hong Kong Technological Education Institute, Greece University of Minnesota, USA CoreLogic, India Queen Mary University, London, UK Technical University of Crete, Greece LRDE/EPITA, France IBM Research, Brazil University of Technology, Australia LRDE/University of Oulu, Finland Mitsubishi Electric Research Labs, USA DigitalPersona Inc., USA University of Modena and Reggio Emilia, Italy Toyota Research Institute, USA Nokia, Filand University of Tennessee at Knoxville, USA Arizona State University, USA National Technical University of Athens, Greece University of Genoa, Italy University of Nevada, Las Vegas, USA Kingston University, UK Florida Institute of Technology, USA National ICT Australia (NICTA), Australia West Virginia University, USA University of Nebraska, USA Ingersoll Rand Security Technologies, USA Computational Solutions, USA DEI Politecnico di Milano, Italy Rochester Institute of Technology, USA Loughborough University, UK University of California at Los Angeles, USA UFRGS, Brazil University of Central Florida, USA The Hong Kong University of Science and Technology, Hong Kong Ritsumeikan University, Japan University of Alberta, Canada San Francisco State University, USA Los Alamos National Laboratory, USA Richard, University of North Carolina - Charlotte, USA National Taiwan Normal University, Taiwan
Sugihara Kokichi Sun Zehang Syeda-Mahmood Tanveer Tan Kar Han Tan Tieniu Tavakkoli Alireza Tavares Teoh Eam Khwang Thiran Jean-Philippe Tistarelli Massimo Tong Yan Tsechpenakis Gabriel Tsui T.J. Trucco Emanuele Tubaro Stefano Uhl Andreas Velastin Sergio Verri Alessandro Wang C.L. Charlie Wang Junxian Wang Yunhong Webster Michael Wolff Larry Wong Kenneth Xiang Tao Xue Xinwei Xu Meihe Yang Ming-Hsuan Yang Ruigang Yi Lijun Yu Ting Yu Zeyun Yuan Chunrong Zabulis Xenophon Zhang Yan Cheng Shinko Zhou Huiyu
University of Tokyo, Japan Apple, USA IBM Almaden, USA Hewlett Packard, USA Chinese Academy of Sciences, China University of Houston - Victoria, USA Joao, Universidade do Porto, Portugal Nanyang Technological University, Singapore Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland University of Sassari, Italy University of South Carolina, USA University of Miami, USA Chinese University of Hong Kong, Hong Kong University of Dundee, UK DEI, Politecnico di Milano, Italy Salzburg University, Austria Kingston University London, UK Universit` a di Genova, Italy The Chinese University of Hong Kong, Hong Kong Microsoft, USA Beihang University, China University of Nevada, Reno, USA Equinox Corporation, USA The University of Hong Kong, Hong Kong Queen Mary, University of London, UK Fair Isaac Corporation, USA University of California at Los Angeles, USA University of California at Merced, USA University of Kentucky, USA SUNY at Binghampton, USA GE Global Research, USA University of Wisconsin-Milwaukee, USA University of T¨ ubingen, Germany Foundation for Research and Technology - Hellas (FORTH), Greece Delphi Corporation, USA HRL Labs, USA Queen’s University Belfast, UK
(Area 2) Computer Graphics Abd Rahni Mt Piah Abram Greg Adamo-Villani Nicoletta Agu Emmanuel
Universiti Sains Malaysia, Malaysia Texas Advanced Computing Center, USA Purdue University, USA Worcester Polytechnic Institute, USA
Andres Eric Artusi Alessandro Baciu George Balcisoy Selim Saffet Barneva Reneta Belyaev Alexander Berberich Eric Bilalis Nicholas Bimber Oliver Bohez Erik Bouatouch Kadi Brimkov Valentin Brown Ross Bruckner Stefan Callahan Steven Chen Min Cheng Irene Choi Min Comba Joao Crawfis Roger Cremer Jim Crossno Patricia Culbertson Bruce Debattista Kurt Deng Zhigang Dick Christian Dingliana John El-Sana Jihad Entezari Alireza Fabian Nathan Fiorio Christophe De Floriani Leila Gaither Kelly Gao Chunyu Geist Robert Gelb Dan Gotz David Gooch Amy Gu David Guerra-Filho Gutemberg Habib Zulfiqar Hadwiger Markus
Laboratory XLIM-SIC, University of Poitiers, France CaSToRC Cyprus Institute, Cyprus Hong Kong PolyU, Hong Kong Sabanci University, Turkey State University of New York, USA Max-Planck-Institut f¨ ur Informatik, Germany Max Planck Institute, Germany Technical University of Crete, Greece Johannes Kepler University Linz, Austria Asian Institute of Technology, Thailand University of Rennes I, IRISA, France State University of New York, USA Queensland University of Technology, Australia Vienna University of Technology, Austria University of Utah, USA University of Wales Swansea, UK University of Alberta, Canada University of Colorado at Denver, USA Universidade Federal do Rio Grande do Sul, Brazil Ohio State University, USA University of Iowa, USA Sandia National Laboratories, USA HP Labs, USA University of Warwick, UK University of Houston, USA Technical University of Munich, Germany Trinity College, Ireland Ben Gurion University of The Negev, Israel University of Florida, USA Sandia National Laboratories, USA Universit´e Montpellier 2, LIRMM, France University of Genoa, Italy University of Texas at Austin, USA Epson Research and Development, USA Clemson University, USA Hewlett Packard Labs, USA IBM, USA University of Victoria, Canada State University of New York at Stony Brook, USA University of Texas Arlington, USA COMSATS Institute of Information Technology, Lahore, Pakistan KAUST, Saudi Arabia
Haller Michael Hamza-Lup Felix Han JungHyun Hand Randall Hao Xuejun Hernandez Jose Tiberio Huang Jian Huang Mao Lin Huang Zhiyong Hussain Muhammad Joaquim Jorge Jones Michael Ju Tao Julier Simon J. Kakadiaris Ioannis Kamberov George Klosowski James Kobbelt Leif Kolingerova Ivana Kuan Hwee Lee Lai Shuhua Lee Chang Ha Lee Tong-Yee Levine Martin Lewis R. Robert Li Frederick Lindstrom Peter Linsen Lars Loviscach Joern Magnor Marcus Majumder Aditi Mantler Stephan Martin Ralph McGraw Tim Meenakshisundaram Gopi Mendoza Cesar Metaxas Dimitris Myles Ashish Nait-Charif Hammadi Nasri Ahmad Noma Tsukasa Okada Yoshihiro Olague Gustavo
Upper Austria University of Applied Sciences, Austria Armstrong Atlantic State University, USA Korea University, Korea Lockheed Martin Corporation, USA Columbia University and NYSPI, USA Universidad de los Andes, Colombia University of Tennessee at Knoxville, USA University of Technology, Australia Institute for Infocomm Research, Singapore King Saud University, Saudi Arabia Instituto Superior Tecnico, Portugal Brigham Young University, USA Washington University, USA University College London, UK University of Houston, USA Stevens Institute of Technology, USA AT&T Labs, USA RWTH Aachen, Germany University of West Bohemia, Czech Republic Bioinformatics Institute, A*STAR, Singapore Virginia State University, USA Chung-Ang University, Korea National Cheng-Kung University, Taiwan McGill University, Canada Washington State University, USA University of Durham, UK Lawrence Livermore National Laboratory, USA Jacobs University, Germany Fachhochschule Bielefeld (University of Applied Sciences), Germany TU Braunschweig, Germany University of California, Irvine, USA VRVis Research Center, Austria Cardiff University, UK West Virginia University, USA University of California-Irvine, USA NaturalMotion Ltd., USA Rutgers University, USA University of Florida, USA University of Dundee, UK American University of Beirut, Lebanon Kyushu Institute of Technology, Japan Kyushu University, Japan CICESE Research Center, Mexico
Oliveira Manuel M. Ostromoukhov Victor M. Pascucci Valerio Patchett John Peterka Tom Peters Jorg Qin Hong Rautek Peter Razdan Anshuman Renner Gabor Rosen Paul Rosenbaum Rene Rudomin Rushmeier Sander Pedro Sapidis Nickolas Sarfraz Muhammad Scateni Riccardo Schaefer Scott Sequin Carlo Shead Tinothy Sourin Alexei Stamminger Marc Su Wen-Poh Szumilas Lech Tan Kar Han Tarini Marco Teschner Matthias Tsong Ng Tian Umlauf Georg Vanegas Carlos Wald Ingo Wang Sen Wimmer Michael Woodring Jon Wylie Brian Wyman Chris Wyvill Brian Yang Qing-Xiong Yang Ruigang
Universidade Federal do Rio Grande do Sul, Brazil University of Montreal, Canada University of Utah, USA Los Alamons National Lab, USA Argonne National Laboratory, USA University of Florida, USA State University of New York at Stony Brook, USA Vienna University of Technology, Austria Arizona State University, USA Computer and Automation Research Institute, Hungary University of Utah, USA University of California at Davis, USA Isaac, ITESM-CEM, Mexico Holly, Yale University, USA The Hong Kong University of Science and Technology, Hong Kong University of Western Macedonia, Greece Kuwait University, Kuwait University of Cagliari, Italy Texas A&M University, USA University of California-Berkeley, USA Sandia National Laboratories, USA Nanyang Technological University, Singapore REVES/INRIA, France Griffith University, Australia Research Institute for Automation and Measurements, Poland Hewlett Packard, USA Universit` a dell’Insubria (Varese), Italy University of Freiburg, Germany Institute for Infocomm Research, Singapore HTWG Constance, Germany Purdue University, USA University of Utah, USA Kodak, USA Technical University of Vienna, Austria Los Alamos National Laboratory, USA Sandia National Laboratory, USA University of Calgary, Canada University of Iowa, USA University of Illinois at Urbana, Champaign, USA University of Kentucky, USA
Ye Duan Yi Beifang Yin Lijun Yoo Terry Yuan Xiaoru Zhang Jian Jun Zara Jiri Zordan Victor
University of Missouri-Columbia, USA Salem State College, USA Binghamton University, USA National Institutes of Health, USA Peking University, China Bournemouth University, UK Czech Technical University in Prague, Czech University of California at Riverside, USA
(Area 3) Virtual Reality Alcaiz Mariano Arns Laura Azuma Robert Balcisoy Selim Behringer Reinhold Bilalis Nicholas Blach Roland Blom Kristopher Boulic Ronan Brady Rachael Brega Jose Remo Ferreira Brown Ross Bruce Thomas Bues Matthias Chen Jian Cheng Irene Coquillart Sabine Craig Alan Cremer Jim Egges Arjan Encarnacao L. Miguel Figueroa Pablo Fox Jesse Friedman Doron Gregory Michelle Gupta Satyandra K. Haller Michael Hamza-Lup Felix Hinkenjann Andre Hollerer Tobias Huang Jian Julier Simon J. Kiyokawa Kiyoshi
Technical University of Valencia, Spain Purdue University, USA Nokia, USA Sabanci University, Turkey Leeds Metropolitan University UK Technical University of Crete, Greece Fraunhofer Institute for Industrial Engineering, Germany University of Barcelona, Spain EPFL, Switzerland Duke University, USA Universidade Estadual Paulista, Brazil Queensland University of Technology, Australia The University of South Australia, Australia Fraunhofer IAO in Stuttgart, Germany Brown University, USA University of Alberta, Canada INRIA, France NCSA University of Illinois at Urbana-Champaign, USA University of Iowa, USA Universiteit Utrecht, The Netherlands University of Louisville, USA Universidad de los Andes, Colombia Stanford University, USA IDC, Israel Pacific Northwest National Lab, USA University of Maryland, USA FH Hagenberg, Austria Armstrong Atlantic State University, USA Bonn-Rhein-Sieg University of Applied Sciences, Germany University of California at Santa Barbara, USA University of Tennessee at Knoxville, USA University College London, UK Osaka University, Japan
Klosowski James Kozintsev Kuhlen Torsten Lee Cha Liere Robert van Livingston A. Mark Majumder Aditi Malzbender Tom Mantler Stephan Molineros Jose Muller Stefan Olwal Alex Paelke Volker Papka Michael Peli Eli Pettifer Steve Piekarski Wayne Pugmire Dave Qian Gang Raffin Bruno Raij Andrew Reiners Dirk Richir Simon Rodello Ildeberto Sandor Christian Santhanam Anand Sapidis Nickolas Schulze Sherman Bill Slavik Pavel Sourin Alexei Steinicke Frank Su Simon Suma Evan Stamminger Marc Srikanth Manohar Stefani Oliver Sun Hanqiu Varsamidis Thomas Vercher Jean-Louis Wald Ingo Wither Jason
AT&T Labs, USA Igor, Intel, USA RWTH Aachen University, Germany University of California, Santa Barbara, USA CWI, The Netherlands Naval Research Laboratory, USA University of California, Irvine, USA Hewlett Packard Labs, USA VRVis Research Center, Austria Teledyne Scientific and Imaging, USA University of Koblenz, Germany MIT, USA Institut de Geom`atica, Spain Argonne National Laboratory, USA Harvard University, USA The University of Manchester, UK Qualcomm Bay Area R&D, USA Los Alamos National Lab, USA Arizona State University, USA INRIA, France University of South Florida, USA University of Louisiana, USA Arts et Metiers ParisTech, France University of Sao Paulo, Brazil University of South Australia, Australia University of California at Los Angeles, USA University of Western Macedonia, Greece Jurgen, University of California - San Diego, USA Indiana University, USA Czech Technical University in Prague, Czech Republic Nanyang Technological University, Singapore University of M¨ unster, Germany Geophysical Fluid Dynamics Laboratory, NOAA, USA University of Southern California, USA REVES/INRIA, France Indian Institute of Science, India COAT-Basel, Switzerland The Chinese University of Hong Kong, Hong Kong Bangor University, UK Universit´e de la M´editerrane, France University of Utah, USA University of California, Santa Barbara, USA
Yu Ka Chun Yuan Chunrong Zachmann Gabriel Zara Jiri Zhang Hui Zhao Ye
Denver Museum of Nature and Science, USA University of T¨ ubingen, Germany Clausthal University, Germany Czech Technical University in Prague, Czech Republic Indiana University, USA Kent State University, USA
(Area 4) Visualization Andrienko Gennady Avila Lisa Apperley Mark Balzs Csbfalvi Brady Rachael Benes Bedrich Bilalis Nicholas Bonneau Georges-Pierre Brown Ross Bhler Katja Callahan Steven Chen Jian Chen Min Cheng Irene Chourasia Amit Coming Daniel Dana Kristin Daniels Joel Dick Christian Doleisch Helmut Duan Ye Dwyer Tim Ebert David Entezari Alireza Ertl Thomas De Floriani Leila Fujishiro Issei Geist Robert Goebel Randy Gotz David Grinstein Georges Goebel Randy Gregory Michelle Hadwiger Helmut Markus Hagen Hans
Fraunhofer Institute IAIS, Germany Kitware, USA University of Waikato, New Zealand Budapest University of Technology and Economics, Hungary Duke University, USA Purdue University, USA Technical University of Crete, Greece Grenoble Universit´e , France Queensland University of Technology, Australia VRVIS, Austria University of Utah, USA Brown University, USA University of Wales Swansea, UK University of Alberta, Canada University of California - San Diego, USA Desert Research Institute, USA Rutgers University, USA University of Utah, USA Technical University of Munich, Germany VRVis Research Center, Austria University of Missouri-Columbia, USA Monash University, Australia Purdue University, USA University of Florida, USA University of Stuttgart, Germany University of Maryland, USA Keio University, Japan Clemson University, USA University of Alberta, Canada IBM, USA University of Massachusetts Lowell, USA University of Alberta, Canada Pacific Northwest National Lab, USA VRVis Research Center, Austria Technical University of Kaiserslautern, Germany
Hamza-Lup Felix Heer Jeffrey Hege Hans-Christian Hochheiser Harry Hollerer Tobias Hong Lichan Hotz Ingrid Joshi Alark Julier Simon J. Kao David Kohlhammer Jrn Kosara Robert Laramee Robert Lee Chang Ha Lewis R. Robert Liere Robert van Lim Ik Soo Linsen Lars Liu Zhanping Ma Kwan-Liu Maeder Anthony Majumder Aditi Malpica Jose Masutani Yoshitaka Matkovic Kresimir McCaffrey James McGraw Tim Melanon Guy Miksch Silvia Monroe Laura Morie Jacki Mueller Klaus Museth Ken Paelke Volker Papka Michael Pettifer Steve Pugmire Dave Rabin Robert Raffin Bruno Razdan Anshuman Rhyne Theresa-Marie Rosenbaum Rene Santhanam Anand Scheuermann Gerik
Armstrong Atlantic State University, USA Armstrong University of California at Berkeley, USA Zuse Institute Berlin, Germany University of Pittsburgh, USA University of California at Santa Barbara, USA Palo Alto Research Center, USA Zuse Institute Berlin, Germany Yale University, USA University College London, UK NASA Ames Research Center, USA Fraunhofer Institut, Germany University of North Carolina at Charlotte, USA Swansea University, UK Chung-Ang University, Korea Washington State University, USA CWI, The Netherlands Bangor University, UK Jacobs University, Germany University of Pennsylvania, USA University of California-Davis, USA University of Western Sydney, Australia University of California, Irvine, USA Alcala University, Spain The University of Tokyo Hospital, Japan VRVis Forschungs-GmbH, Austria Microsoft Research / Volt VTE, USA West Virginia University, USA CNRS UMR 5800 LaBRI and INRIA Bordeaux Sud-Ouest, France Vienna University of Technology, Austria Los Alamos National Labs, USA University of Southern California, USA Stony Brook University, USA Link¨ oping University, Sweden Institut de Geom`atica, Spain Argonne National Laboratory, USA The University of Manchester, UK Los Alamos National Lab, USA University of Wisconsin at Madison, USA INRIA, France Arizona State University, USA North Carolina State University, USA University of California at Davis, USA University of California at Los Angeles, USA University of Leipzig, Germany
Shead Tinothy Shen Han-Wei Sips Mike Slavik Pavel Sourin Alexei Thakur Sidharth Theisel Holger Thiele Olaf Toledo de Rodrigo Tricoche Xavier Umlauf Georg Viegas Fernanda Wald Ingo Wan Ming Weinkauf Tino Weiskopf Daniel Wischgoll Thomas Wylie Brian Yeasin Mohammed Yuan Xiaoru Zachmann Gabriel Zhang Hui Zhao Ye Zhukov Leonid
Sandia National Laboratories, USA Ohio State University, USA Stanford University, USA Czech Technical University in Prague, Czech Republic Nanyang Technological University, Singapore Renaissance Computing Institute (RENCI), USA University of Magdeburg, Germany University of Mannheim, Germany Petrobras PUC-RIO, Brazil Purdue University, USA HTWG Constance, Germany IBM, USA University of Utah, USA Boeing Phantom Works, USA Courant Institute, New York University, USA University of Stuttgart, Germany Wright State University, USA Sandia National Laboratory, USA Memphis University, USA Peking University, China Clausthal University, Germany Indiana University, USA Kent State University, USA Caltech, USA
ISVC 2011 Special Tracks

1. 3D Mapping, Modeling and Surface Reconstruction

Organizers
Nefian Ara, Carnegie Mellon University/NASA Ames Research Center, USA
Edwards Laurence, NASA Ames Research Center, USA
Huertas Andres, NASA Jet Propulsion Lab, USA

Program Committee
Bradski Gary, Willow Garage, USA
Zakhor Avideh, University of California at Berkeley, USA
Cavallaro Andrea, University Queen Mary, London, UK
Bouguet Jean-Yves, Google, USA
2. Best Practices in Teaching Visual Computing

Organizers
Albu Alexandra Branzan, University of Victoria, Canada
Bebis George, University of Nevada, Reno, USA and King Saud University, Saudi Arabia

Program Committee
Antonacopoulos Apostolos, University of Salford, UK
Bellon Olga Regina Pereira, Universidade Federal do Parana, Brazil
Bowyer Kevin, University of Notre Dame, USA
Crawfis Roger, Ohio State University, USA
Hammoud Riad, DynaVox Systems, USA
Kakadiaris Ioannis, University of Houston, USA
Lladós Josep, Universitat Autonoma de Barcelona, Spain
Sarkar Sudeep, University of South Florida, USA
3. Immersive Visualization

Organizers
Sherman Bill, Indiana University, USA
Wernert Eric, Indiana University, USA
O'Leary Patrick, University of Calgary, Canada
Coming Daniel, Desert Research Institute, USA

Program Committee
Su Simon, Princeton University, USA
Folcomer Samuel, Brown University, USA
Brady Rachael, Duke University, USA
Johnson Andy, University of Illinois at Chicago, USA
Kreylos Oliver, University of California at Davis, USA
Will Jeffrey, Valparaiso University, USA
Moreland John, Purdue University, Calumet, USA
Leigh Jason, University of Illinois, Chicago, USA
Schulze Jurgen, University of California, San Diego, USA
Sanyal Jibonananda, Mississippi State University, USA
Stone John, University of Illinois, Urbana-Champaign, USA
Kuhlen Torsten, Aachen University, Germany
4. Computational Bioimaging

Organizers
Tavares João Manuel R.S., University of Porto, Portugal
Natal Jorge Renato, University of Porto, Portugal
Cunha Alexandre, Caltech, USA
Program Committee
Santis De Alberto, Università degli Studi di Roma “La Sapienza”, Italy
Reis Ana Mafalda, Instituto de Ciências Biomédicas Abel Salazar, Portugal
Barrutia Arrate Muñoz, University of Navarra, Spain
Calvo Begoña, University of Zaragoza, Spain
Constantinou Christos, Stanford University, USA
Iacoviello Daniela, Università degli Studi di Roma “La Sapienza”, Italy
Ushizima Daniela, Lawrence Berkeley National Lab, USA
Ziou Djemel, University of Sherbrooke, Canada
Pires Eduardo Borges, Instituto Superior Técnico, Portugal
Sgallari Fiorella, University of Bologna, Italy
Perales Francisco, Balearic Islands University, Spain
Qiu Guoping, University of Nottingham, UK
Hanchuan Peng, Howard Hughes Medical Institute, USA
Pistori Hemerson, Dom Bosco Catholic University, Brazil
Yanovsky Igor, Jet Propulsion Laboratory, USA
Corso Jason, SUNY at Buffalo, USA
Maldonado Javier Melenchón, Open University of Catalonia, Spain
Marques Jorge S., Instituto Superior Técnico, Portugal
Aznar Jose M. García, University of Zaragoza, Spain
Vese Luminita, University of California at Los Angeles, USA
Reis Luís Paulo, University of Porto, Portugal
Thiriet Marc, Université Pierre et Marie Curie (Paris VI), France
Mahmoud El-Sakka, The University of Western Ontario London, Canada
Hidalgo Manuel González, Balearic Islands University, Spain
Gurcan Metin N., Ohio State University, USA
Dubois Patrick, Institut de Technologie Médicale, France
Barneva Reneta P., State University of New York, USA
Bellotti Roberto, University of Bari, Italy
Tangaro Sabina, University of Bari, Italy
Silva Susana Branco, University of Lisbon, Portugal
Brimkov Valentin, State University of New York, USA
Zhan Yongjie, Carnegie Mellon University, USA
5. Interactive Visualization in Novel and Heterogeneous Display Environments

Organizers
Rosenbaum Rene, University of California, Davis, USA
Tominski Christian, University of Rostock, Germany
Program Committee
Isenberg Petra, INRIA, France
Isenberg Tobias, University of Groningen, The Netherlands and CNRS/INRIA, France
Kerren Andreas, Linnaeus University, Sweden
Majumder Aditi, University of California, Irvine, USA
Quigley Aaron, University of St. Andrews, UK
Schumann Heidrun, University of Rostock, Germany
Sips Mike, GFZ Potsdam, Germany
Slavik Pavel, Czech Technical University in Prague, Czech Republic
Weiskopf Daniel, University of Stuttgart, Germany
Additional Reviewers
Payet Nadia, Hewlett Packard Labs, USA
Hong Wei, Hewlett Packard Labs, USA
Organizing Institutions and Sponsors
Table of Contents – Part I
ST: Computational Bioimaging EM+TV Based Reconstruction for Cone-Beam CT with Reduced Radiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ming Yan, Jianwen Chen, Luminita A. Vese, John Villasenor, Alex Bui, and Jason Cong A Localization Framework under Non-rigid Deformation for Robotic Surgery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiang Xiang Global Image Registration by Fast Random Projection . . . . . . . . . . . . . . . Hayato Itoh, Shuang Lu, Tomoya Sakai, and Atsushi Imiya
1
11
23
EM-Type Algorithms for Image Reconstruction with Background Emission and Poisson Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ming Yan
33
Region-Based Segmentation of Parasites for High-throughput Screening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Asher Moody-Davis, Laurent Mennillo, and Rahul Singh
43
Computer Graphics I Adaptive Coded Aperture Photography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Oliver Bimber, Haroon Qureshi, Anselm Grundh¨ ofer, Max Grosse, and Daniel Danch
54
Display Pixel Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Clemens Birklbauer, Max Grosse, Anselm Grundh¨ ofer, Tianlun Liu, and Oliver Bimber
66
Image Relighting by Analogy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiao Teng and Tat-Jen Cham
78
Generating EPI Representations of 4D Light Fields with a Single Lens Focused Plenoptic Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sven Wanner, Janis Fehr, and Bernd J¨ ahne
90
MethMorph: Simulating Facial Deformation Due to Methamphatamine Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mahsa Kamali, Forrest N. Iandola, Hui Fang, and John C. Hart
102
Motion and Tracking I Segmentation-Free, Area-Based Articulated Object Tracking . . . . . . . . . . . Daniel Mohr and Gabriel Zachmann
112
An Attempt to Segment Foreground in Dynamic Scenes . . . . . . . . . . . . . . . Xiang Xiang
124
From Saliency to Eye Gaze: Embodied Visual Selection for a Pan-Tilt-Based Robotic Head . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matei Mancas, Fiora Pirri, and Matia Pizzoli
135
Adaptive Two-Step Adjustable Partial Distortion Search Algorithm for Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yonghoon Kim, Dokyung Lee, and Jechang Jeong
147
Feature Trajectory Retrieval with Application to Accurate Structure and Motion Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kai Cordes, Oliver M¨ uller, Bodo Rosenhahn, and J¨ orn Ostermann
156
Distortion Compensation for Movement Detection Based on Dense Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Josef Maier and Kristian Ambrosch
168
Segmentation Free Boundary Conditions Active Contours with Applications for Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michal Shemesh and Ohad Ben-Shahar
180
Evolving Content-Driven Superpixels for Accurate Image Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Richard J. Lowe and Mark S. Nixon
192
A Parametric Active Polygon for Leaf Segmentation and Shape Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guillaume Cerutti, Laure Tougne, Antoine Vacavant, and Didier Coquin
202
Avoiding Mesh Folding in 3D Optimal Surface Segmentation . . . . . . . . . . Christian Bauer, Shanhui Sun, and Reinhard Beichel
214
High Level Video Temporal Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . Ruxandra Tapu and Titus Zaharia
224
Embedding Gestalt Laws on Conditional Random Field for Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Olfa Besbes, Nozha Boujemaa, and Ziad Belhadj
236
Higher Order Markov Networks for Model Estimation . . . . . . . . . . . . . . . . Toufiq Parag and Ahmed Elgammal
246
Visualization I Interactive Object Graphs for Debuggers with Improved Visualization, Inspection and Configuration Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anthony Savidis and Nikos Koutsopoulos
259
GPU-Based Ray Casting of Stacked Out-of-Core Height Fields . . . . . . . . . Christopher Lux and Bernd Fr¨ ohlich
269
Multi-View Stereo Point Clouds Visualization . . . . . . . . . . . . . . . . . . . . . . . Yi Gong and Yuan-Fang Wang
281
Depth Map Enhancement Using Adaptive Steering Kernel Regression Based on Distance Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sung-Yeol Kim, Woon Cho, Andreas Koschan, and Mongi A. Abidi Indented Pixel Tree Browser for Exploring Huge Hierarchies . . . . . . . . . . . Michael Burch, Hansj¨ org Schmauder, and Daniel Weiskopf
291
301
ST: 3D Mapping, Modeling and Surface Reconstruction I Towards Realtime Handheld MonoSLAM in Dynamic Environments . . . . Samunda Perera and Ajith Pasqual Registration of 3D Geometric Model and Color Images Using SIFT and Range Intensity Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ryo Inomata, Kenji Terabayashi, Kazunori Umeda, and Guy Godin Denoising Time-Of-Flight Data with Adaptive Total Variation . . . . . . . . . Frank Lenzen, Henrik Sch¨ afer, and Christoph Garbe Efficient City-Sized 3D Reconstruction from Ultra-High Resolution Aerial and Ground Video Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexandru N. Vasile, Luke J. Skelly, Karl Ni, Richard Heinrichs, and Octavia Camps Non-Parametric Sequential Frame Decimation for Scene Reconstruction in Low-Memory Streaming Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Knoblauch, Mauricio Hess-Flores, Mark A. Duchaineau, Kenneth I. Joy, and Falko Kuester
313
325 337
347
359
Biomedical Imaging Ground Truth Estimation by Maximizing Topological Agreements in Electron Microscopy Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Huei-Fang Yang and Yoonsuck Choe Segmentation and Cell Tracking of Breast Cancer Cells . . . . . . . . . . . . . . . Adele P. Peskin, Daniel J. Hoeppner, and Christina H. Stuelten
371 381
Registration for 3D Morphological Comparison of Brain Aneurysm Growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Carl Lederman, Luminita Vese, and Aichi Chien
392
An Interactive Editing Framework for Electron Microscopy Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Huei-Fang Yang and Yoonsuck Choe
400
Retinal Vessel Extraction Using First-Order Derivative of Gaussian and Morphological Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, Christopher G. Owen, Alicja R. Rudnicka, and S.A. Barman
410
Computer Graphics II High-Quality Shadows with Improved Paraboloid Mapping . . . . . . . . . . . . Juraj Vanek, Jan Navr´ atil, Adam Herout, and Pavel Zemˇc´ık
421
Terramechanics Based Terrain Deformation for Real-Time Off-Road Vehicle Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ying Zhu, Xiao Chen, and G. Scott Owen
431
An Approach to Point Based Approximate Color Bleeding with Volumes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christopher J. Gibson and Zo¨e J. Wood
441
3D Reconstruction of Buildings with Automatic Facade Refinement . . . . C. Larsen and T.B. Moeslund Surface Reconstruction of Maltese Cisterns Using ROV Sonar Data for Archeological Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C. Forney, J. Forrester, B. Bagley, W. McVicker, J. White, T. Smith, J. Batryn, A. Gonzalez, J. Lehr, T. Gambin, C.M. Clark, and Z.J. Wood
451
461
ST: Interactive Visualization in Novel and Heterogeneous Display Environments Supporting Display Scalability by Redundant Mapping . . . . . . . . . . . . . . . Axel Radloff, Martin Luboschik, Mike Sips, and Heidrun Schumann
472
A New 3D Imaging System Using a Portable Two-Camera Omni-Imaging Device for Construction and Browsing of Human-Reachable Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu-Tung Kuo and Wen-Hsiang Tsai Physical Navigation to Support Graph Exploration on a Large High-Resolution Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anke Lehmann, Heidrun Schumann, Oliver Staadt, and Christian Tominski An Extensible Interactive 3D Visualization Framework for N-Dimensional Datasets Used in Heterogeneous Software Display Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nathaniel Rossol, Irene Cheng, John Berezowski, and Iqbal Jamal Improving Collaborative Visualization of Structural Biology . . . . . . . . . . . Aaron Bryden, George N. Phillips Jr., Yoram Griguer, Jordan Moxon, and Michael Gleicher Involve Me and I Will Understand!–Abstract Data Visualization in Immersive Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ren´e Rosenbaum, Jeremy Bottleson, Zhuiguang Liu, and Bernd Hamann
484
496
508 518
530
Object Detection and Recognition I Automated Fish Taxonomy Using Evolution-COnstructed Features . . . . . Kirt Lillywhite and Dah-Jye Lee
541
A Monocular Human Detection System Based on EOH and Oriented LBP Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yingdong Ma, Xiankai Chen, Liu Jin, and George Chen
551
Using the Shadow as a Single Feature for Real-Time Monocular Vehicle Pose Determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dennis Rosebrock, Markus Rilk, Jens Spehr, and Friedrich M. Wahl
563
Multi-class Object Layout with Unsupervised Image Classification and Object Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ser-Nam Lim, Gianfranco Doretto, and Jens Rittscher
573
Efficient Detection of Consecutive Facial Expression Apices Using Biologically Based Log-Normal Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zakia Hammal
586
DTTM: A Discriminative Temporal Topic Model for Facial Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lifeng Shang, Kwok-Ping Chan, and Guodong Pan
596
Visualization II Direct Spherical Parameterization of 3D Triangular Meshes Using Local Flattening Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bogdan Mocanu and Titus Zaharia
607
Segmentation and Visualization of Multivariate Features Using Feature-Local Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kenny Gruchalla, Mark Rast, Elizabeth Bradley, and Pablo Mininni
619
Magic Marker: A Color Analytics Interface for Image Annotation . . . . . . Supriya Garg, Kshitij Padalkar, and Klaus Mueller BiCluster Viewer: A Visualization Tool for Analyzing Gene Expression Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Julian Heinrich, Robert Seifert, Michael Burch, and Daniel Weiskopf Visualizing Translation Variation: Shakespeare’s Othello . . . . . . . . . . . . . . Zhao Geng, Robert S. Laramee, Tom Cheesman, Alison Ehrmann, and David M. Berry
629
641 653
ST: 3D Mapping, Modeling and Surface Reconstruction II 3D Object Modeling with Graphics Hardware Acceleration and Unsupervised Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Felipe Montoya–Franco, Andr´es F. Serna–Morales, and Flavio Prieto
664
Event-Based Stereo Matching Approaches for Frameless Address Event Stereo Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J¨ urgen Kogler, Martin Humenberger, and Christoph Sulzbachner
674
A Variational Model for the Restoration of MR Images Corrupted by Blur and Rician Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pascal Getreuer, Melissa Tong, and Luminita A. Vese
686
Robust Classification of Curvilinear and Surface-Like Structures in 3d Point Cloud Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mahsa Kamali, Matei Stroila, Jason Cho, Eric Shaffer, and John C. Hart Orthographic Stereo Correlator on the Terrain Model for Apollo Metric Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Taemin Kim, Kyle Husmann, Zachary Moratto, and Ara V. Nefian
699
709
Motion and Tracking II Collaborative Track Analysis, Data Cleansing, and Labeling . . . . . . . . . . . George Kamberov, Gerda Kamberova, Matt Burlick, Lazaros Karydas, and Bart Luczynski
718
Time to Collision and Collision Risk Estimation from Local Scale and Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shrinivas Pundlik, Eli Peli, and Gang Luo
728
Visual Tracking Based on Log-Euclidean Riemannian Sparse Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yi Wu, Haibin Ling, Erik Blasch, Li Bai, and Genshe Chen
738
Panoramic Background Generation and Abnormal Behavior Detection in PTZ Camera Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sang-Hyun Cho and Hang-Bong Kang
748
Computing Range Flow from Multi-modal Kinect Data . . . . . . . . . . . . . . . Jens-Malte Gottfried, Janis Fehr, and Christoph S. Garbe
758
Real-Time Object Tracking on iPhone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Amin Heidari and Parham Aarabi
768
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
779
Table of Contents – Part II
ST: Immersive Visualization Immersive Out-of-Core Visualization of Large-Size and Long-Timescale Molecular Dynamics Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . John E. Stone, Kirby L. Vandivort, and Klaus Schulten The OmegaDesk: Towards a Hybrid 2D and 3D Work Desk . . . . . . . . . . . Alessandro Febretti, Victor A. Mateevitsi, Dennis Chau, Arthur Nishimoto, Brad McGinnis, Jakub Misterka, Andrew Johnson, and Jason Leigh Disambiguation of Horizontal Direction for Video Conference Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mabel Mengzi Zhang, Seth Rotkin, and J¨ urgen P. Schulze Immersive Visualization and Interactive Analysis of Ground Penetrating Radar Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthew R. Sgambati, Steven Koepnick, Daniel S. Coming, Nicholas Lancaster, and Frederick C. Harris Jr.
1 13
24
33
Handymap: A Selection Interface for Cluttered VR Environments Using a Tracked Hand-Held Touch Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mores Prachyabrued, David L. Ducrest, and Christoph W. Borst
45
Virtual Interrupted Suturing Exercise with the Endo Stitch Suturing Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sukitti Punak, Sergei Kurenov, and William Cance
55
Applications New Image Steganography via Secret-Fragment-Visible Mosaic Images by Nearly-Reversible Color Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . Ya-Lin Li and Wen-Hsiang Tsai
64
Adaptive and Nonlinear Techniques for Visibility Improvement of Hazy Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Saibabu Arigela and Vijayan K. Asari
75
Linear Clutter Removal from Urban Panoramas . . . . . . . . . . . . . . . . . . . . . . Mahsa Kamali, Eyal Ofek, Forrest Iandola, Ido Omer, and John C. Hart
85
Efficient Starting Point Decision for Enhanced Hexagonal Search . . . . . . . Do-Kyung Lee and Je-Chang Jeong
95
Multiview 3D Pose Estimation of a Wand for Human-Computer Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X. Zabulis, P. Koutlemanis, H. Baltzakis, and D. Grammenos
104
Object Detection and Recognition II Material Information Acquisition Using a ToF Range Sensor for Interactive Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Md. Abdul Mannan, Hisato Fukuda, Yoshinori Kobayashi, and Yoshinori Kuno A Neuromorphic Approach to Object Detection and Recognition in Airborne Videos with Stabilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yang Chen, Deepak Khosla, David Huber, Kyungnam Kim, and Shinko Y. Cheng
116
126
Retrieval of 3D Polygonal Objects Based on Multiresolution Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roberto Lam and J.M. Hans du Buf
136
3D Facial Feature Detection Using Iso-Geodesic Stripes and Shape-Index Based Integral Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . James Allen, Nikhil Karkera, and Lijun Yin
148
Hybrid Face Recognition Based on Real-Time Multi-camera Stereo-Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J. Hensler, K. Denker, M. Franz, and G. Umlauf
158
Learning Image Transformations without Training Examples . . . . . . . . . . Sergey Pankov
168
Virtual Reality Investigation of Secondary Views in a Multimodal VR Environment: 3D Lenses, Windows, and Mirrors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Phanidhar Bezawada Raghupathy and Christoph W. Borst
180
Synthesizing Physics-Based Vortex and Collision Sound in Virtual Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Damon Shing-Min Liu, Ting-Wei Cheng, and Yu-Cheng Hsieh
190
BlenSor: Blender Sensor Simulation Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . Michael Gschwandtner, Roland Kwitt, Andreas Uhl, and Wolfgang Pree
199
Fuzzy Logic Based Sensor Fusion for Accurate Tracking . . . . . . . . . . . . . . . Ujwal Koneru, Sangram Redkar, and Anshuman Razdan
209
A Flight Tested Wake Turbulence Aware Altimeter . . . . . . . . . . . . . . . . . . . Scott Nykl, Chad Mourning, Nikhil Ghandi, and David Chelberg A Virtual Excavation: Combining 3D Immersive Virtual Reality and Geophysical Surveying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Albert Yu-Min Lin, Alexandre Novo, Philip P. Weber, Gianfranco Morelli, Dean Goodman, and J¨ urgen P. Schulze
219
229
ST: Best Practices in Teaching Visual Computing Experiences in Disseminating Educational Visualizations . . . . . . . . . . . . . . Nathan Andrysco, Paul Rosen, Voicu Popescu, Bedˇrich Beneˇs, and Kevin Robert Gurney Branches and Roots: Project Selection in Graphics Courses for Fourth Year Computer Science Undergraduates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M.D. Jones Raydiance: A Tangible Interface for Teaching Computer Vision . . . . . . . . Paul Reimer, Alexandra Branzan Albu, and George Tzanetakis
239
249
259
Poster Session

Subvoxel Super-Resolution of Volumetric Motion Field Using General Order Prior . . . . . . . . . . 270
Koji Kashu, Atsushi Imiya, and Tomoya Sakai

Architectural Style Classification of Building Facade Windows . . . . . . . . . . 280
Gayane Shalunts, Yll Haxhimusa, and Robert Sablatnig

Are Current Monocular Computer Vision Systems for Human Action Recognition Suitable for Visual Surveillance Applications? . . . . . . . . . . 290
Jean-Christophe Nebel, Michal Lewandowski, Jérôme Thévenon, Francisco Martínez, and Sergio Velastin

Near-Optimal Time Function for Secure Dynamic Visual Cryptography . . . . . . . . . . 300
V. Petrauskiene, J. Ragulskiene, E. Sakyte, and M. Ragulskis

Vision-Based Horizon Detection and Target Tracking for UAVs . . . . . . . . . . 310
Yingju Chen, Ahmad Abushakra, and Jeongkyu Lee

Bag-of-Visual-Words Approach to Abnormal Image Detection in Wireless Capsule Endoscopy Videos . . . . . . . . . . 320
Sae Hwang
A Relevance Feedback Framework for Image Retrieval Based on Ant Colony Algorithm . . . . . . . . . . 328
Guang-Peng Chen, Yu-Bin Yang, Yao Zhang, Ling-Yan Pan, Yang Gao, and Lin Shang

A Closed Form Algorithm for Superresolution . . . . . . . . . . 338
Marcelo O. Camponez, Evandro O.T. Salles, and Mário Sarcinelli-Filho

A Parallel Hybrid Video Coding Method Based on Noncausal Prediction with Multimode . . . . . . . . . . 348
Cui Wang and Yoshinori Hatori

Color-Based Extensions to MSERs . . . . . . . . . . 358
Aaron Chavez and David Gustafson

3D Model Retrieval Using the Histogram of Orientation of Suggestive Contours . . . . . . . . . . 367
Sang Min Yoon and Arjan Kuijper

Adaptive Discrete Laplace Operator . . . . . . . . . . 377
Christophe Fiorio, Christian Mercat, and Frédéric Rieux

Stereo Vision-Based Improving Cascade Classifier Learning for Vehicle Detection . . . . . . . . . . 387
Jonghwan Kim, Chung-Hee Lee, Young-Chul Lim, and Soon Kwon

Towards a Universal and Limited Visual Vocabulary . . . . . . . . . . 398
Jian Hou, Zhan-Shen Feng, Yong Yang, and Nai-Ming Qi

Human Body Shape and Motion Tracking by Hierarchical Weighted ICP . . . . . . . . . . 408
Jia Chen, Xiaojun Wu, Michael Yu Wang, and Fuqin Deng

Multi-view Head Detection and Tracking with Long Range Capability for Social Navigation Planning . . . . . . . . . . 418
Razali Tomari, Yoshinori Kobayashi, and Yoshinori Kuno

A Fast Video Stabilization System Based on Speeded-up Robust Features . . . . . . . . . . 428
Minqi Zhou and Vijayan K. Asari

Detection of Defect in Textile Fabrics Using Optimal Gabor Wavelet Network and Two-Dimensional PCA . . . . . . . . . . 436
A. Srikaew, K. Attakitmongcol, P. Kumsawat, and W. Kidsang

Introducing Confidence Maps to Increase the Performance of Person Detectors . . . . . . . . . . 446
Andreas Zweng and Martin Kampel
Monocular Online Learning for Road Region Labeling and Object Detection from a Moving Platform . . . . . . . . . . 456
Chung-Ching Lin and Marilyn Wolf

Detection and Tracking Faces in Unconstrained Color Video Streams . . . . . . . . . . 466
Cornélia Janayna P. Passarinho, Evandro Ottoni T. Salles, and Mário Sarcinelli-Filho

Model-Based Chart Image Classification . . . . . . . . . . 476
Ales Mishchenko and Natalia Vassilieva

Kernel-Based Motion-Blurred Target Tracking . . . . . . . . . . 486
Yi Wu, Jing Hu, Feng Li, Erkang Cheng, Jingyi Yu, and Haibin Ling

Robust Foreground Detection in Videos Using Adaptive Color Histogram Thresholding and Shadow Removal . . . . . . . . . . 496
Akintola Kolawole and Alireza Tavakkoli

Deformable Object Shape Refinement and Tracking Using Graph Cuts and Support Vector Machines . . . . . . . . . . 506
Mehmet Kemal Kocamaz, Yan Lu, and Christopher Rasmussen

A Non-intrusive Method for Copy-Move Forgery Detection . . . . . . . . . . 516
Najah Muhammad, Muhammad Hussain, Ghulam Muhamad, and George Bebis

An Investigation into the Use of Partial Face in the Mobile Environment . . . . . . . . . . 526
G. Mallikarjuna Rao, Praveen Kumar, G. Vijaya Kumari, Amit Pande, and G.R. Babu

Optimal Multiclass Classifier Threshold Estimation with Particle Swarm Optimization for Visual Object Recognition . . . . . . . . . . 536
Shinko Y. Cheng, Yang Chen, Deepak Khosla, and Kyungnam Kim

A Parameter-Free Locality Sensitive Discriminant Analysis and Its Application to Coarse 3D Head Pose Estimation . . . . . . . . . . 545
A. Bosaghzadeh and F. Dornaika

Image Set-Based Hand Shape Recognition Using Camera Selection Driven by Multi-class AdaBoosting . . . . . . . . . . 555
Yasuhiro Ohkawa, Chendra Hadi Suryanto, and Kazuhiro Fukui

Image Segmentation Based on k-Means Clustering and Energy-Transfer Proximity . . . . . . . . . . 567
Jan Gaura, Eduard Sojka, and Michal Krumnikl

SERP: SURF Enhancer for Repeated Pattern . . . . . . . . . . 578
Seung Jun Mok, Kyungboo Jung, Dong Wook Ko, Sang Hwa Lee, and Byung-Uk Choi
Shape Abstraction through Multiple Optimal Solutions . . . . . . . . . . 588
Marlen Akimaliev and M. Fatih Demirci

Evaluating Feature Combination in Object Classification . . . . . . . . . . 597
Jian Hou, Bo-Ping Zhang, Nai-Ming Qi, and Yong Yang

Solving Geometric Co-registration Problem of Multi-spectral Remote Sensing Imagery Using SIFT-Based Features toward Precise Change Detection . . . . . . . . . . 607
Mostafa Abdelrahman, Asem Ali, Shireen Elhabian, and Aly A. Farag

Color Compensation Using Nonlinear Luminance-RGB Component Curve of a Camera . . . . . . . . . . 617
Sejung Yang, Yoon-Ah Kim, Chaerin Kang, and Byung-Uk Lee

Augmenting Heteronanostructure Visualization with Haptic Feedback . . . . . . . . . . 627
Michel Abdul-Massih, Bedřich Beneš, Tong Zhang, Christopher Platzer, William Leavenworth, Huilong Zhuo, Edwin R. García, and Zhiwen Liang

An Analysis of Impostor Based Level of Detail Approximations for LIDAR Data . . . . . . . . . . 637
Chad Mourning, Scott Nykl, and David Chelberg

UI Generation for Data Visualisation in Heterogenous Environment . . . . . . . . . . 647
Miroslav Macik, Martin Klima, and Pavel Slavik

An Open-Source Medical Image Processing and Visualization Tool to Analyze Cardiac SPECT Images . . . . . . . . . . 659
Luis Roberto Pereira de Paula, Carlos da Silva dos Santos, Marco Antonio Gutierrez, and Roberto Hirata Jr.

CollisionExplorer: A Tool for Visualizing Droplet Collisions in a Turbulent Flow . . . . . . . . . . 669
M.V. Rohith, Hossein Parishani, Orlando Ayala, Lian-Ping Wang, and Chandra Kambhamettu

A Multi Level Time Model for Interactive Multiple Dataset Visualization: The Dataset Sequencer . . . . . . . . . . 681
Thomas Beer, Gerrit Garbereder, Tobias Meisen, Rudolf Reinhard, and Torsten Kuhlen

Automatic Generation of Aesthetic Patterns with the Use of Dynamical Systems . . . . . . . . . . 691
Krzysztof Gdawiec, Wieslaw Kotarski, and Agnieszka Lisowska
A Comparative Evaluation of Feature Detectors on Historic Repeat Photography . . . . . . . . . . 701
Christopher Gat, Alexandra Branzan Albu, Daniel German, and Eric Higgs

Controllable Simulation of Particle System . . . . . . . . . . 715
Muhammad Rusdi Syamsuddin and Jinwook Kim

3D-City Modeling: A Semi-Automatic Framework for Integrating Different Terrain Models . . . . . . . . . . 725
Mattias Roupé and Mikael Johansson

Author Index . . . . . . . . . . 735
EM+TV Based Reconstruction for Cone-Beam CT with Reduced Radiation

Ming Yan1, Jianwen Chen2, Luminita A. Vese1, John Villasenor2, Alex Bui3, and Jason Cong4

1 Department of Mathematics, University of California, Los Angeles
2 Department of Electrical Engineering, University of California, Los Angeles
3 Department of Radiological Sciences, University of California, Los Angeles
4 Department of Computer Science, University of California, Los Angeles
Los Angeles, CA 90095, United States
Abstract. Computerized tomography (CT) plays a critical role in modern medicine. However, the radiation associated with CT is significant. Methods that can enable CT imaging with less radiation exposure but without sacrificing image quality are therefore extremely important. This paper introduces a novel method for enabling image reconstruction at lower radiation exposure levels with convergence analysis. The method is based on the combination of expectation maximization (EM) and total variation (TV) regularization. While both EM and TV methods are known, their combination as described here is novel. We show that EM+TV can reconstruct a better image using much fewer views, thus reducing the overall dose of radiation. Numerical results show the efficiency of the EM+TV method in comparison to filtered backprojection and classic EM. In addition, the EM+TV algorithm is accelerated with GPU multicore technology, and the high performance speed-up makes the EM+TV algorithm feasible for future practical CT systems.

Keywords: expectation maximization, CT reconstruction, total variation, GPU, medical image processing.

1 Introduction
As a class of methods for reconstructing two-dimensional and three-dimensional images from the projections of an object, iterative reconstruction has many applications, including computerized tomography (CT), positron emission tomography (PET), and magnetic resonance imaging (MRI). Iterative reconstruction is quite different from the filtered back projection (FBP) method [1,2], the algorithm most commonly used by manufacturers of commercial imaging equipment. The main advantages of iterative reconstruction over FBP are reduced sensitivity to noise and increased data collection flexibility [3]. For example, the data can be collected over any set of lines, the projections do not have to be distributed uniformly, and the projections can even be incomplete (limited angle). There are many available algorithms for iterative reconstruction. Most of these algorithms are based on solving a system of linear equations $Ax = b$, where $x = (x_1, \cdots, x_N)^T \in \mathbb{R}^N$ is the original unknown image represented as a vector, $b = (b_1, \cdots, b_M)^T \in \mathbb{R}^M$ is the given measurement, and $A$ is an $M \times N$ matrix describing the direct transformation from the original image to the measurements. $A$ depends on the imaging modality used; for example, in CT, $A$ is the discrete Radon transform, with each row describing an integral along one straight line, and all the elements of $A$ are nonnegative.

One example of iterative reconstruction uses an expectation maximization (EM) algorithm [4]. The noise in $b$ can be modeled as Poisson noise. Then, if $x$ is given and $A$ is known, the conditional probability of $b$ under the Poisson distribution is
\[
P(b\,|\,Ax) = \prod_{i=1}^{M} \frac{e^{-(Ax)_i}\,\big((Ax)_i\big)^{b_i}}{b_i!}.
\]
Given an initial guess $x^0$, the EM iteration for $n = 0, \cdots$ is
\[
x_j^{n+1} = \frac{x_j^n}{\sum_i a_{ij}} \sum_i a_{ij}\,\frac{b_i}{(Ax^n)_i}. \tag{1}
\]
All the summations in $i$ and $j$ are from $1$ to $M$ and $N$, respectively.

The total-variation regularization method was originally proposed by Rudin, Osher and Fatemi [5] to remove noise in an image while preserving edges. This technique is widely used in image processing and can be expressed in terms of minimizing an energy functional of the form
\[
\min_x \int_\Omega |\nabla x| + \alpha \int_\Omega F(Ax, b),
\]
where $x$ is viewed as a two- or three-dimensional image with spatial domain $\Omega$, $A$ is usually a blurring operator, $b$ is the given noisy, blurry image, and $F(Ax, b)$ is a data-fidelity term. For example, for Gaussian noise, $F(Ax, b) = \|Ax - b\|_2^2$.

In the present paper we combine the EM algorithm with TV regularization. While each of these methods has been described individually in the literature, the combination of these two methods is new in CT reconstruction. The assumption is that the reconstructed image cannot have a large total variation (thus noise and reconstruction artifacts are removed). For related relevant work, we refer to [6,7,8,9,10], and to [11] for issues related to compressive sensing.
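For illustration only (this is not the authors' implementation), the following is a minimal NumPy sketch of the EM update in Eq. (1). It assumes $A$ is available as an explicit nonnegative matrix, which is practical only for small problems; a real CT system matrix would be applied implicitly through ray tracing.

```python
import numpy as np

def em_update(x, A, b, eps=1e-12):
    """One EM iteration of Eq. (1):
    x_j <- x_j / (sum_i a_ij) * sum_i a_ij * b_i / (A x)_i."""
    Ax = np.maximum(A @ x, eps)                  # forward projection, guarded against zeros
    back = A.T @ (b / Ax)                        # backward projection of the ratios b_i/(Ax)_i
    return x * back / np.maximum(A.sum(axis=0), eps)

# toy usage with a random nonnegative system
rng = np.random.default_rng(0)
A = rng.random((64, 32))
x_true = rng.random(32)
b = A @ x_true
x = np.ones(32)
for _ in range(200):
    x = em_update(x, A, b)
```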
2 The Proposed Method (EM+TV)
In the classic EM algorithm, no prior information about the solution is provided. However, if we are given a priori knowledge that the solution has homogeneous regions and sharp edges, the objective is to apply this information to reconstruct an image with both minimal total variation and maximal probability. Thus, we can consider finding a Pareto optimal point by solving a scalarization of these two objective functions, and the problem becomes
\[
\begin{aligned}
\min_x \quad & E(x) := \beta \int_\Omega |\nabla x| + \sum_i \big((Ax)_i - b_i \log (Ax)_i\big), \\
\text{subject to} \quad & x_j \ge 0, \quad j = 1, \cdots, N,
\end{aligned} \tag{2}
\]
where $\beta > 0$ is a parameter balancing the two terms, TV and EM. This is a convex constrained problem, and we can find the optimal solution by solving the Karush-Kuhn-Tucker (KKT) conditions [12]:
\[
-\beta\, \mathrm{div}\!\left(\frac{\nabla x}{|\nabla x|}\right)_{\!j} + \sum_i a_{ij}\left(1 - \frac{b_i}{(Ax)_i}\right) - y_j = 0, \quad j = 1, \cdots, N,
\]
\[
y_j \ge 0, \quad x_j \ge 0, \quad j = 1, \cdots, N, \qquad y^T x = 0.
\]
By the positivity of $\{x_j\}$, $\{y_j\}$ and the complementary slackness condition $y^T x = 0$, we have $x_j y_j = 0$ for all $j = 1, \cdots, N$. After multiplying by $x_j$ (and dividing by $\sum_i a_{ij}$), we obtain
\[
-\beta\, \frac{x_j}{\sum_i a_{ij}}\, \mathrm{div}\!\left(\frac{\nabla x}{|\nabla x|}\right)_{\!j} + x_j - \frac{\sum_i a_{ij}\,\frac{b_i}{(Ax)_i}}{\sum_i a_{ij}}\, x_j = 0, \quad j = 1, \cdots, N.
\]
The last term on the left-hand side is an EM step (1), which we can replace by $x_j^{EM}$, and we finally obtain
\[
-\beta\, \frac{x_j}{\sum_i a_{ij}}\, \mathrm{div}\!\left(\frac{\nabla x}{|\nabla x|}\right)_{\!j} + x_j - x_j^{EM} = 0, \quad j = 1, \cdots, N.
\]
To solve the above equation in $x$, with $x_j^{EM}$ fixed from the previous step, we use a semi-implicit iterative scheme for several steps, alternated with the EM step. The algorithm is shown below, and the convergence of the proposed algorithm is shown in the next section.

Algorithm 1. Proposed EM+TV algorithm
  Input: x^0 = 1
  for Out = 1:IterMax do          /* IterMax: number of outer iterations */
      x^{0,0} = x^{Out-1}
      for k = 1:1:K do            /* K: number of EM updates */
          x^{k,0} = EM(x^{k-1,0}) /* including one Ax and one A^T y */
      end
      x^{Out} = TV(x^{K,0})
  end
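Below is a minimal sketch of the outer loop of Algorithm 1, for illustration only. The TV(.) step here is a simple explicit-descent stand-in with a quadratic fidelity term rather than the semi-implicit scheme of the paper, and the shape of the image grid is an assumed parameter.

```python
import numpy as np

def em_update(x, A, b, eps=1e-12):
    """One EM update, Eq. (1); same as the earlier sketch."""
    Ax = np.maximum(A @ x, eps)
    return x * (A.T @ (b / Ax)) / np.maximum(A.sum(axis=0), eps)

def tv_step(x_em, shape, beta=0.1, iters=20, tau=0.1, eps=1e-8):
    """Stand-in for the TV(.) update: a few explicit descent steps on
    beta*TV(u) + 0.5*||u - x_em||^2 for a 2-D image (the paper instead
    uses a semi-implicit scheme and a different fidelity term)."""
    u = x_em.reshape(shape).astype(float).copy()
    f = u.copy()
    for _ in range(iters):
        gx = np.diff(u, axis=0, append=u[-1:, :])
        gy = np.diff(u, axis=1, append=u[:, -1:])
        mag = np.sqrt(gx ** 2 + gy ** 2 + eps)
        px, py = gx / mag, gy / mag
        div = (px - np.roll(px, 1, axis=0)) + (py - np.roll(py, 1, axis=1))
        u -= tau * (-beta * div + (u - f))
    return np.clip(u, 0.0, None).ravel()

def em_tv(A, b, shape, iter_max=100, K=3, beta=0.1):
    """Outer loop of Algorithm 1: K EM updates alternated with one TV step."""
    x = np.ones(A.shape[1])
    for _ in range(iter_max):
        for _ in range(K):
            x = em_update(x, A, b)   # each update uses one Ax and one A^T y
        x = tv_step(x, shape, beta=beta)
    return x
```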
3 Convergence Analysis of the Proposed Algorithm
We will show, in this section, that the EM+TV algorithm is equivalent to an EM algorithm with a priori information, and provide the convergence analysis of the EM+TV algorithm. The EM algorithm is a general approach for maximizing a posterior distribution when some of the data is missing [13]. It is an iterative method that alternates between expectation (E) steps and maximization (M) steps. For image reconstruction, we assume that the missing data is $\{z_{ij}\}$, describing the intensity of pixel (or voxel) $j$ observed by detector $i$. Therefore the observed data are $b_i = \sum_j z_{ij}$. We assume that $z$ is a realization of a multi-valued random variable $Z$, and for each $(i, j)$ pair $z_{ij}$ follows a
Poisson distribution with expected value $a_{ij} x_j$, because the sum of two Poisson distributed random variables also follows a Poisson distribution whose expected value is the sum of the two expected values. The original E-step is to find the expectation of the log-likelihood given the present variables $x^k$:
\[
Q(x\,|\,x^k) = E_{z|x^k, b} \log p(x, z\,|\,b).
\]
Then, the M-step is to choose $x^{k+1}$ to maximize the expected log-likelihood $Q(x\,|\,x^k)$ found in the E-step:
\[
\begin{aligned}
x^{k+1} &= \arg\max_x E_{z|x^k, b} \log p(x, z\,|\,b) = \arg\max_x E_{z|x^k, b} \log\big(p(b, z\,|\,x)\,p(x)\big) \\
&= \arg\max_x E_{z|x^k, b} \sum_{i,j}\big(z_{ij}\log(a_{ij}x_j) - a_{ij}x_j\big) - \beta J(x) \\
&= \arg\min_x \sum_{i,j}\big(a_{ij}x_j - E_{z|x^k, b}\, z_{ij}\log(a_{ij}x_j)\big) + \beta J(x).
\end{aligned} \tag{3}
\]
From (3), all we need before solving it is $\{E_{z|x^k, b}\, z_{ij}\}$. Therefore we compute the expectation of the missing data $\{z_{ij}\}$ given the present $x^k$ and the condition $b_i = \sum_j z_{ij}$, and denote this as an E-step. Because, for fixed $i$, the $\{z_{ij}\}$ are Poisson variables with means $\{a_{ij}x_j^k\}$, the distribution of $z_{ij}$ is the binomial distribution $\big(b_i,\, a_{ij}x_j^k/(Ax^k)_i\big)$; thus we can find the expectation of $z_{ij}$ under all these conditions by the following E-step:
\[
z_{ij}^{k+1} \equiv E_{z|x^k, b}\, z_{ij} = a_{ij} x_j^k\, b_i / (Ax^k)_i. \tag{4}
\]
After obtaining the expectations of all $z_{ij}$, we can solve the M-step (3). We will show that EM-type algorithms are exactly the described EM algorithms with a priori information. Recalling the definition of $x_j^{EM}$, we have
\[
x_j^{EM} = \sum_i z_{ij}^{k+1} \Big/ \sum_i a_{ij}.
\]
Therefore, the M-step is equivalent to
\[
\begin{aligned}
x^{k+1} &= \arg\min_x \sum_{i,j}\big(a_{ij}x_j - z_{ij}^{k+1}\log(a_{ij}x_j)\big) + \beta \int_\Omega |\nabla x| \\
&= \arg\min_x \sum_j \Big(\sum_i a_{ij}\Big)\big(x_j - x_j^{EM}\log(x_j)\big) + \beta \int_\Omega |\nabla x|.
\end{aligned}
\]
Then we will show, in the following theorem, that the log-likelihood is increasing.

Theorem 1. The objective functional (negative log-likelihood) $E(x)$ in (2), with $x^k$ given by Algorithm 1, decreases until it attains a minimum.
Proof. For all $k$ and $i$, we always have the constraint $\sum_j z_{ij}^k = b_i$. Therefore, we have the following inequality:
\[
\begin{aligned}
b_i \log (Ax^{k+1})_i - b_i \log (Ax^k)_i
&= b_i \log\!\left(\frac{\sum_j a_{ij} x_j^{k+1}}{(Ax^k)_i}\right)
 = b_i \log\!\left(\sum_j \frac{z_{ij}^{k+1}}{b_i}\,\frac{a_{ij} x_j^{k+1}}{a_{ij} x_j^k}\right) \\
&\ge b_i \sum_j \frac{z_{ij}^{k+1}}{b_i} \log\!\left(\frac{a_{ij} x_j^{k+1}}{a_{ij} x_j^k}\right) \qquad \text{(Jensen's inequality)} \\
&= \sum_j z_{ij}^{k+1}\log(a_{ij} x_j^{k+1}) - \sum_j z_{ij}^{k+1}\log(a_{ij} x_j^k).
\end{aligned} \tag{5}
\]
This inequality gives us
\[
\begin{aligned}
E(x^{k+1}) - E(x^k)
&= \sum_i \big((Ax^{k+1})_i - b_i\log(Ax^{k+1})_i\big) + \beta\int_\Omega |\nabla x^{k+1}| \\
&\quad - \sum_i \big((Ax^k)_i - b_i\log(Ax^k)_i\big) - \beta\int_\Omega |\nabla x^k| \\
&\le \sum_{i,j} \big(a_{ij}x_j^{k+1} - z_{ij}^{k+1}\log(a_{ij}x_j^{k+1})\big) + \beta\int_\Omega |\nabla x^{k+1}| \\
&\quad - \sum_{i,j} \big(a_{ij}x_j^{k} - z_{ij}^{k+1}\log(a_{ij}x_j^{k})\big) - \beta\int_\Omega |\nabla x^k| \;\le\; 0.
\end{aligned}
\]
The first inequality comes from (5) and the second inequality comes from the M-step (3). When $E(x^{k+1}) = E(x^k)$, these two inequalities have to hold with equality. The first holds with equality if and only if $x_j^{k+1} = \alpha x_j^k$ for all $j$, with $\alpha$ a constant, while the second holds with equality if and only if $x^k$ and $x^{k+1}$ are minimizers of the M-step (3). Since the functional to be minimized in the M-step (3) is strictly convex, $\alpha$ has to be $1$ and
\[
\beta x_j^k\, \partial J(x^k)_j + \sum_i \big(a_{ij} x_j^k - z_{ij}^{k+1}\big) = 0, \quad j = 1, \cdots, N.
\]
After plugging the E-step (4) into these equations, we have
\[
\beta x_j^k\, \partial J(x^k)_j + \sum_i a_{ij} x_j^k - \sum_i \frac{a_{ij} x_j^k\, b_i}{(Ax^k)_i} = 0, \quad j = 1, \cdots, N.
\]
Therefore, $x^k$ is the minimizer of the original problem.
The log-likelihood function will increase for each iteration until the solution is found, and from the proof, we do not fully use the M-step. Even if the M-step is not solved exactly, it will still increase so far as Q(xk+1 |xk ) > Q(xk |xk ) is satisfied.
4 GPU Implementation
In this section, we consider a fast graphics processing unit (GPU)-based implementation of the computationally challenging EM+TV algorithm. Since the forward projection ($Ax$) and the backward projection ($A^T y$) account for about 95% of the computational complexity of the entire algorithm, we focus on these two projections. For forward projection, as illustrated in Fig. 1, for each source-detector pair it is only necessary to calculate the approximate line integral, without updating the pixels. However, for backward projection, if ray tracing is used, there will be memory conflicts when it is parallelized: different threads may update the same pixel at the same time, because for a given source-detector pair all the pixels intersecting the ray are updated.
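As an illustration of the forward projection as a per-ray line integral, here is a toy Python sketch. It uses uniform nearest-neighbour sampling rather than the exact ray tracer (Siddon's algorithm) used in the paper, and assumes `src` and `det` are 2-D pixel coordinates given as NumPy arrays.

```python
import numpy as np

def forward_project_ray(image, src, det, n_samples=256):
    """Approximate the line integral of a 2-D image along the ray from the
    source position `src` to the detector position `det` (both length-2
    arrays in pixel coordinates) by uniform nearest-neighbour sampling."""
    h, w = image.shape
    t = np.linspace(0.0, 1.0, n_samples)
    pts = src[None, :] * (1.0 - t[:, None]) + det[None, :] * t[:, None]
    rows = np.clip(np.round(pts[:, 0]).astype(int), 0, h - 1)
    cols = np.clip(np.round(pts[:, 1]).astype(int), 0, w - 1)
    step = np.linalg.norm(det - src) / n_samples   # arc length per sample
    return float(image[rows, cols].sum() * step)
```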
Fig. 1. Forward/Backward Projection
The forward projection can be parallelized on the GPU platform easily: a large number of threads operate on the forward ray tracer simultaneously for different source-detector pairs. For backward projection, since there are dependencies and conflicts when two threads access one pixel, parallelization is possible but more challenging. The compute unified device architecture (CUDA) provides atomic functions that guarantee mutually exclusive access to a memory address, and these can be used to handle such potential data conflicts. However, atomic functions are very costly; therefore, to avoid atomic operations while still updating all the pixels, a new backward projection algorithm is needed. We briefly describe our new backward projection algorithm here. For each view, we select detectors that are sufficiently far apart and place them in one group, so that there are no write conflicts within the group and all ray tracers in one group can be processed in parallel. As illustrated in Fig. 1, this corresponds to choosing the tracer lines of the same color. In our case, we choose the distance between two adjacent detectors in a group to be 6. A minimal sketch of this detector-grouping idea is given below. Finally, the EM+TV algorithm has been ported to the GPU platform. The medical image data is transferred from host memory to device memory at the beginning of the routine; once the data is in device memory, the computation starts, and the reconstructed image is written back to host memory when all the computation is finished.
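The following plain-Python sketch only illustrates the grouping scheme; the actual CUDA kernels, thread-block configuration, and the spacing value are device- and geometry-specific choices described in the text.

```python
def detector_groups(num_detectors, spacing=6):
    """Partition the detector indices of one view into groups whose members
    are at least `spacing` apart.  Rays traced for detectors in the same
    group are assumed (as argued in the text) not to update the same voxel,
    so each group can be back-projected in parallel without atomics."""
    return [list(range(start, num_detectors, spacing)) for start in range(spacing)]

# Example: 20 detectors, spacing 6 -> 6 sequential passes, each internally parallel
for group in detector_groups(20, spacing=6):
    print(group)   # [0, 6, 12, 18], [1, 7, 13, 19], ...
```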
5 Experimental Results
In two dimensions, we compare the reconstruction results obtained by the proposed EM+TV method with those obtained by filtered back projection (FBP). For the numerical experiments, we choose the two-dimensional Shepp-Logan phantom of dimension 256x256. The projections are obtained using Siddon's algorithm. The sinogram data is corrupted with Poisson noise. With the FBP method, we present results using 36 views (every 10 degrees), 180 views, and 360 views; for each view there are 301 measurements. In order to show that we can reduce the number of views by using EM+TV, we only use 36 views for the proposed method. The reconstruction results are shown in Figure 2. We notice the much improved results obtained with EM+TV using only 36 views, both visually and according to the root-mean-square error (RMSE) between the original and reconstructed images, scaled between 0 and 255, by comparison with FBP using 36, 180 or even 360 views. Using the proposed EM+TV method, with only a few views we obtain sharp results without artifacts.
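For concreteness, this is one way to compute the reported error measure; the paper only states that images are scaled between 0 and 255, so the independent min-max scaling below is an assumption.

```python
import numpy as np

def rmse_255(reference, reconstruction):
    """RMSE between two images after linearly rescaling each to [0, 255]."""
    def scale(u):
        u = u.astype(float)
        return 255.0 * (u - u.min()) / max(u.max() - u.min(), 1e-12)
    d = scale(reference) - scale(reconstruction)
    return float(np.sqrt(np.mean(d ** 2)))
```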
Fig. 2. Reconstruction results by FBP with 36, 180, and 360 views and by EM+TV with 36 views. RMSE values are shown in parentheses: FBP 36 (51.1003), FBP 180 (14.3698), FBP 360 (12.7039), EM+TV 36 (3.086).
The EM+TV algorithm was also tested on the 3D 128x128x128 Shepp-Logan phantom. First, we obtained the projections using Siddon’s algorithm [14]. Only 36 views were taken (every 10 degrees), and for each view there were 301x255 measurements. The code is implemented on the GPU platform (Tesla C1060) with a single-precision floating-point data type. The inner loop of EMupdate has three iterations and the EMupdate and TVupdate will repeat for 100 iterations. For forward projection, 512x64 blocks were used, and for each block there were 288 threads. For backward projection, 24 blocks are used and each block has 64x5x1 threads. Compared with the single-thread implementation on a CPU platform (Intel i7-920,2.66GHz), implementation on the GPU provides more than 26x speed-up for forward projection, and 20x speed-up for backward projection, and the overall reconstruction time is about 330 seconds. The reconstructed image of the EM+TV algorithm on the GPU platform with RMSE is provided in Fig. 3, compared with the result of EM without regularization after 1000 iterations. We can see that the result of EM+TV with only 36 views delivers a very good quality compared with the EM method without TV regularization.
Fig. 3. Reconstructed images (RMSE shown in parentheses): original, EM+TV (3.7759), EM (31.902).
Additionally, EM+TV was applied to a larger 256x256x256 phantom, which has smaller features. Different numbers of views and choices of the parameter β are chosen, and the results with RMSE are shown in Fig. 4. β is the parameter that balances the EM and TV steps: if it is too large, the result is blurred; if it is too small, the result has artifacts. The problem of choosing the best β is under investigation. The numerical experiments above show the efficiency of EM+TV for CT reconstruction using fewer views. However, the chosen views are equally spaced. The next experiment shows that we can choose some special views and
Fig. 4. Reconstructed data. From left to right, top to bottom: the middle slice of the original image; results with 36 views for 1/β = 2.5, 10, 30 (RMSE 6.710, 5.466, 5.127); 40 views for 1/β = 10, 30 (RMSE 5.430, 5.080); 50 views for 1/β = 10, 30 (RMSE 4.353, 3.881).
Fig. 5. Adaptive study of the views (RMSE in parentheses). From left to right, top to bottom: 18 equal views (9.9785); 6 equal views added onto 18 equal views, 18+6 (4.9658); 24 equal views (4.2191); 6 special views added onto 18 views, 18+3U3D (3.6834); 2 views (90, 270) added, 18+1U1D (6.8468); 4 views (70, 110, 250, 290) added, 18+2U2D (5.4875); 4 views (90, 250, 270, 290) added, 18+1U3D (4.0720); 4 views (70, 90, 110, 270) added, 18+3U1D (3.9824).
further reduce the number of views. This test is done on 2D phantom data, but it can also be applied to the 3D case. Instead of choosing 24 equally spaced views, we can add six (or fewer) views to 18 equally spaced views (every 20 degrees). The six views are chosen at 70, 90, 110, 250, 270 and 290 degrees, and the results with the corresponding RMSE are shown in Fig. 5. From the results, we can see that if the six added views are equally spaced, the result is worse than that of 24 equally spaced views. From the phantom, we can see that there are many small objects (features) along the middle vertical line. Therefore, we choose additional views from the top and bottom, which gives better results. The results with 22 views (the 3U1D configuration, with three added views from the top and one from the bottom, and the 1U3D configuration) are even better than those with 24 equal views, both visually and in terms of RMSE.
6 Conclusion
In this paper, a method that combines EM and TV for CT image reconstruction is proposed. This method can provide very good results using a much smaller number of views. It requires fewer measurements to obtain a comparable image, which results in a significant decrease of the radiation dose. The method is extended to three dimensions and can be used for real data. One of the challenges in EM+TV is the long computation time. We have demonstrated that, by suitable parallel algorithm design and efficient implementation of EM+TV on a
GPU platform, execution time can be reduced by well over an order of magnitude. In addition, we believe there are opportunities for further optimizations in areas such as memory access, instruction flow, and parallelization of the backward algorithm that can further improve execution time. We believe that, as demonstrated, the combination of algorithms and optimized implementation on appropriate platforms has the potential to enable high-quality image reconstruction with reduced radiation exposure, while also enabling relatively fast image reconstruction times.

Acknowledgement. This work was supported by the Center for Domain-Specific Computing (CDSC) under the NSF Expeditions in Computing Award CCF-0926127.
References
1. Shepp, L., Logan, B.: The Fourier reconstruction of a head section. IEEE Transactions on Nuclear Science 21, 21–34 (1974)
2. Pan, X., Sidky, E., Vannier, M.: Why do commercial CT scanners still employ traditional, filtered back-projection for image reconstruction? Inverse Problems 25, 123009 (2009)
3. Kak, A., Slaney, M.: Principles of Computerized Tomographic Imaging. Society of Industrial and Applied Mathematics, Philadelphia (2001)
4. Shepp, L., Vardi, Y.: Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging 1, 113–122 (1982)
5. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
6. Sidky, E., Pan, X.: Image reconstruction in circular cone-beam computed tomography by total variation minimization. Physics in Medicine and Biology 53, 4777–4807 (2008)
7. Brune, C., Sawatzky, A., Wubbeling, F., Kosters, T., Burger, M.: An analytical view on EM-TV based methods for inverse problems with Poisson noise. Preprint, University of Münster (2009)
8. Jia, X., Lou, Y., Li, R., Song, W., Jiang, S.: GPU-based fast cone beam CT reconstruction from undersampled and noisy projection data via total variation. Medical Physics 37, 1757–1760 (2010)
9. Yan, M., Vese, L.A.: Expectation maximization and total variation based model for computed tomography reconstruction from undersampled data. In: Proceedings of SPIE Medical Imaging: Physics of Medical Imaging, vol. 7961, p. 79612X (2011)
10. Chen, J., Yan, M., Vese, L.A., Villasenor, J., Bui, A., Cong, J.: EM+TV for reconstruction of cone-beam CT with curved detectors using GPU. In: Proceedings of the International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine, pp. 363–366 (2011)
11. Compressive Sensing Resources, http://dsp.rice.edu/cs
12. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
13. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977)
14. Siddon, R.: Fast calculation of the exact radiological path for a three-dimensional CT array. Medical Physics 12, 252–255 (1986)
A Localization Framework under Non-rigid Deformation for Robotic Surgery

Xiang Xiang*

Key Lab of Intelligent Information Processing of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, China
[email protected]
http://www.jdl.ac.cn/user/xxiang

Abstract. In surgery, it is common to open large incisions to remove tiny tumors. Robotic surgery is now well recognized for its high precision, but target localization remains a challenge owing to non-rigid deformations. We therefore propose a precise and flexible localization framework for an MRI-compatible needle-insertion robot. We primarily address two problems: 1) how to predict a position after deformation, and 2) how to map an MRI coordinate to a real-world one. Correspondingly, the primary novelty is a non-rigid position transformation model based on Thin-Plate Splines. A minor contribution lies in the data acquisition for coordinate correspondences. We validate the precision of the whole framework and of each procedure of coordinate acquisition and position transformation. The system under our framework predicts positions with a good approximation to the target's real position.
1 Introduction

Robotic surgery has been well recognized because of its high precision and its potential for telesurgery. The insertion of needles/catheters is a common procedure in modern clinical practice, ranging from superficial needle pricks to deep-seated tumor removal [1]. Doctors used to open large incisions, whereas needle-insertion manipulators have now been designed [2]. Target localization is then a key issue. Generally, it is guided by medical imaging, e.g. Magnetic Resonance Imaging (MRI), X-ray Computed Tomography (CT), ultrasonography, etc. MRI is well suited because it has no harmful effects on the patient or operator. Although MRI rooms are highly restrictive for foreign objects, MRI-compatible manipulators have been developed [3,4,5]. However, accurate localization is still a challenge, owing to non-rigid organ/tissue deformations and the intervention of foreign objects. Thus, we face two problems: 1) how to map an MRI coordinate to a real-world one, and 2) how to predict the position after deformation. To address these problems, we present a precise and flexible localization framework (see Fig. 1) that is robust to non-rigid deformation. Correspondingly, it consists of two key modules: 1) coordinate acquisition, and 2) movement prediction (i.e. position prediction). These handle the issues of correspondence and transformation, respectively.
This work was done while the author was a summer intern at Bio-Medical Precision Engineering Lab, the University of Tokyo.
Fig. 1. Framework overview. (a) Surgery procedure. First, the doctor guides the patient to do MRI scan and marker measurement; then, markers are used to establish a non-rigid mapping model; finally, the target is localized using the model and the needle is inserted. (b) Our MRI scanner. (c) Our robot system. The sensor can be MRI-compatible.
In detail, the coordinate correspondence is built between a target’s coordinate in the patient coordinate system and that in the image coordinate system; the position transformation is built from a target’s position before deformation to that after deformation. Since the target is inaccessible before surgery, we turn to utilizing marker (landmark). For the former issue, a marker can be automatically detected by position sensors (e.g. optical, magnetic, mechanical ones) with the 3-D real-world coordinate captured. In association with this real-world coordinate, we detect a marker from MR images and acquire its 3-D MRI coordinate. Now, we have established the imagepatient coordinate correspondence. Afterwards, for the latter issue, we continuously capture the real-world coordinates of a set of markers to build a mapping model, and concurrently update a position transformation in the time domain based on that model. We hope the model fits the non-rigid deformation well, namely the bending is smooth. For one thing, the more markers we utilize, the better the model fits the deformation. For another thing, if we use the second derivative of spatial mapping to minimize a bending energy, we must find a spline able to be decomposed into global affine and local non-affine subspaces [10], i.e. the Thin-Plate Splines (TPS) [12,13]. Now, we assume that a target’s position also obeys the transformation established using markers. This assumption is approximately valid in practice, for markers are densely distributed near the target in a local stable region. As a result, during a minute period, we first use the correspondence to acquire a target’s real-world coordinate before a minute deformation, and then use the transformation to predict its real-world
coordinate after the deformation. This processing is done continuously, so we can predict the position with a good approximation to the target's real position. In this framework, the primary novelty is the non-rigid transformation model based on TPS. Notably, rigid, affine and non-rigid transformations are incorporated in the same framework. Besides, a minor contribution lies in the coordinate acquisition for establishing correspondences: 1) semi-automatic 2-D MRI coordinate acquisition; 2) two scans in different, freely chosen directions to avoid inaccuracy of the 3-D MRI coordinate due to the limitation of MRI resolution (slice thickness and interval). In the following, related work is first discussed in Sec. 2. Then, our localization framework is elaborated in Sec. 3. Finally, experiments and evaluation are presented in Sec. 4, with the conclusion in Sec. 5.
2 Related Work

For the correspondence issue, we adopt the landmark-based physical method using markers as reliable and distinct feature points. Another way is the visual method: a calibrated camera can also be regarded as a position sensor. We can first build an image-image correspondence based on feature point/region matching (e.g. SIFT [18], SURF [19], MSER [20], etc.) and then convert it to an image-patient one using calibration parameters. However, mismatches are common due to deformations, for most feature matching methods are only robust to rotation/affine transformations. For the transformation issue, we want to establish a non-rigid point-to-point mapping model. Non-rigid registration has been discussed extensively [6,7,8,9,10]. Existing methods can be sorted as follows (a minimal sketch of the TPS mapping of item 3 is given after this list):
1) Fitting the deformation surface using polynomials, e.g. least squares regression [7], the semi-varying coefficient model [7], implicit polynomials [11], etc.;
2) Fitting surfaces using combinations of basis functions, e.g. Fourier or wavelet bases [7];
3) Corresponding control points via spline models, e.g. B-Splines [7], TPS [12,13], etc. The cubic B-Splines [14] do not support point registration, for the control points (i.e. markers) need to lie on a regular grid. However, TPS fits well for non-rigid point-to-point mapping. The theory in [12] has been validated in practice [10,15];
4) Simulating the deformation using physical models, e.g. elastic models [7], fluid models [7], mechanical models [7], optical flow models [7], etc.
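For illustration only (not the authors' code), the sketch below fits a 3-D thin-plate-spline style point mapping from marker correspondences and applies it to new points, using the distance kernel and the least-squares/SVD solve described in Sec. 3.2; function and variable names are our own.

```python
import numpy as np

def fit_tps(P, Q):
    """Fit a 3-D TPS-style mapping sending marker positions P (N x 3, before
    deformation) to Q (N x 3, after deformation), with radial kernel U(r) = r."""
    N = P.shape[0]
    K = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)   # r_ij
    Phi = np.hstack([np.ones((N, 1)), P])                        # [1, x, y, z]
    B = np.zeros((N + 4, N + 4))
    B[:N, :N] = K
    B[:N, N:] = Phi
    B[N:, :N] = Phi.T
    rhs = np.zeros((N + 4, 3))
    rhs[:N, :] = Q
    # least-squares / SVD solve for the warping and affine coefficients
    return np.linalg.lstsq(B, rhs, rcond=None)[0]

def apply_tps(coeffs, P, X):
    """Map query points X (M x 3); the markers P are needed for the radial terms."""
    K = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2)
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.hstack([K, Phi]) @ coeffs
```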
3 A Localization Framework 3.1 Image-Patient Registration As illustrated in Fig. 2, there are three coordinate systems: the MR Image one (I for short), Position Sensor one (S) and Manipulator one (M). Thus, there are three transformations: the rigid one S TI from I to S, the rigid one M TS from S to M, and STtime (⋅) fitting the deformation along the time domain. And also, we have four types of points: needle tip (ndl) and benchmarks (bmk) in the manipulator, markers (mkr) and target
14
X. Xiang
Fig. 2. Robotic surgery scenario and symbol denotation
(tgt) in the patient. If coordinates are written in the homogenous coordinate form (e.g. [ x, y, z,1]Τ ), then we can express one point's coordinate as a 4 1 column vector V , and further N points’ as a 4 N matrix Σ (different to symsum). In the manipulator, the benchmarks’ position M ∑ bmk and needle tip’s position
×
×
Vndl are given as parameters. Before an MRI scan, the sensor measures M and S ∑ mkr . Now we have ∑ bmk = M TS ⋅S ∑ bmk M
M
S
∑ bmk , SVndl (1)
Vndl = M TS ⋅S Vndl
(2)
Then, we acquire I ∑ mkr and IVtgt from MR images. So S
∑ mkr = S TI ⋅ I ∑ mkr
(3)
Vtgt = S TI ⋅I Vtgt
(4)
S
where SVtgt is unknown till now. After the scan, markers’ positions are measured again. Due to deformation, S Σ mkr , SVtgt changes to S Σ' mkr , SV 'tgt . Coordinates before S
and after the scan are connected as
Σ'mkr = S Ttime ( S ∑ mkr )
V 'tgt = M Ttime ( S Vtgt )
S
(5) (6)
We aim to predict MV 'tgt to guide the manipulator. Similar with Eqn. 2, there exists M
V 'tgt = M TS ⋅S V 'tgt
ΔVndl = M V 'tgt − M Vndl We combine all the above equations to compute M ΔV ndl as
Finally, the needle’s movement
M
M
(7) (8)
ΔVndl = M TS [ STtime ( SVtgt ) − SVndl ] = M TS [ STtime ( STI ⋅ IVtgt ) − SVndl ] −1 1 I = M ∑ bmk ⋅ S ∑ bmk [ S Ttime ( S ∑ mkr ⋅ I ∑ −mkr ⋅ Vtgt ) − S V ndl ]
(9)
A Localization Framework under Non-rigid Deformation for Robotic Surgery
15
where ∑ −1 is the pseudo inverse (generalized inverse) matrix ∑ and computed via Singular Value Decomposition (SVD). In Eqn. 9, only STtime (⋅) is unknown. As a unified form, STtime (⋅) can be rigid, affine or non-rigid. No deformation/rigid transformation: S Ttime ( ∑) = ∑ . Then Eqn. 9 is simplified as M
−1 −1 I ΔVndl = M ∑ bmk ⋅S ∑ bmk ( S ∑ mkr ⋅ I ∑ mkr ⋅ Vtgt − S Vndl )
(10)
Affine transformation: If there is a limited amount of shape variability for the patient (e.g. in brain), namely S Ttime can be judged as linear (e.g. affine one). then: S
where R is a 3
⎡R Ttime (∑) = Tlinear ⋅ ∑ = ⎢ ⎣0
P⎤ ⋅∑ 1 ⎥⎦
(11)
×3 rotation matrix and P is a 3×1 translation vector.
S
∑'mkr = S Tlinear ⋅S ∑ mkr via SVD. With Eqn. 11 substituted, Eqn. 9 becomes S
calculated by solving M
Tlinear can be
(12)
−1 1 S −1 I ΔVndl = M ∑ bmk ⋅ S ∑ bmk ( S ∑' mkr ⋅ S ∑ −mkr ⋅ ∑ mkr ⋅ I ∑ mkr ⋅ Vtgt − S Vndl ) −1 S −1 S −1 I = M ∑ bmk ⋅ S ∑ bmk [ ∑' mkr ⋅( S ∑ mkr ⋅ ∑ mkr )⋅ I ∑ mkr ⋅ Vtgt − S Vndl ] −1 −1 I = M ∑ bmk ⋅ S ∑ bmk ( S ∑' mkr ⋅ I ∑ mkr ⋅ Vtgt − S Vndl )
(13)
ΔV ndl has no relabefore the affine deformation. So we only
Notably, Eqn. 13 has the same form with Eqn. 9. This means that
M
tionship with the markers’ positions ∑ mkr need to measure markers’ positions S ∑ ' mkr after the affine deformation. In a word, we S
can consider no deformation, rigid or affine transformation as the same case: if we want to predict a target’s position at a certain time point, then just measure the markers’ positions exactly at that time point. Non-rigid transformation: Finally, if there exists obvious deformation such as contraction, expansion or irregular distortion (e.g. in lung, liver, stomach, etc.), namely S Ttime is non-rigid, then S Ttime need to be approximated by TPS as S
Ttime( S xtgt , S ytgt , S ztgt ) = Λ ⋅ B( S xtgt , S ytgt , S ztgt ) = A( S x'tgt , S y'tgt , S z'tgt )
(14)
where Λ , B , A are all matrices: Λ is the coefficient matrix, B (Before deformation), A (After deformation) can be directly expressed using markers’ coordinates. The calculation of Λ , B , A will be discussed in the next section. Now, we substitute Eqn. −1 1 I 14 into Eqn. 9 to obtain M Δ V ndl = M ∑ bmk ⋅ S ∑ bmk [ Λ ⋅ B ( S ∑ mkr ⋅ I ∑ −mkr ⋅ V tgt ) − S V ndl ] =
M
−1 ∑ bmk ⋅ S ∑ bmk [ Λ ⋅ B ( SVtgt ) − SV ndl ]
(15)
16
X. Xiang
3.2 Non-rigid Mapping Model The coefficient matrix Λ can be calculated by building a TPS-based non-rigid mapping model using N markers’ coordinates ( S xmkr ( i ) , S y mkr (i ) , S z mkr (i ) ) before the deformation and ( S x' mkr ( i ) , S y ' mkr ( i ) , S z ' mkr (i ) ) after the deformation, where i = 1,2,..., N . In the following, the denotation will be simplified to ( S xi , S yi , S z i ) and ( S x'i , S y 'i , S z 'i ) respectively. According to the TPS, the model is set as ⎧ N ⎪ ∑ F j rij + a 0 + a1 x i + a 2 y i + a3 z i = x 'i ⎪ jN=1 ⎪ G j rij + b0 + b1 xi + b2 y i + b3 z i = y ' i ⎪ ∑ j =1 ⎪ N ⎪ H j rij + c 0 + c1 xi + c 2 y i + c 3 z i = z ' i ⎪ ∑ j =1 ⎪⎪ N N N ⎫ ⎨ ∑ Fi = 0 ∑ Gi = 0 ∑ H i = 0 ⎪ ⎪ i =1 i =1 i =1 ⎪ N N ⎪ N ⎪ ⎪ ∑ Fi x i = 0 ∑ Gi xi = 0 ∑ H i xi = 0 ⎪ i =1 i =1 ⎪ iN=1 ⎬C N N ⎪ Fy =0 Gi y i = 0 ∑ H i y i = 0 ⎪ ∑ ∑ i i ⎪ i =1 ⎪ i =1 i =1 ⎪ N N N ⎪ ⎪ ∑ Fi z i = 0 ∑ Gi z i = 0 ∑ H i z i = 0 ⎪ i =1 i =1 ⎭ ⎩⎪ i =1
(16)
where the equation group C is the smoothness constraint, F , G , H are all warping coefficients ( i, j = 1,2,..., N ), and rij is a radial basis and called TPS kernel. Note that C is necessary, for it discourages overly arbitrary mappings. Thus, we can flexibly define C by choosing a specific smoothness measure based on prior knowledge. Besides, rij = Pi − Pj is the Euclidean distance between a marker point Pi ( xi , yi , z i )
and another one Pj ( x j , y j , z j ) other than Pi among the N markers. Namely, (17)
rij2 = ( x i − x j ) 2 + ( y i − y j ) 2 + ( z i − z j ) 2
Then, a set of rij represents the internal structural relationships of the point set. Now, we will illustrate the calculation of Λ , B and A . We rewrite Eqn. 16 in the form of matrices multiplication as ⎡ F1 ⎢F ⎢ 2 ⎢ : ⎢ ⎢ FN ⎢ a0 ⎢ ⎢ a1 ⎢a ⎢ 2 ⎢⎣ a3
G1 G2 : GN b0 b1 b2 b3
H1 ⎤ H 2 ⎥⎥ : ⎥ ⎥ HN ⎥ c0 ⎥ ⎥ c1 ⎥ c2 ⎥ ⎥ c3 ⎥⎦
Τ
⎡0 ⎢r ⎢ 12 ⎢ : ⎢ ⎢r1N ⎢1 ⎢ ⎢ x1 ⎢y ⎢ 1 ⎢⎣ z1
r21
... rN1
1
x1
y1
0
... rN 2
1
x2
y2
: r2 N
: ...
: 0
: : 1 xN
: yN
1
...
1
0
0
0
x2 y2
... x N ... y N
0 0
0 0
0 0
z2
... z N
0
0
0
z1 ⎤ ⎡ x'1 z 2 ⎥⎥ ⎢⎢ x' 2 : ⎥ ⎢ : ⎥ ⎢ z N ⎥ ⎢ x' N = 0⎥ ⎢ 0 ⎥ ⎢ 0⎥ ⎢ 0 0⎥ ⎢ 0 ⎥ ⎢ 0 ⎥⎦ ⎢⎣ 0
y '1 y'2 : y' N 0 0 0 0
z '1 ⎤ z ' 2 ⎥⎥ : ⎥ ⎥ z' N ⎥ 0 ⎥ ⎥ 0 ⎥ 0 ⎥ ⎥ 0 ⎥⎦
Τ
(18)
From left to right, the matrix is Λ , B , A respectively. Note that B , A can be exT ⎡Φ '⎤ . For B , ⎡ Κ Φ⎤ , pressed in a simple form (using partitioned matrices): A= B= ⎢Φ Τ ⎣
0 ⎥⎦
⎢0⎥ ⎣ ⎦
Φ is known and Κ can be computed using Eqn. 17; for A , Φ ' is also know. Thus, the
A Localization Framework under Non-rigid Deformation for Robotic Surgery
17
equation Λ ⋅ B = A can be solved using SVD for Λ . Back to Eqn. 15, with Λ known, M ΔV ndl can be computed as long as we construct B( sVtgt ) according to
B(Sxtgt,Sytgt,Sztgt) =[rtgt, mkr1 rtgt, mkr2 ... rtgt, mkrN 1 xtgt ytgt ztgt]Τ
(19)
4 Experiments and Evaluation The accuracy of the localization discussed above is greatly affected by data accuracy. Since the precision of the position sensor and manipulator is fixed, we attempt to improve the precision of the MRI coordinate acquisition (Sec. 4.1). As shown in Fig. 3, we conduct experiments to validate the precision of data acquisition (Sec. 4.2), transformation (Sec. 4.3), operation (Sec. 4.4) and the whole framework (Sec. 4.5). 4.1 Coordinate Acquisition Acquisition of 2-D MRI coordinates: As illustrated in Fig. 4, a fast and semi-auto localization strategy based on edge detection [16] and corner detection [17] is designed to obtain the 2-D coordinate (i, j ) of interested target and markers, which may have various shapes (simple template matching will not work). After filtering noises, the background of MR images is neat. Then, the only interactive work for the doctor is to select the objects of interest among several candidates. Acquisition of 3-D MRI coordinates: MR images (e.g. DICOM) record a lot of information, e.g. Image Position (coordinate of initial slice’s origin) ( x0 , y 0 , z 0 ) , Slice Number k , Pixel Spacing Δl pixel and Slice Spacing Δl slice . Normally, the 3-D coordinate of a pixel (i, j ) in the k th slice can be obtained after the scan in one direction, such as the direction perpendicular to the transverse plane (t for short, see Fig. 4): ( x t 0 +it ⋅ Δl pixel
,
y t 0 + jt ⋅ Δl pixel
, z t 0 + k t ⋅ Δl slice )
(20)
Fig. 3. Validation experiments. (a) Grape fruit for validating system accuracy. (b) Phantom fully filled with agar for validating rigid transformation. (c) Carton for validating affine transformation. (d) Balloon for validating non-rigid transformation. (e) Manipulator coordinate system and test points for validating sensor-manipulator transformation.
18
X. Xiang
Fig. 4. Left: 2-D MRI coordinate localization in a phantom test. Right: default MRI coordinate system & scanning directions. A: Abdominal; P: Posterior; L: Left; R: Right; H: Head; F: Foot.
However in that way, the z coordinate (containing Δlslice ) is not precise, for slices have thicknesses and intervals. Consequently, we scan the patient in at least two directions, which can be freely chosen. For non-orthogonal directions, their orientation information will be used to project the coordinate; however usually, we prefer those directions perpendicular to MRI scanner’s default coordinate planes, such as the transverse plane, coronal plane (c) and sagittal plane (s) (see Fig. 4). For the latter two, the 3-D coordinate of P can be respectively obtained as ( x c 0 +ic ⋅ Δl pixel
, z c 0 + jc ⋅ Δl pixel )
(21)
( x s 0 + k s ⋅ Δl slice , y s 0 +is ⋅ Δl pixel , z s 0 + j s ⋅ Δl pixel )
(22)
,
y c 0 + k c ⋅ Δl slice
For instance, if we choose to scan twice respectively paralleling to the coronal plane and the sagittal plane, then x coordinate can be know from Eqn. 21, y coordinate can be know from Eqn. 22 and z coordinate can be know either from Eqn. 21 or Eqn. 22. Note that the 3-D MRI coordinate system can be defined freely, as long as the position descriptions of the target and markers follow the same system. Here, for the convenience, we adopt MRI scanner’s default coordinate system (see Fig. 4). 4.2 Coordinate Accuracy Accuracy of 2-D MRI coordinates: Same with [3], we utilize a © Hitachi Open MRI scanner (MRH-500), yet with a magnetic field strength of 0.2 Tesla. The accuracy is shown in Table 1 (ground truth is manually measured). We transform 2-D coordinates in pixel to 3-D in millimeter. Two factors affect the accuracy: one is image quality (resolution); the other is that in some cases, the detected corner point is not exactly the marker (see Fig. 5 (c)).
A Localization Framework under Non-rigid Deformation for Robotic Surgery
19
Fig. 5. 2-D MRI coordinate acquisition. Upper row: original MR images. Lower row: target/marker detection results. Results in a, b are accurate, while that in c is not so accurate. The red circle in (c1) labels the accurate marker position measured manually. Table 1. Accuracy of MR image coordinates Unit mm
Average 5.30
Standard Deviation 2.14
Max Error 9.17
Min Error 0
Accuracy of 3-D MRI coordinates: As we can see in Table 2, the corresponding coordinates in different scan series match well. If we just scan once, then there will be one coordinate axis with approximate coordinates. Table 2. 3-D coordinate acquisition in the grape fruit test
Mkr 1 Mkr 2 Mkr 3 Mkr 4 Mkr 5 Target
AL (transverse) X Y -46.26 4.48 -7.59 4.48 41.63 8 1.79 8 -1.73 -23.64 -4.07 6.83
HL (coronal) X Z -46.26 111.61 -8.76 154.97 42.80 116.30 2.96 76.45 -2.90 118.64 -4.07 118.64
HP (Sagittal) Y Z 4.48 112.78 2.14 154.97 10.34 118.64 5.66 76.45 -35.36 122.16 6.83 113.95
Accuracy of patient coordinates: All coordinates in the sensor coordinate system are measured by the sensor. In our experiments, we use a mechanical one named MicroScribe (© Immersion Corp.). Its measurement error is “< 0.0508mm”. In practice, the average error is 3.00 mm, for manual measurement is instable. This error will be brought to following procedures. In real surgery, we can have more stable measurements using an optical or magnetic sensor. 4.3 Registration Accuracy Accuracy of S TI : We verify this accuracy with a phantom fully filled with agar, as showed in Fig. 3 (b). At first, we use 4 markers’ coordinates I Σ mkr and S Σ mkr to compute S TI according to S ∑ mkr = S TI ⋅I ∑ mkr . Afterwards, we use it to predict the target’s position SVtgt ( pre) according to SVtgt ( pre) = STI ⋅ IVtgt . In the meantime, we also use
20
X. Xiang
the sensor to measure SVtgt ( mea) as the ground truth. In the end, the average position deviation is 9.77 mm. Note that the diameter of the target’s tip itself is 5.00 mm. Accuracy of M TS : In Fig. 3 (e), 4 red round points are benchmarks and blue square point is for testing. We combine M ∑ bmk = M TS ⋅S ∑ bmk and M Vtest = M TS ⋅ S Vtest to predict M
Vtest , while ground truth is manually measured. Average deviation is 2.89 mm.
Accuracy of S Ttime : 1) Rigid or affine transformation: In Fig. 3 (c), we handle rigid/affine transformation in one case: not only distorting the carton but also changing its position/orientation. On the surface, we label 4 markers and 1 target. From S ∑'mkr = S Tlinear ⋅S ∑ mkr and S
V 'tgt = S Tlinear ⋅ S Vtgt , we calculate S V 'tgt . Its average position deviation is 4.43 mm.
2) Non-rigid transformation: In Fig. 3 (d), we use a balloon to verify our nonrigid mapping model’s prediction accuracy. At first, we label 12 markers and 4 targets on the surface and measure all’s coordinates using the sensor; after that, we let some air out, so the balloon becomes smaller; at last, all’s coordinates are measured again. For model building, we first use Λ ⋅ B (mkr ) = A(mkr ' ) to calculate Λ and then use Λ ⋅ B (tgt ) to predict the 4 targets’ position. At last, the average deviation is 3.12 mm. 4.4 Operation Accuracy As we can see from Fig. 2, our manipulator has 6 degrees of freedom (DOF): XYZ axis for positioning, rotation and latitude angle for orientation and depth control for needle insertion. Each DOF’s accuracy is listed in Table 3. Table 3. The Manipulator’s Positioning Accuracy ( μm ) Position Accuracy Backlash
X 13.9 ± 39.5 450 ± 49.7
Y 51.8 ± 39.7 270 ± 13.9
Z 12.4 ± 9.7 317 ± 36.3
Latitude 16.6 ± 15.7 645 ± 30.8
Rotation 32.0 ± 6.7 499 ± 36.7
Depth 6.26 ± 5.2 2430
± 1460
4.5 System Accuracy No deformation. We still use the phantom test to verify the system accuracy when there’s no deformation. Then, we can predicted needle movement through M
ΔVndl =
M
−1 ∑ bmk ⋅ S ∑ bmk [ SVtgt ( pre) − SVndl ]
where SVtgt ( pre) has been calculated in Sec. 4.3. Next, we measure the actual distance between the inserted needle’s tip and the target. The position deviation is 10.14 mm. Note again that the diameter of the target’s tip itself is 5.00 mm. Non-rigid deformation: This experiment is conducted on a grape fruit (Fig. 3 (a)), with a seed is chosen as the target. Dehydration in an appropriate degree results in contraction, one type of non-rigid deformation. According to Eqn. 15
A Localization Framework under Non-rigid Deformation for Robotic Surgery
21
Fig. 6. Automatic insertion. (a) Initial state. (b)(c)(d) X/Y/Z movement. (e)(f) Inserting. A video record is provided at http://www.jdl.ac.cn/user/xxiang/regdef/ . Also released are details of image processing and data computing.
Fig. 7. Grape fruit’s MR images. Left: before deformation and insertion. Right: after them. M
−1 −1 I Δ V ndl = M ∑ bmk ⋅ S ∑ bmk [ Λ ⋅ B ( S ∑ mkr ⋅ I ∑ mkr ⋅ Vtgt ) − S V ndl ]
At first, we solve Λ ⋅ B (mkr ) = A(mkr ' ) for Λ to set up the non-rigid mapping model. −1 Then, we obtain S Σ mkr , I Σ mkr and I Vtgt to calculate Λ ⋅ B(tgt ) , Next, M Σ bmk , S Σ bmk , Σ ndl are used to calculate ΔVndl . Fig. 6 displays the needle insertion process; Fig. 7 compares the MR image before the deformation and insertion with the image after them. Finally, the average position deviation is 5.17 mm. S
M
5 Conclusion MRI-guided surgery navigation has been well-recognized, but is still far from the clinical practice. There remain many issues to be discussed, such as target localization, path planning, real-time performance, device compatibility, etc. In this paper, we have presented an accurate and flexible localization framework robust to rigid, affine and non-rigid transformations. Also displayed is our MRI-compatible needle-insertion robot. With real-time MRI scanner, this framework enables real-time navigation and operation. In the future, we will further develop the robot under this framework to make it usable for clinical applications.
22
X. Xiang
Acknowledgements. The author sincerely thanks Ichiro Sakuma, Etsuko Kobayashi, Hongen Liao, Deddy Nur Zaman for valuable guidance, CMSI@UTokyo Program for the arrangement, JSPS for the GCOE scholarship, and Xilin Chen for his support.
References 1. Dimaio, S., Salcudean, S.: Needle Insertion Modelling and Simulation. In: Proc. IEEE ICRA (2002) 2. Yamauchi, Y., Ohta, Y., Dohi, T., Kawamura, H., Tanikawa, T., Isekim, H.: A Needle Insertion Manipulator for X-ray CT Image Guided Neurosurgery. J. of LST 5-4, 814–821 (1993) 3. Masamune, K., Kobayashi, E., Masutani, Y., Suzuki, M., Dohi, T., Iseki, H., Takakura, K.: Development of an MRI-compatible Needle Insertion Manipulator for Stereotactic Neurosurgery. J. Image Guid. Surg. 1(4), 242–248 (1995) 4. Chinzei, K., Hata, N., Jolesz, F., Kikinis, R.: MR Compatible Surgical Assist Robot: System Integration and Preliminary Feasibility Study. In: Delp, S.L., DiGoia, A.M., Jaramaz, B. (eds.) MICCAI 2000. LNCS, vol. 1935, pp. 921–930. Springer, Heidelberg (2000) 5. Fischer, G., Iordachita, I., Csoma, C., Tokuda, J., Mewes, P., Tempany, C., Hata, N., Fichtinger, G.: Pneumatically Operated MRI-Compatible Needle Placement Robot for Prostate Interventions. In: Proc. IEEE ICRA (2008) 6. Lester, H.: A Survey of Hierarchical Non-linear Medical Image Registration. Pattern Recognition 32(1), 129–149 (1999) 7. Hajnal, J., Hawkes, D., Hill, D.: Medical Image Registration. CRC Press, Boca Raton (2001) 8. Rueckert, D., Sonoda, L., Hayes, C., Hill, D., Leach, M., Hawkes, D.: Nonrigid Registration Using Free-form Deformations: Applications to Breast MR Images. IEEE T-MI 18(8), 712–721 (1999) 9. Schnabel, J., et al.: A General Framework for Non-rigid Registration Based on Nonuniform Multi-level Free-form Deformations. In: Niessen, W.J., Viergever, M.A. (eds.) MICCAI 2001. LNCS, vol. 2208, p. 573. Springer, Heidelberg (2001) 10. Chui, H., Rangarajan, A.: A New Point Matching Algorithm for Non-rigid Registration. CVIU 89(2-3), 114–141 (2003) 11. Zheng, B., Takamatsu, J., Ikeuchi, K.: An Adaptive and Stable Method for Fitting Implicit Polynomial Curves and Surfaces. IEEE T-PAMI 32(3), 561–568 (2010) 12. Bookstein, F.: Principal Warps: Thin-plate Splines and the Decomposition of Deformations. IEEE T-PAMI 11, 567–585 (1989) 13. Donato, G., Belongie, S.: Approximate Thin Plate Spline Mappings. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part III. LNCS, vol. 2352, pp. 21–31. Springer, Heidelberg (2002) 14. Medioni, G., Yasumoto, Y.: Corner Detection and Curve Representation Using Cubic BSplines. In: Proc. IEEE ICRA (1986) 15. Rohr, K., et al.: Landmark-based Elastic Registration Using Approximating Thin-plate Splines. IEEE T-MI 20(6), 526–534 (2001) 16. Canny, J.: A Computational Approach to Edge Detection. IEEE T-PAMI 8(6), 679–698 (1986) 17. Harris, C., et al.: A Combined Corner and Edge Detection. In: Proc. 4th AVC, pp. 147–151 (1988) 18. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. IJCV 60, 91–110 (2004) 19. Bay, H., Tuytelaars, T., Van Gool, L.: Speeded-Up Robust Features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 404–417. Springer, Heidelberg (2006) 20. Matas, J., Chum, O., Urba, M., Pajdla, T.: Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. In: Proc. BMVC (2002)
Global Image Registration by Fast Random Projection

Hayato Itoh 1, Shuang Lu 1, Tomoya Sakai 2, and Atsushi Imiya 3

1 School of Advanced Integration Science, Chiba University, Yayoicho 1-33, Inage-ku, Chiba, 263-8522, Japan
2 Department of Computer and Information Sciences, Nagasaki University, Bunkyo-cho, Nagasaki, Japan
3 Institute of Media and Information Technology, Chiba University, Yayoicho 1-33, Inage-ku, Chiba, 263-8522, Japan
Abstract. In this paper, we develop a fast global registration method that uses random projection to reduce the dimensionality of images. By generating many transformed images from the reference, nearest-neighbour-based image registration detects, among the generated transformations, the one that establishes the best matching. To reduce the computational cost of the nearest neighbour search without significant loss of accuracy, we first use random projection. To reduce the computational complexity of the random projection, we then use a spectrum-spreading technique and circular convolution.
1 Introduction
In this paper, we develop a fast global-image-registration algorithm using fast random projection [1,2,3]. Image registration overlays two or more template images, that is, images of the same scene observed at different times, from different viewpoints, and/or by different sensors, on a reference image. In general, there are translation, rotation, scaling, and shearing spatial relationships between the reference image and the template images. Therefore, image registration is a process that estimates the geometric transformations which map all or most points of the template images to points of the reference image. To estimate these spatial transformations, various methods have been developed [4,5]. Image registration methods are generally classified into local image registration and global image registration. For the global alignment of images, a linear transformation

$x' = A x + b$,  (1)

which minimises the criterion

$D(f, g) = \int_{\mathbb{R}^2} |f(x) - g(x')|\, dx$,  (2)

for a reference image $f(x)$ and a template image $g(x)$, is used to relate the two images.
Fig. 1. Affine Transformation and Flow of Global Registration

For sampled affine transforms $\{A_i\}_{i=1}^m$ and translation vectors $\{b_j\}_{j=1}^m$, we generate transformed reference images such that

$g^{ij} = g(A_i x + b_j), \quad i, j = 1, 2, \cdots, m.$  (3)
Then, the minimisation of the criterion in eq. (2) is achieved by computing

$\tilde{R} = \min_{ij} \sum_{p,q=1}^{N} |f_{pq} - g^{ij}_{pq}|,$  (4)

where $f_{pq}$ and $g^{ij}_{pq}$ are the discrete versions of $f$ and $g^{ij}$. We can use the nearest neighbour search (NNS) to find the minimiser

$g^{ij}_* = \arg\min_{ij} \sum_{p,q=1}^{N} |f_{pq} - g^{ij}_{pq}|.$  (5)
Figure 1 shows the procedure of global registration. The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a computational cost of O(N d), where N and d are the cardinality of the set of points in a metric space and the dimensionality of the metric space, respectively. In this paper, using the efficient random projection, we introduce a method to reduce the computational cost of this primitive of NNS-based image registration.
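As a concrete illustration of this brute-force search, the following sketch scores a grid of candidate rotations of the template against the reference using the L1 criterion of eq. (4). It is a minimal sketch, not the authors' implementation: the function name, the restriction to rotations, and the use of `scipy.ndimage.rotate` are our own assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def naive_nns_registration(reference, template, angles):
    """Brute-force NNS over candidate rotation angles (illustrative sketch).

    For each candidate angle the template is rotated and the L1 distance to
    the reference, as in eq. (4), is computed; the angle with the smallest
    distance and that distance are returned.
    """
    best_angle, best_cost = None, np.inf
    for theta in angles:
        candidate = rotate(template, theta, reshape=False, mode='nearest')
        cost = np.abs(reference - candidate).sum()   # sum_{p,q} |f_pq - g_pq|
        if cost < best_cost:
            best_angle, best_cost = theta, cost
    return best_angle, best_cost
```

The cost of this loop grows linearly with the number of candidate transformations and with the number of pixels, which is exactly the O(Nd) behaviour that the random projection below is meant to reduce.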
2 Random Projection
Setting $E_D(A)$ to be the expectation of the event $A$ on the domain $D$, for $k \ll n$, from $u \in \mathbb{R}^n$ we compute the vector $v \in \mathbb{R}^k$ such that

$v = \sqrt{\frac{n}{k}}\, R^{\top} u, \qquad E_{\mathbb{R}^k}(|v|^2) = E_{\mathbb{R}^n}(|u|^2),$  (6)
where $R$ is a uniform random orthonormal matrix and the scale $\sqrt{n/k}$ ensures that the expected squared length of $v$ equals the squared length of $u$. The transform of eq. (6) is called the random projection [6]. The random projection approximately preserves pairwise distances with high probability. The Johnson-Lindenstrauss lemma describes this geometric property.

Lemma 1 (Johnson-Lindenstrauss) [7]. For any $1/2 > \epsilon > 0$ and any set of points $X$ in $\mathbb{R}^n$, setting $k \geq \frac{4}{\epsilon^2/2 - \epsilon^3/3} \log n$, every pair $u$ and $v$ satisfies the relation

$(1 - \epsilon)\,|u - v|^2 \leq |f(u) - f(v)|^2 \leq (1 + \epsilon)\,|u - v|^2$  (7)

with probability at least $1/2$, where $f(u)$ and $f(v)$ are the projections of $u$ and $v$, respectively.
If the projection matrix $R$ is a random orthonormal matrix, that is, each entry of the random matrix is selected independently from the standard normal distribution $N(0, 1)$ with mean 0 and variance 1, we have the next lemma [8].

Lemma 2. Let each entry of an $n \times k$ matrix $R$ be chosen independently from $N(0,1)$. Let $v = \frac{1}{\sqrt{k}} R^{\top} u$ for $u \in \mathbb{R}^n$. For any $\epsilon > 0$, (1) $E(|v|^2) = |u|^2$; (2) $P\big(\big|\,|v|^2 - |u|^2\,\big| \geq \epsilon |u|^2\big) < 2 e^{-(\epsilon^2 - \epsilon^3) k / 4}$.

Figure 2 shows the distance-preserving property of the random projection from $\mathbb{R}^3$ to $\mathbb{R}^2$. Two points $P_1$ and $P_2$ in 3-dimensional space correspond to points $P_1$ and $P_2$, respectively, in 2-dimensional space after random projection. Using these lemmas, we can compress images using the random projection. By expressing an $(n \times n)$-pixel image as an $(n \times n)$ matrix $A = (a_1, a_2, \ldots, a_n)$, where $a_i \in \mathbb{R}^n$, we first transform $A$ to the vector

$u = (a_1^{\top}, a_2^{\top}, \cdots, a_n^{\top})^{\top} \in \mathbb{R}^{n^2}.$  (8)
Fig. 2. Geometry of random projection. (a) Random projection from 3D space to 2D space. (b) Two points P1 and P2 in 3-dimensional space correspond to points P1 and P2, respectively, in 2-dimensional space after random projection.
Using a random projection matrix $R \in \mathbb{R}^{k^2 \times n^2}$ whose entries are drawn from $N(0, 1)$, we compute

$\frac{1}{k}\, R^{\top} u = v = (v_1, v_2, \cdots, v_{k^2})^{\top},$  (9)

which satisfies the relation of Lemma 2, and transform $v$ to

$B = \begin{pmatrix} v_1 & v_{k+1} & \cdots & v_{k(k-1)+1} \\ v_2 & v_{k+2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ v_k & v_{2k} & \cdots & v_{k^2} \end{pmatrix} \in \mathbb{R}^{k \times k}.$  (10)
Then, we obtain the compressed image $B$ from the image $A$. Setting $\hat{f}$ and $\hat{g}^{ij}$ to be the random projections of the reference and template images, respectively, and

$\hat{g}^{ij}_* = \arg\min_{ij} \sum_{p,q=1}^{N} |\hat{f}_{pq} - \hat{g}^{ij}_{pq}|,$  (11)

these lemmas imply the relation

$P\big(|g^{ij}_* - \hat{g}^{ij}_*| \leq \varepsilon\big) > 1 - \delta,$  (12)

for small positive numbers $\varepsilon$ and $\delta$, where $P(A)$ is the probability of the event $A$. Using this property, we first reduce the computational cost of the NNS for the global registration of images.
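The following sketch illustrates the image compression step described above: an image is vectorised as in eq. (8) and projected with a Gaussian random matrix scaled as in Lemma 2. The function names and the choice of NumPy's random generator are our own; this is an illustration of the general technique, not the authors' code.

```python
import numpy as np

def random_projection_matrix(k, d, seed=None):
    """k x d matrix with i.i.d. N(0,1) entries, scaled by 1/sqrt(k) so the
    expected squared norm of the projection equals that of the input
    (Lemma 2 with v = R^T u / sqrt(k))."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((k, d)) / np.sqrt(k)

def project_image(image, R):
    """Vectorise an (n x n) image as in eq. (8) and project it to R^k."""
    u = image.reshape(-1).astype(float)
    return R @ u

# Usage: distances are approximately preserved after projection.
# n, k = 64, 512
# R = random_projection_matrix(k, n * n, seed=0)
# f, g = np.random.rand(n, n), np.random.rand(n, n)
# d_full = np.linalg.norm(f - g)
# d_proj = np.linalg.norm(project_image(f, R) - project_image(g, R))
```

After projection, the NNS of eq. (11) is carried out on the k-dimensional vectors instead of the full images, which is where the cost reduction comes from.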
3 Efficient Random Projection
Setting $w = (w_1, \ldots, w_k)^{\top}$ to be an independent stochastic vector such that $E_{\mathbb{R}^k}[w] = 0$, we have the relation $E_{\mathbb{R}^{k \times k}}[w w^{\top}] = \gamma^2 I$. Furthermore, setting the $(i-1)$-times cyclic shift of $w$ to be

$c_i = (w_i, \ldots, w_k, w_1, \ldots, w_{i-1})^{\top},$  (13)

we define the matrix

$C = \begin{pmatrix} c_1^{\top} \\ c_2^{\top} \\ \vdots \\ c_k^{\top} \end{pmatrix} = \begin{pmatrix} w_1 & w_2 & \cdots & w_{k-1} & w_k \\ w_2 & w_3 & \cdots & w_k & w_1 \\ \vdots & & \ddots & & \vdots \\ w_k & w_1 & \cdots & w_{k-2} & w_{k-1} \end{pmatrix}.$  (14)

Since

$E_R(c_i^{\top} c_j) = \begin{cases} \gamma^2 k & (i = j) \\ 0 & (i \neq j), \end{cases}$  (15)
we have the relation

$E_R[|y|_2^2] = E_R\Big[\sum_{i=1}^{k} (c_i^{\top} x)^2\Big] = \sum_{i=1}^{k} x^{\top} E_{\mathbb{R}^{k \times k}}[c_i c_i^{\top}]\, x = \sum_{i=1}^{k} x^{\top} \gamma^2 I\, x = k \gamma^2 |x|^2.$  (16)

We set $\gamma = 1/\sqrt{k}$. Setting $s = (s_1, \ldots, s_k)^{\top}$ to be an independent stochastic vector such that $E_R[s] = 0$ and $E_{\mathbb{R}^{k \times k}}[s s^{\top}] = \sigma^2 I$, a dense vector $\zeta$ is computed as

$\zeta = [S] x$  (17)

from a sparse vector $x$, where $[S]$ is the diagonal matrix whose diagonal entries are $\{s_i\}_{i=1}^{k}$ for $k \leq n$. Then, we compute

$y = C [S] x$  (18)

from the sparse vector $x$. The expectation of the norm is

$E_{\mathbb{R}^k}[|y|_2^2] = k \gamma^2 \sigma^2 |x|_2^2.$  (19)

To preserve $E[|y|_2^2] = |x|_2^2$, we set $\gamma = 1/\sqrt{k}$ and $\sigma = 1$. Since $C\eta$ is achieved by the cyclic convolution between $c_i$ and $\eta$, we can compute $C[S]x$ using the fast Fourier transform. These algebraic properties yield the next theorem.

Theorem 1. The vector $x$ is projected to the vector $y$ using $O(d)$ memory area and $O(nd \log d)$ calculation time.
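The sketch below shows one common way to realise the spectrum-spreading/circular-convolution idea with the FFT: random signs spread the spectrum, a circular convolution with a random vector mixes the coordinates, and a subset of coordinates is kept. The exact construction in [2] (for example, how the k output coordinates are selected and scaled) may differ; this is only an assumption-laden illustration.

```python
import numpy as np

def fast_random_projection(x, k, seed=None):
    """Projection by spectrum spreading and circular convolution (sketch).

    1. Spread the spectrum of x with random +/-1 signs (the diagonal [S]).
    2. Circularly convolve with a random Gaussian vector via the FFT
       (multiplication by a circulant matrix C).
    3. Keep the first k coordinates, scaled to preserve the expected norm.
    """
    rng = np.random.default_rng(seed)
    n = x.size
    s = rng.choice([-1.0, 1.0], size=n)          # spectrum-spreading signs
    w = rng.standard_normal(n)                   # generator of the circulant C
    y_full = np.fft.ifft(np.fft.fft(w) * np.fft.fft(s * x)).real
    return y_full[:k] / np.sqrt(k)
```

The FFT-based convolution is what replaces the explicit matrix-vector product and gives the logarithmic factor in Theorem 1.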
4 Numerical Examples
Using the efficient random projection proposed in the previous section [3], we propose a fast algorithm for global image registration. The flow chart is shown in Fig. 3. We evaluated the performance of our method using the data set of [9]. We first evaluate the numerical accuracy of the algorithm assuming that the transforms are rotations around the centroid of the images. In this case we can set $b_j = 0$. We use a reference image A0 in Fig. 4 and seven images for each of the template sequences A, B, C, and D in Figs. 4 and 5. We have evaluated the rotation angle and the overlapping ratio $r = \min \tilde{R}/M$, where $M = \sum_{p,q=1}^{n} |f_{pq}|$, for each result. Table 1 shows the results. The results show that the method detects rotation angles with a sufficient overlapping ratio. Next, we show the computational cost of the method according to the dimension of the original images. Figures 6(a) and 6(b) show the relation between the accuracy of the estimated parameters and the dimension of the projected images, and the computation time of the NNS after the efficient random projection. These graphs show that, in an appropriate dimension, the algorithm preserves both speed of computation and accuracy of the results.
Fig. 3. Algorithm Flow (preparation by centering of the reference and template images, affine transformations of the reference, random projection of all images, and nearest neighbor search on the differences of image energy)
Table 1. Numerical Result for Rotation Images. $\tilde{R} = \min_{ij} \sum_{p,q=1}^{N} |f_{pq} - g^{ij}_{pq}|$ and $M = \sum_{p,q=1}^{N} |f_{pq}|$.

  Ground truth angle:              0          5        10        15        30        60       180
  Reference A0, Template A:
    estimated angle                0          5        10        15        30        60       180
    min R/M                    0.0007688   0.0130    0.0128    0.0129    0.0125    0.0125   0.0007688
  Reference A0, Template B:
    estimated angle                0          5        10        15        30        60       180
    min R/M                     0.2008     0.1973    0.1971    0.1972    0.1973    0.1972    0.2008
  Reference A0, Template C:
    estimated angle              359          4        10        15        30        60       180
    min R/M                     0.2694     0.2669    0.2682    0.2682    0.2682    0.2682    0.2694
  Reference A0, Template D:
    estimated angle                2          7         9        14        29        59       183
    min R/M                     0.4794     0.4760    0.4800    0.4798    0.4800    0.4796    0.4825
Fig. 4. Test Images 1: (a)-(g) A0, A5, A10, A15, A30, A60, A180; (h)-(n) B0, B5, B10, B15, B30, B60, B180.
Fig. 5. Test Images 2: (a)-(g) C0, C5, C10, C15, C30, C60, C180; (h)-(n) D0, D5, D10, D15, D30, D60, D180.
Finally, we show the registration results under various dimensionality reductions. Figure 7 shows the results of registration for various projected dimensions, and eq. (20) shows the corresponding estimated affine transformations. The estimated parameters in matrix form are

$A_{16384} = \begin{pmatrix} 0.8572 & -0.5150 & 0 \\ 0.5150 & 0.8572 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad A_{8192} = \begin{pmatrix} 0.8660 & -0.5000 & 0 \\ 0.5000 & 0.8660 & 0 \\ 0 & 0 & 1 \end{pmatrix},$
Fig. 6. Computational Costs and Accuracy of the Method. (a) Relationship between dimension and accuracy. (b) Relationship between dimension and the computation time of the NNS.
$A_{4096} = \begin{pmatrix} 0.8746 & -0.4848 & 0 \\ 0.4848 & 0.8746 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad A_{2048} = \begin{pmatrix} 0.8572 & -0.5150 & 0 \\ 0.5150 & 0.8572 & 0 \\ 0 & 0 & 1 \end{pmatrix},$

$A_{1024} = \begin{pmatrix} 0.8829 & -0.4695 & 0 \\ 0.4695 & 0.8829 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad A_{512} = \begin{pmatrix} 0.8746 & -0.4848 & 0 \\ 0.4848 & 0.8746 & 0 \\ 0 & 0 & 1 \end{pmatrix},$

$A_{32} = \begin{pmatrix} 0.9063 & -0.4226 & 0 \\ 0.4226 & 0.9063 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad A_{16} = \begin{pmatrix} -0.9063 & 0.3432 & 0 \\ -0.3432 & -0.9063 & 0 \\ 0 & 0 & 1 \end{pmatrix},$  (20)

where

$A_k = \begin{pmatrix} A_i & b_j \\ \mathbf{o}^{\top} & 1 \end{pmatrix}.$  (21)
Therefore, in these experiments $b_j = 0$ and $P(|A_i^{\top} A - I|_2 < \varepsilon) = 1 - \delta$ for the Frobenius norm $|A|_2$ of the matrix $A$. Table 2 lists the estimated angles for various dimensional random-projected images.

Table 2. Estimated Angles for Various Dimensional Images

  Dimension        16384  8192  4096  2048  1024  512  32   16
  Ground truth        30    30    30    30    30   30  30   30
  Estimated angle     31    30    29    31    28   29  25  200
These results show that if the dimension of the projected subspace is too small, the registration fails and the estimated affine parameters may be inaccurate. Furthermore, if the dimension of the projected subspace is not too small, the computational cost of global registration can be reduced by the efficient random projection without significant loss of accuracy. The complexities of the naive and fast random projection methods are shown in Table 3. These relations show that, in total, the method reduces the computational time of global registration with random projection.
Fig. 7. Result Images for Various Dimensionality Reduction. Images D0 and D30 are used: (a) reference D0; (b) template D30; (c)-(i) results for projections to 16384, 8192, 4096, 1024, 512, 32, and 16 dimensions.

Table 3. Theoretical Comparison of Computational Complexities. (a) memory area for the random matrix. (b) memory area for the random vectors. Columns: computation time of projection, memory of operation, memory of projected data, pre-process of the NNS, NNS. naive: O(nd log n), O(d log n), (a) kn, n; fast: O(nd log d), O(d), (b) kn, O(n log n), O(log n); ratio (naive : fast): log n : log d, d log n : d, 1 : 1, 0 : n log n, n : log n.
5 Conclusions
In this paper, using a fast random projection, we developed an efficient algorithm for global image registration. We introduced spectrum spreading and circular convolution to reduce the computational cost of the random projection. After establishing global registration, we apply elastic registration by solving the minimisation problem¹

$J(u, v) = \sum_{p,q=1}^{N} \Big( |f_{pq} - [\,g^{ij}_*\,]_{(p+u,\, q+v)}|^2 + \lambda\, [Q(u, v)]_{pq} \Big)$  (22)

for an appropriate prior $Q$, where $[T f_{mn}]_{pq}$ expresses the resampling of $f_{mn}$ after applying the transformation $T$ to $f_{pq}$. Extensions of the method to range images and 3D volumetric images are straightforward.

¹ This is a discrete expression of variational image registration, which is achieved by minimising the criterion $J(u) = \int_{\mathbb{R}^2} \{ (f(x) - g(x + u))^2 + \lambda Q(u) \}\, dx.$

Acknowledgement. This research was supported by "Computational anatomy for computer-aided diagnosis and therapy: Frontiers of medical image sciences", funded by a Grant-in-Aid for Scientific Research on Innovative Areas, MEXT, Japan, Grants-in-Aid for Scientific Research funded by the Japan Society for the Promotion of Science, and a Grant-in-Aid for Young Scientists (A), NEXT, Japan.
References 1. Healy, D.M., Rohde, G.K.: Fast global image registration using random projection. In: Proc. Biomedical Imaging: From Nano to Macro, pp. 476–479 (2007) 2. Sakai, T.: An efficient algorithm of random projection by spectrum spreading and circular convolution, Inner Report IMIT Chiba University (2009) 3. Sakai, T., Imiya, A.: Practical algorithms of spectral clustering: Toward large-scale vision-based motion analysis. In: Wang, L., Zhao, G., Cheng, L., Pietik¨ ainen, M. (eds.) Machine Learning for Vision-Based Motion Analysis Theory and Techniques. Advances in Pattern Recognition. Springer, Heidelberg (2011) 4. Zitov´ a, B., Flusser, J.: Image registration methods: A Survey. Image Vision and Computing 21, 977–1000 (2003) 5. Modersitzki, J.: Numerical Methods for Image Registration. In: CUP (2004) 6. Vempala, S.S.: The Random Projection Method. DIMACS, vol. 65 (2004) 7. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics 26, 189–206 (1984) 8. Frankl, P., Maehara, H.: The Johnson-Lindenstrauss lemma and the sphericity of some graphs. Journal of Combinatorial Theory Series A 44, 355–362 (1987) 9. van Ginneken, B., Stegmann, M.B., Loog, M.: Segmentation of anatomical structures in chest radiographs using supervised methods: a comparative study on a public database. Medical Image Analysis 10, 19–40 (2006)
EM-Type Algorithms for Image Reconstruction with Background Emission and Poisson Noise Ming Yan Department of Mathematics, University of California, Los Angeles 520 Portola Plaza, Los Angeles, CA 90095-1555, United States
[email protected]
Abstract. Obtaining high quality images is very important in many areas of applied sciences. In this paper, we propose general robust expectation maximization (EM)-Type algorithms for image reconstruction when the measured data is corrupted by Poisson noise. The method is separated into two steps: EM and regularization. In order to overcome the contrast reduction introduced by some regularizations, we suggest EM-Type algorithms with Bregman iteration by applying a sequence of modified EM-Type algorithms. The numerical experiments show the effectiveness of these methods in different applications. Keywords: background emission, expectation maximization, image reconstruction, Poisson noise.
1 Introduction
Obtaining high quality images is very important in many areas of applied sciences, such as medical imaging, optical microscopy and astronomy. For some applications, such as positron emission tomography (PET) and computed tomography (CT), analytical methods for image reconstruction are available. For instance, filtered back projection (FBP) is the most commonly used method for image reconstruction from CT by manufacturers of commercial imaging equipment [1]. However, it is sensitive to noise and suffers from streak artifacts (star artifacts). An alternative to analytical reconstruction is the use of iterative reconstruction techniques, which are quite different from FBP. The main advantages of iterative reconstruction techniques over FBP are insensitivity to noise and flexibility [2]. The data can be collected over any set of lines, the projections do not have to be distributed uniformly in angle, and the projections can even be incomplete (limited angle). We will focus on the iterative reconstruction technique. Image reconstruction in the applications mentioned above can be formulated as a linear inverse and ill-posed problem,

$y = A x + b + n.$  (1)
This work was supported by the Center for Domain-Specific Computing (CDSC) under the NSF Expeditions in Computing Award CCF-0926127.
Here, $y$ is the measured data (a vector in $\mathbb{R}^M$ in the discrete case), $A$ is a compact operator (a matrix in $\mathbb{R}^{M \times N}$ in the discrete case), which differs between applications, $x$ is the desired exact image (a vector in $\mathbb{R}^N$ in the discrete case), $b$ is the background, and $n$ is the noise. We will consider only the case with background emission ($b \neq 0$) in this paper. In astronomy, this is due to sky emission [3,4]. In fluorescence microscopy, it is due to auto-fluorescence and reflections of the excitation light. Computing $x$ directly by finding the inverse of $A$ is not reasonable because (1) is ill-posed and $n$ is unknown. Therefore regularization techniques are needed to solve these problems efficiently. One powerful technique for applying regularization is the Bayesian model, and a general Bayesian model for image reconstruction was proposed by Geman and Geman [5], and Grenander [6]. The idea is to use a priori information about the image $x$ to be reconstructed. In the Bayesian approach, we assume that the measured data $y$ is a realization of a multi-valued random variable, denoted by $Y$, and the image $x$ is also considered as a realization of another multi-valued random variable, denoted by $X$. Therefore the Bayes formula gives us

$p_X(x|y) = \frac{p_Y(y|x)\, p_X(x)}{p_Y(y)}.$  (2)
This is the conditional probability of having $X = x$ given that $y$ is the measured data. After inserting the detected value of $y$, we obtain a posterior probability distribution of $X$. Then we can find $x^*$ such that $p_X(x|y)$ is maximized, as the maximum a posteriori (MAP) likelihood estimate. In general, $X$ is assigned as a Gibbs random field, which is a random variable with the following probability distribution:

$p_X(x) \sim e^{-\beta J(x)},$  (3)
where $J(x)$ is a given energy functional and $\beta$ is a positive parameter. There are many different choices for $J(x)$, depending on the application. Some examples are quadratic penalization $J(x) = \|x\|_2^2/2$ [7,8], quadratic Laplacian $J(x) = \|\nabla x\|_2^2/2$ [9], total variation (TV) $J(x) = \||\nabla x|\|_1$ [10,11,12], and Good's roughness penalization $J(x) = \||\nabla x|^2/x\|_1$ [13]. If the random variable $Y$ of the detected values $y$ follows a Poisson distribution [14] with an expectation value provided by $Ax + b$, we have

$y_i \sim \mathrm{Poisson}\{(Ax + b)_i\}, \quad \text{i.e.} \quad p_Y(y|x) \sim \prod_i \frac{(Ax + b)_i^{y_i}}{y_i!}\, e^{-(Ax + b)_i}.$  (4)
By minimizing the negative log-likelihood function $(-\log p_X(x|y))$, we obtain the following optimization problem:

$\min_{x \geq 0} \; \sum_i \big( (Ax + b)_i - y_i \log(Ax + b)_i \big) + \beta J(x).$  (5)
In this paper, we will focus on solving (5). It is easy to see that the objective function in (5) is convex if J(x) is convex. Additionally, with suitably chosen
regularization J(x), the objective function is strictly convex, and the solution to this problem is unique. The work is organized as follows. In Section 2, we give a short introduction to expectation maximization (EM), or Richardson-Lucy, used in image reconstruction with background emission. In Section 3, we propose a general EM-Type algorithm for image reconstruction when the measured data is corrupted by Poisson noise. This is based on maximum a posteriori likelihood estimation and EM. However, for some regularizations such as TV, the reconstructed image will lose contrast; therefore, an EM-Type algorithm with Bregman iteration is introduced in Section 4. Some numerical experiments are given in Section 5 to show the efficiency of the EM-Type algorithms. We end this work with a short conclusion section.
2 Expectation Maximization (EM)
A maximum likelihood (ML) method for image reconstruction based on Poisson data was introduced by Shepp and Vardi [14] in 1982 for applications in emission tomography. In fact, this algorithm was originally proposed by Richardson [15] in 1972 and Lucy [16] in 1974 for astronomy. Here, we consider the special case without a regularization term, i.e. $J(x)$ is a constant, so we do not have any a priori information about the image. From equation (4), for given measured data $y$, we have a function of $x$, the likelihood of $x$, defined by $p_Y(y|x)$. Then an ML estimate of the unknown image is defined as any maximizer $x^*$ of $p_Y(y|x)$. By taking the negative log-likelihood, one obtains, up to an additive constant,

$f_0(x) = \sum_i \big( (Ax + b)_i - y_i \log(Ax + b)_i \big),$  (6)

and the problem is to minimize this function $f_0(x)$ on the nonnegative orthant, because we have the constraint that the image $x$ is nonnegative. In fact, we have

$f(x) = D_{KL}(Ax + b, y) \equiv \sum_i \Big( y_i \log \frac{y_i}{(Ax + b)_i} + (Ax + b)_i - y_i \Big) = f_0(x) + C,$

where $D_{KL}(Ax + b, y)$ is the Kullback-Leibler (KL) divergence of $Ax + b$ from $y$, and $C$ is a constant independent of $x$. The KL divergence is considered as a data-fidelity function for Poisson data. It is convex, nonnegative and coercive on the nonnegative orthant, so the minimizers exist and are global. The well-known EM algorithm, or Richardson-Lucy algorithm, is

$x_j^{k+1} = x_j^k \, \frac{\sum_i a_{ij}\, \frac{y_i}{(Ax^k + b)_i}}{\sum_i a_{ij}}.$  (7)

Shepp and Vardi showed in [14] that when $b = 0$, this is an example of the EM algorithm proposed by Dempster, Laird and Rubin [17].
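For concreteness, the update (7) can be written in a few lines with dense NumPy arrays; the function name and the dense matrix representation of A are our own choices for illustration (in practice A is often applied matrix-free or sparse).

```python
import numpy as np

def em_update(x, y, A, b):
    """One ML-EM (Richardson-Lucy) iteration with background, as in eq. (7).

    A is an M x N system matrix, y the measured data, b the background,
    x the current nonnegative image estimate.
    """
    denom = A @ x + b                      # (A x^k + b)_i
    back_proj = A.T @ (y / denom)          # sum_i a_ij * y_i / (A x^k + b)_i
    sensitivity = A.T @ np.ones_like(y)    # sum_i a_ij
    return x * back_proj / sensitivity
```

The multiplicative form automatically preserves nonnegativity of the iterates, which is why no explicit projection onto the constraint set is needed.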
3 EM-Type Algorithms for Image Reconstruction
The method shown in the last section is also called maximum-likelihood expectation-maximization (ML-EM) reconstruction, because it is a maximum likelihood approach without any Bayesian assumption on the images. If additional a priori information about the image is given, we have the maximum a posteriori probability (MAP) approach [18,19], which is the case with a regularization term $J(x)$. Again we assume here that the detected data is corrupted by Poisson noise, and the regularization problem is

$\min_x \; E^p(x) \equiv \beta J(x) + \sum_i \big( (Ax + b)_i - y_i \log(Ax + b)_i \big), \quad \text{subject to } x_j \geq 0, \; j = 1, \cdots, N.$  (8)

This is a convex constrained optimization problem and we can find the optimal solution by solving the Karush-Kuhn-Tucker (KKT) conditions:

$\beta\, \partial J(x)_j + \sum_i a_{ij} \Big( 1 - \frac{y_i}{(Ax + b)_i} \Big) - s_j = 0, \quad j = 1, \cdots, N,$

$s_j \geq 0, \quad x_j \geq 0, \quad j = 1, \cdots, N,$

$s^{\top} x = 0.$
Here $s_j$ is the Lagrange multiplier corresponding to the constraint $x_j \geq 0$. By the nonnegativity of $\{x_j\}$ and $\{s_j\}$ and the complementary slackness condition $s^{\top} x = 0$, we have $s_j x_j = 0$ for every $j = 1, \cdots, N$. Thus we obtain
$\beta x_j\, \partial J(x)_j + \sum_i a_{ij} \Big( 1 - \frac{y_i}{(Ax + b)_i} \Big) x_j = 0, \quad j = 1, \cdots, N,$

or equivalently

$\frac{\beta}{\sum_i a_{ij}}\, x_j\, \partial J(x)_j + x_j - x_j\, \frac{\sum_i a_{ij}\, \frac{y_i}{(Ax + b)_i}}{\sum_i a_{ij}} = 0, \quad j = 1, \cdots, N.$

Notice that the last term on the left-hand side is an EM step (7). After plugging the EM step into the KKT condition [20], we obtain

$\frac{\beta}{\sum_i a_{ij}}\, x_j\, \partial J(x)_j + x_j - x_j^{EM} = 0, \quad j = 1, \cdots, N,$

which is the optimality condition for the following optimization problem:

$\min_x \; E_1^p(x, x^{EM}) \equiv \beta J(x) + \sum_j \Big( \sum_i a_{ij} \Big) \big( x_j - x_j^{EM} \log x_j \big).$  (9)
Therefore we propose the general EM-Type algorithm in Algorithm 1. If $J(x)$ is constant, the second step is just $x^k = x^{k-\frac{1}{2}}$ and this is exactly the ML-EM in
the last section. For the case where $J(x)$ is not constant, we have to solve an optimization problem at each iteration. In general, the problem cannot be solved analytically, and we have to use iterative methods to solve it. In practice, we do not have to solve it exactly: we can stop after a few iterations, and the algorithm still converges.
Input: Given $x^0$;
for k = 1 : IterMax do
    $x^{k-\frac{1}{2}} = EM(x^{k-1})$ using (7);
    $x^k = \arg\min_x E_1^p(x, x^{k-\frac{1}{2}})$ by solving (9);
end
Algorithm 1. Proposed EM-Type algorithm
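As an illustration of one full iteration of Algorithm 1, the sketch below uses the quadratic penalty J(x) = ||x||^2/2, for which the regularization half-step (9) reduces to a per-pixel quadratic equation with a closed-form nonnegative root. This choice is ours purely for illustration; for TV or other penalties (as used in the paper's experiments) the second step would need an inner iterative solver instead.

```python
import numpy as np

def em_type_iteration(x, y, A, b, beta):
    """One EM-Type iteration (Algorithm 1), shown with J(x) = ||x||^2 / 2.

    With this quadratic penalty, the optimality condition of (9) becomes
    beta*x_j^2 + c_j*x_j - c_j*x_em_j = 0 with c_j = sum_i a_ij, solved
    pixelwise by the positive root of the quadratic formula.
    """
    # EM half-step, eq. (7)
    c = A.T @ np.ones_like(y)                       # sum_i a_ij
    x_em = x * (A.T @ (y / (A @ x + b))) / c
    # Regularization half-step (closed form for the quadratic penalty)
    x_new = (-c + np.sqrt(c**2 + 4.0 * beta * c * x_em)) / (2.0 * beta)
    return x_new
```

When beta tends to zero the regularization half-step returns x_em itself, recovering the plain ML-EM update, which mirrors the remark above about constant J(x).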
4 EM-Type Algorithms with Bregman Iteration
In the previous section, EM-Type algorithms were presented to solve problem (8). However, the regularization may lead to reconstructed images suffering from contrast reduction [21]. Therefore, we suggest a contrast improvement in EM-Type algorithms by Bregman iteration, which was introduced in [22,23,24]. An iterative refinement is obtained from a sequence of modified EM-Type algorithms. For the problem with Poisson noise, we start with the basic EM-Type algorithm, i.e. finding the minimizer $x^1$ of (8). After that, variational problems with a modified regularization term,

$x^{k+1} = \arg\min_x \; \beta \big( J(x) - \langle p^k, x \rangle \big) + \sum_i \big( (Ax + b)_i - y_i \log(Ax + b)_i \big),$  (10)

where $p^k \in \partial J(x^k)$, are solved sequentially. From the optimality of (10), we have the following formula for updating $p^{k+1}$ from $p^k$ and $x^{k+1}$:

$p^{k+1} = p^k - \frac{1}{\beta}\, A^{\top} \Big( 1 - \frac{y}{A x^{k+1} + b} \Big).$  (11)

Therefore the algorithm with Bregman iteration is described in Algorithm 2.
5 Numerical Experiments
In this section, we will illustrate the proposed EM-Type algorithms for image reconstruction (more specifically, image deblurring). The regularization we used is total variation (TV) regularization, and we present some deblurring results with the proposed EM-TV algorithm and the Bregman version of it. The first image used is a synthetic 200×200 phantom. It consists of circles with intensities
Input: Given $x^0$, $\delta$, $\epsilon$, $k = 1$ and $p^0 = 0$;
while $k \leq$ Num outer & $D_{KL}(Ax^{k-1} + b, y) < \delta$ do
    $x^{temp,0} = x^{k-1}$; $l = 0$;
    while $l \leq$ Num inner & $\|x^{temp,l} - x^{temp,l-1}\| \leq \epsilon$ do
        $l = l + 1$;
        $x^{temp,l-\frac{1}{2}} = EM(x^{temp,l-1})$ using (7);
        $x^{temp,l} = \arg\min_x E_1^p(x, x^{temp,l-\frac{1}{2}})$ with $J(x) - \langle p^{k-1}, x \rangle$;
    end
    $x^k = x^{temp,l}$;
    $p^k = p^{k-1} - \frac{1}{\beta} A^{\top} \Big( 1 - \frac{y}{A x^k + b} \Big)$;
    $k = k + 1$;
end
Algorithm 2. Proposed EM-Type algorithm with Bregman iteration
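The outer structure of Algorithm 2 can be sketched as below. This is a simplified illustration, not the paper's exact code: `solve_inner(x_init, p)` is assumed to run a few EM-Type sweeps on the modified problem (10) with penalty beta*(J(x) - <p, x>), and the KL-based stopping test is only one possible discrepancy-style rule.

```python
import numpy as np

def bregman_em(x0, y, A, b, beta, solve_inner, n_outer=30, delta=None):
    """Outer Bregman loop in the spirit of Algorithm 2 (sketch).

    solve_inner : callable (x_init, p) -> x, an approximate solver of (10).
    """
    x, p = x0.copy(), np.zeros_like(x0)
    for _ in range(n_outer):
        x = solve_inner(x, p)
        # subgradient update, eq. (11)
        p = p - (A.T @ (1.0 - y / (A @ x + b))) / beta
        if delta is not None:
            ax_b = A @ x + b
            kl = np.sum(np.where(y > 0, y * np.log(y / ax_b), 0.0) + ax_b - y)
            if kl < delta:          # discrepancy-style stopping rule
                break
    return x
```

Each outer iteration adds back a portion of the residual through the subgradient p, which is what restores the contrast lost by the regularization.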
Fig. 1. (a) The result without Bregman iteration. (b) The result with 25 Bregman iterations. (c) The result with 100 Bregman iterations. (d) The plot of RMSE versus Bregman iterations. (e) The lineouts of original image, blurred image, the results with and without Bregman iterations. Some parameters chosen are β = 0.001, Num inner = 100 and Num outer = 100.
Fig. 2. (a) The original image. (b) The PSF image. (c) The blurred image. (d) The noisy blurred image. (e) Initial guess from CG. (f) The result of EM-Type algorithm with Bregman iteration. (g) The plot of KL versus Bregman iteration. (h) The RMSE versus Bregman iteration. Some parameters chosen are β = 1, Num inner = 200 and Num outer = 30.
65, 110 and 170, enclosed by a square frame of intensity 10. For the experiment, we choose the background b = 20. Firstly, we consider the case without noise. The blurred image is obtained from original image using a Gaussian blur kernel K with standard deviation σ = 100. To illustrate the advantage of Bregman iterations, we show the results in Figure 1. The RMSE for 1(a), 1(b) and 1(c) are 11.9039, 5.0944 and 2.5339, respectively, and the corresponding KL distances DKL (Ax + b, y) are 93.0227, 0.8607 and 0.0172, respectively.
(Panel annotations in Fig. 3: top row 30 μm, 21 μm, 13 μm, 10 μm; middle row 30.0 μm, 21.6 μm, 12.4 μm, 9 μm; bottom row 30.0 μm, 21.6 μm, NA, NA.)
Fig. 3. Top row shows raw lensfree fluorescent images of different pairs of particles. The distances between the two particles are 30 μm, 21 μm, 13 μm and 9 μm, from left to right. Middle row shows the results of the EM-Type algorithm with p = 0.5. Bottom row shows the results for the EM (or Richardson-Lucy) method.
How to choose a good β is important for Algorithm 1, but not for Algorithm 2 with Bregman iteration. For this example, though β is not chosen to be optimal for Algorithm 1, the result of the Bregman iteration shows that we can still obtain a good result after some iterations. From the lineouts we can see that the result with Bregman iteration fits the original image very well. The next experiment is to perform deconvolution on an image of a satellite (Figure 2(a)); the point spread function (PSF) is shown in Figure 2(b). In order to make the algorithm fast, we choose the initial guess x0 to be the result of solving Ax = y − b using conjugate gradient (CG). The negative values are changed to zero before applying the EM-TV algorithm. The corresponding RMSE for x0 and the result are 13.6379 and 11.8127, respectively. By using EM-TV with Bregman iteration, we get a better image with sharp edges and the artifacts removed. Convergence of EM-Type algorithms can be shown if J(x) is convex; when J(x) is not convex, the functional may have several local minima and the algorithm will converge to one of them. For the last experiment
(Figure 3), we try to separate sparse objects in lensfree fluorescent imaging [25] using the EM-Type algorithm with a non-convex J(x). The EM (or Richardson-Lucy) method separates the particles when the distance is large (30 μm and 21 μm), but it cannot separate them when they are close to each other (13 μm and 9 μm in this experiment). However, we can choose J(x) = Σ_j |x_j|^p for p ∈ (0, 1), and the two particles can then be separated even when the distance is very small. For the numerical experiment, the top row shows the lensfree raw images. As the distance between the particles becomes smaller, their signatures become indistinguishable to the naked eye. The PSF is measured using small-diameter fluorescent particles that are imaged at a low concentration. We choose the same number of iterations for the EM-lp and EM methods, and the results show that with p = 0.5 we obtain better results (the particles are separated).
6 Conclusion
In this paper, we proposed general robust EM-Type algorithms for image reconstruction when the measured data is corrupted by Poisson noise: iteratively performing EM and regularization in the image domain. If J(x) is convex, the algorithm converges to the global minimum of the objective functional. However, some regularizations such as total variation lead to contrast reduction in the reconstructed images. Therefore, in order to improve the contrast, we suggested EM-Type algorithms with Bregman iteration by applying a sequence of modified EM-Type algorithms. To illustrate the effectiveness of these algorithms, we first chose total variation (TV) regularization, and the EM-TV algorithm with Bregman iteration was applied to image deblurring. The results show the performance of these algorithms. Also, when J(x) is non-convex, the algorithm still converges to a local minimum, and the numerical example in lensfree fluorescent imaging shows better results with lp regularization (p < 1) than without regularization in separating two particles.
References 1. Shepp, L., Logan, B.: The Fourier reconstruction of a head section. IEEE Transaction on Nuclear Science 21, 21–34 (1974) 2. Kak, A., Slaney, M.: Principles of Computerized Tomographic Imaging. Society of Industrial and Applied Mathematics, Philadelphia (2001) 3. Brune, C., Sawatzky, A., Wubbeling, F., Kosters, T., Burger, M.: An analytical view on EM-TV based methods for inverse problems with Poisson noise. Preprint, University of M¨ unster (2009) 4. Politte, D.G., Snyder, D.L.: Corrections for accidental coincidences and attenuation in maximum-likelihood image reconstruction for positron-emission tomography. IEEE Transaction on Medical Imaging 10, 82–89 (1991) 5. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741 (1984) 6. Grenander, U.: Tutorial in pattern theory. Lecture Notes Volume, Division of Applied Mathematics. Brown University (1984)
7. Conchello, J.A., McNally, J.G.: Fast regularization technique for expectation maximization algorithm for optical sectioning microscopy. In: Proceeding of SPIE Symposium on Electronic Imaging Science and Technology, vol. 2655, pp. 199–208 (1996) 8. Markham, J., Conchello, J.A.: Fast maximum-likelihood image-restoration algorithms for three-dimensional fluorescence microscopy. Journal of the Optical Society America A 18, 1052–1071 (2001) 9. Zhu, D., Razaz, M., Lee, R.: Adaptive penalty likelihood for reconstruction of multi-dimensional confocal microscopy images. Computerized Medical Imaging and Graphics 29, 319–331 (2005) 10. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D 60, 259–268 (1992) 11. Dey, N., Blanc-Feraud, L., Zimmer, C., Roux, P., Kam, Z., Olivo-Marin, J.C., Zerubia, J.: Richardson-Lucy algorithm with total variation regularization for 3D confocal microscope deconvolution. Microscopy Research and Technique 69, 260– 266 (2006) 12. Yan, M., Vese, L.A.: Expectation maximization and total variation based model for computed tomography reconstruction from undersampled data. In: Proceeding of SPIE Medical Imaging: Physics of Medical Imaging, vol. 7961, p. 79612X (2011) 13. Joshi, S., Miller, M.I.: Maximum a posteriori estimation with Good’s roughness for optical sectioning microscopy. Journal of the Optical Society of America A 10, 1078–1085 (1993) 14. Shepp, L., Vardi, Y.: Maximum likelihood reconstruction for emission tomography. IEEE Transaction on Medical Imaging 1, 113–122 (1982) 15. Richardson, W.H.: Bayesian-based iterative method of image restoration. Journal of the Optical Society America 62, 55–59 (1972) 16. Lucy, L.B.: An iterative technique for the rectification of observed distributions. Astronomical Journal 79, 745–754 (1974) 17. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B 39, 1–38 (1977) 18. Hurwitz, H.: Entropy reduction in Bayesian analysis of measurements. Physics Review A 12, 698–706 (1975) 19. Levitan, E., Herman, G.T.: A maximum a posteriori probability expectation maximization algorithm for image reconstruction in emission tomography. IEEE Transactions on Medial Imaging 6, 185–192 (1987) 20. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004) 21. Meyer, Y.: Oscillating Patterns in Image Processing and in some Nonlinear Evolution Equations. American Mathematical Society, Providence (2001) 22. Bregman, L.: The relaxation method for finding common points of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics 7, 200–217 (1967) 23. Osher, S., Mao, Y., Dong, B., Yin, W.: Fast linearized Bregman iteration for compressed sensing and sparse denoising. Communications in Mathematical Sciences 8, 93–111 (2010) 24. Yin, W., Osher, S., Goldfarb, D., Darbon, J.: Bregman iterative algorithms for l1- minimization with applications to compressed sensing. Journal on Imaging Sciences 1, 143–168 (2008) 25. Coskun, A.F., Sencan, I., Su, T.W., Ozcan, A.: Lensless wide field fluorescent imaging on a chip using compressive decoding of sparse objects. Optics Express 18, 10510–10523 (2010)
Region-Based Segmentation of Parasites for High-throughput Screening

Asher Moody-Davis 1, Laurent Mennillo 2, and Rahul Singh 1,*

1 Department of Computer Science, San Francisco State University, San Francisco, CA 94132
2 Universite de la Mediterranee Aix-Marseille II, 13007 Marseille, France
[email protected]
Abstract. This paper proposes a novel method for segmenting microscope images of schistosomiasis. Schistosomiasis is a parasitic disease with a global impact second only to malaria. Automated analysis of the parasite's reaction to drug therapy enables high-throughput drug discovery. These reactions take the form of phenotypic changes that are currently evaluated manually via a researcher viewing the video and assigning phenotypes. The proposed method is capable of handling the unique challenges of this task, including the complex set of morphological, appearance-based, motion-based, and behavioral changes of parasites caused by putative drug therapy. This approach adapts a region-based segmentation algorithm designed to quickly identify the background of an image. This modified implementation along with morphological post-processing provides accurate and efficient segmentation results. The results of this algorithm improve the correctness of automated phenotyping and provide promise for high-throughput drug screening.
1 Introduction Schistosomiasis is one of the seven neglected tropical diseases as defined by the Centers for Disease Control [1]. Diseases in this category affect over one billion people and occur in developing nations. In particular, schistosomiasis is largely found in African countries where infection rates are reported above 50% [2] and in sections of East Asia and the subcontinent. The World Health Organization published in 1999 that all of the reported deaths and 99.8% of those disabled by the disease were considered low or middle income [3]. This disease is caused by several parasites of the genus Schistosoma. Freshwater snails are used as vector by the parasites. Once the larvae leave the snail they can infect anyone who comes in contact with the contaminated water. The parasites infect the host by penetrating the skin and traveling through the blood stream. Schistosomiasis targets the liver and bladder. If not treated over a period of time studies have shown an increased chance of bladder cancer [4]. Large pharmaceutical companies do not invest in finding cures for these diseases because there is little or no profit to be gained. One of the barriers to rapid drug *
Corresponding author
discovery is the time it takes to analyze the complex reactions of the parasites to various drugs. High-throughput screening (HTS) is a recent technique in drug discovery that allows screening a large number of drugs in parallel against a target. In the given problem context HTS plates can contain up to 384 samples, thereby allowing the effect of a number of drug candidates to be tested in parallel. The reaction of the parasite is videotaped using a camera connected to the microscope. Researchers have described six different phenotypic responses to putative drugs resulting in changes in shape, color, texture, and movement [5]. It is highly probably that more phenotypes exist but have not been identified. Currently, researchers review these videos manually to assign phenotypes, and these assignments are used to determine the effectiveness of a specific treatment. Requiring a professional to analyze each individual video sample is a long and tedious process. Implementing a HTS process requires algorithms that can accurately and efficiently identify these changes. Segmentation is the first step in the process to achieving high accuracy. With approximately 50 individuals in a well, parasites tend to touch, causing traditional segmentation methods to often merge them into a single segmented object. Parasites also exhibit complex shapes, textures, and movement. Finally, a proposed method also needs to allow for variability in imaging conditions. We organize the paper as follows: Sec. 2 details the challenges faced when attempting to segment schistosomula in the context of the state of the art in biological image analysis methods. Sec. 3 provides an overview of recent research. This is followed by a brief discussion of the original Active Mask [6] segmentation method in Sec. 4, which is designed for segmentation of human cells, and how the region-based distributing function of this method is modified in the proposed method. Next is an explanation of the implementation of the proposed method. Finally, Sec. 5 contains the results of the implementation and a comparison to other morphology based methods.
2 Problem Characteristics and Challenges Segmentation of parasites requires an understanding of their color, texture, shape, and movement. Each of these comes with possible challenges that need to be handled skillfully in the segmentation process to avoid inaccurate or unusable results. Among others, illumination conditions and image-capture technologies can vary. No assumptions can be made about the quality or composition of parasite data if the intention is to make the solution widely applicable. Traditional thresholding techniques are generally not effective in these situations. Additionally, due to illumination, crowding, or the nature of the substrate on which the parasite lives, the body boundaries can be obscure. Several aspects of the parasite’s movement and shape also cause difficulty in acquiring accurate segmentation results. Unlike in the original implementation of the Active Mask segmentation algorithm [6], a rounded shape cannot be assumed. The assumption of “roundedness” of shape is inherent in many methods developed for biological segmentation. Schistosomes movement are based on elongation and contraction of the musculature. This changes the shape of the body from narrow and straight edged on the sides to a more rounded shape within a single movement cycle. Additionally, parasites contain visible “inner” anatomical structures that complicate segmentation by creating edges that do not correspond to the boundaries of the body.
Lastly, it is important to note that these parasites tend to "stick" together. Thus, video data shows parasites that touch. Once two or more parasites touch, they tend to stay stuck to one another for a significant number of frames. This often results in poor splitting of the parasites, where more than one parasite is considered a single region. The leftmost image in Fig. 1 shows an example of two parasites that might be segmented as a single region. All images in Fig. 1 display the various shapes and textures present in a single video.
Fig. 1. Illustrations of the segmentation challenges: Touching parasites in the leftmost image cause merged regions. Note the "inner" dark structures that cause additional edges inside the body of the parasite. The center image displays the variance of the texture causing the left edge of the body to become unclear. Finally, the parasite's body is shown elongated in the leftmost image and rounded in the rightmost image.
3 Review of Recent Research

While segmentation of parasites has not been deeply investigated, there is a rich literature on segmentation of cells in biological images. An accurate automated segmentation algorithm, as part of a high-throughput system capable of handling large datasets, was shown in [7] to successfully measure 14 phenotypes of approximately 8.3 million human cells. Completing this type of analysis manually is unimaginable. In [8], a dataset generated using a Cellnomics ArrayScan VTi system (Cellnomics, Pittsburgh, PA) for high content screening (HCS) resulted in poor segmentation of over 50% of the SK-BR-3 (breast carcinoma) cells. Researchers found that the HCS system had difficulty handling the splitting and merging of cells, which in turn significantly skewed the analysis of the dimension and shape of the cells. They noted the importance of discovering new techniques for image segmentation but were limited by the inability to integrate new software into the existing HCS system. The watershed segmentation algorithm is known to over-segment images, which led the authors of [9] to implement a derivative called the marker-controlled watershed algorithm. The algorithm was seeded using a combination of fine and coarse filters to approximately identify the cell's center as a starting point for the watershed process. This approach was able to handle touching cells but relied on the elliptical shape of the cells and two defined thresholds based on testing to automate the process. Leukocytes, or white blood cells, were effectively segmented using a shape- and size-constrained active contour approach in [10]. Researchers noted that the approach was only feasible due to a priori knowledge that leukocytes are approximately circular and their size is relatively stable.
In [6] researchers proposed a more sophisticated version of active-contour segmentation called active mask segmentation. This method was created to more accurately segment fluorescence microscope images of human cells. These experiments result in grayscale images with dark backgrounds and bright punctuate patterns that outline the cell and its components. The end result of segmentation is a collection of masks where each mask is the binary representation of a single region in the original image. The following section will briefly describe the overall method and go into more detail regarding the region-based distributing function which is the starting point of our method.
4 Proposed Method

4.1 Active Mask Overview

The Active Mask algorithm [6] incorporates multiscale and multiresolution blocks which encapsulate the region-based and voting-based distributing functions. The original image is padded, the scale is reduced, and the resolution decreased. This smaller low-resolution image is used to apply weights to the collection of masks, using a region-based distributing function to identify the background and then using the voting-based distributing function to split merged regions. The region-based function is applied to the first mask, which is designated the background mask. Higher weights represent pixels identified as background pixels in the first mask. Next, the voting-based distributing function applies weights to each pixel based on the mask assignments of the neighborhood. As a result of region growing, each mask is propagated based on these weights. Subsequently, the weighting functions are iteratively applied at the current scale and resolution until convergence. The scale and resolution are increased and the process starts again until the image returns to its original scale and resolution. In [6], the authors were able to accurately split merged regions using the voting-based distributing function, given that human cells are rounded.

4.2 Region-Based Distributing Function

The region-based distributing function constitutes the first step of the proposed solution for segmenting schistosomula. The purpose of this function is to rapidly identify the background by assigning a higher weight to background pixels. Initially, a lowpass filter is applied to the original image to remove noise and smooth the image. Next, the average border intensity γ is subtracted and the image is multiplied by the harshness of the threshold β. β is inversely proportional to the difference of the average background and foreground intensities and is assigned a higher weight the smaller the difference. The result is asymptotically bounded using a sigmoid function and finally a skewing factor α is applied. This results in pixel values below the average border intensity γ being skewed towards the background because they have a higher weight. Following [6] we define the region-based distributing function R1 as
$R_1 = \alpha\,\psi\big(\beta\,((h_\sigma * I) - \gamma)\big),$  (1)
where α ∈ (−1, 0) is the weight of the region-based distributing function, β = 4/(high − low) is the harshness of the threshold, and γ = (high + low)/2 is the average border
intensity. High and low are set to the average region and background intensity values, respectively. The lowpass filter $h_\sigma$ and the sigmoid function $\psi$ are defined as in [6]: $h_\sigma$ is a Gaussian lowpass filter and $\psi$ is an error-function sigmoid,

$h_\sigma(x) \propto e^{-|x|^2/\sigma^2}, \qquad \psi(x) = \operatorname{erf}\!\Big(\frac{\sqrt{\pi}}{2}\,x\Big).$  (2)
The lowpass filter removes high frequencies from the frequency domain that correspond to details and noise. This results in a blurred or smoothed image. The sigmoid function defines horizontal asymptotes that restrict the possible value range to between ±1.

4.3 Region-Based Distributing Function Adaptation

The region-based distributing function from [6] displayed the ability not only to roughly identify the background but also to accurately locate each parasite in the frame. In studies carried out by us, even though it surpassed all previous methods at locating every parasite, it did not provide accurate segmentation using the suggested parameter values based on the average background and foreground intensities. Tests were conducted by converting the image to grayscale, inverting it to create a dark background, and supplying high and low intensity values based on the converted image. Various values were tested for the initial number of masks, the scale parameter, and the function weights. See [6] for more details regarding these parameters. While the region-based distributing function is not intended to fully segment an image, it is able to efficiently determine the background by weighting the image in the frequency domain. In our approach it is used to over-segment the frame and is followed by post-processing to remove unnecessary regions. This method has been shown to provide efficient and accurate results through the tests reviewed in Sec. 5. We propose a modified version of equation (1):
$R_1 = \psi\big((h_\sigma * I) - \gamma\big).$  (3)
Here we remove α and β, whose weights are not necessary without the voting-based distributing function. Additionally, γ becomes our threshold, representing the distance in intensity between the background and foreground. By adjusting this threshold it is possible to adjust the quality of the results, as can be seen in Fig. 2.
Fig. 2. The results of the region-based distributing function with the boundaries shown in white at the following threshold values: (a) 50; (b) 15; (c) 11
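A minimal sketch of this weighting step is shown below. It assumes the Gaussian lowpass and erf-based sigmoid sketched above, works on the inverted grayscale image, and uses function names of our own choosing; it is an illustration of the idea rather than the exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.special import erf

def region_weight(image, gamma, sigma=2.0):
    """Weight map in the spirit of eq. (3): smooth the (inverted) grayscale
    image with a Gaussian lowpass filter, subtract the threshold gamma, and
    bound the result in (-1, 1) with an erf-based sigmoid."""
    smoothed = gaussian_filter(image.astype(float), sigma=sigma)
    return erf((np.sqrt(np.pi) / 2.0) * (smoothed - gamma))

def candidate_foreground(image, gamma, sigma=2.0):
    """Pixels whose smoothed intensity exceeds gamma are kept as candidate
    parasite (foreground) pixels; the rest are treated as background."""
    return region_weight(image, gamma, sigma) > 0.0
```

Lowering gamma admits pixels closer to the background intensity, which is exactly the over-segmentation behaviour visible across the panels of Fig. 2.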
Although there is over-segmentation, the results also show that most of the parasites are accurately segmented. Additionally, any small erroneous regions can be easily identified, for instance by the fact that their area will be significantly less than the average area of the parasites. As the threshold decreases, the frames are over-segmented because segmentation starts to include regions with intensity values closer to the background intensity. The increase in the number of regions becomes steep, and the maximum increase is identified to be used as the threshold. For the results shown in Fig. 2, the threshold that maximized the slope of the graph was equal to 11 and is noted with an asterisk in Fig. 3, which plots the number of regions in relation to the threshold.
Fig. 3. As the threshold value decreases the number of regions segmented increases significantly due to over-segmentation
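The threshold sweep can be sketched as follows. Counting regions via connected components on a smoothed, thresholded image is our own simplification of the full region-based distributing function, and the default parameter values are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label

def select_threshold(image, start=50, stop=5, step=1, sigma=2.0):
    """Sweep the threshold downwards, count the connected regions produced
    at each value, and return the threshold where the region count jumps
    the most (the steep rise visible in Fig. 3)."""
    smoothed = gaussian_filter(image.astype(float), sigma=sigma)
    thresholds = np.arange(start, stop, -step)
    counts = np.array([label(smoothed > t)[1] for t in thresholds])
    jumps = np.diff(counts)             # increase when lowering the threshold
    return int(thresholds[int(np.argmax(jumps)) + 1])
```

The returned value plays the role of the asterisked threshold in Fig. 3; the segmentation is then run once more at that threshold before post-processing.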
A starting threshold is established by taking the minimum of the average of the difference between the minimum and maximum intensities present in the grayscale image and a cutoff value of 50. The cutoff value was established during testing and is based on the grayscale color range of [0, 255], where 100, or twice the cutoff, is an acceptable starting difference between foreground and background. The region-based distributing function is iteratively applied with a decreasing threshold. The optimal threshold is identified, the region-based distributing function is applied, and the results are refined using the following morphological techniques.

4.4 Morphological Analysis and Processing

The proposed approach takes advantage of the quick identification of the background by the region-based distributing function and then further refines those results through morphological analysis and processing. Our implementation uses morphological techniques to remove noise and improve edge detection based on our prior research [11]. The image is cleaned using closing, filling holes inside of regions, eroding the edges and removing inconsistent regions. Edges that were missed during application of the region-based distributing function create cavities. Closing is performed to aid in filling interior holes by creating the missing edge of the parasite that closes in the
cavity, allowing us to identify and fill holes inside the parasite. Edges are then eroded to compensate for the overestimation of the boundary by the region-based distributing function, as shown in Fig. 2. The last step in the cleaning process is removing inconsistent regions by comparing each region's average intensity value to the overall average. Regions with values outside of one and a half times the standard deviation are considered background and removed. At this point the image still contains regions where multiple touching parasites are considered a single region. The Canny edge detection algorithm is applied to the original image as the first step in identifying edges to separate parasites. The edge image is analyzed to identify edge pixels surrounded by more than one region. The resulting relevant-edges image is subtracted from the black and white segmented image, as shown in Fig. 4(e). This new image, called the final labels image, is processed to remove small regions and fill in any holes created in the edge removal process to produce the final segmentation result. Examples of this process are shown in Fig. 4.
Fig. 4. (a) The inverted image; (b) the results of the region-based distributing function; (c) region-based function results after removing noise; (d) Canny edge detection results; (e) edges subtracted from c; (f) relevant edges; (g) subtraction of relevant edges from c; (h) g after removal of small regions; (i) h after filling holes in regions; (j) the outlines of segmented regions
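A condensed sketch of this post-processing pipeline is given below. It is not the authors' implementation: structuring-element sizes, the minimum area, the use of scikit-image, and the simplification of subtracting all Canny edges (rather than only the "relevant" edges adjacent to more than one region) are our own assumptions.

```python
import numpy as np
from scipy import ndimage
from skimage import feature, morphology

def refine_segmentation(gray, fg_mask, min_area=200):
    """Refine an over-segmented foreground mask: close gaps, fill holes,
    erode the over-estimated boundary, remove edge pixels that separate
    touching parasites, drop small regions, and label the result."""
    mask = morphology.binary_closing(fg_mask, morphology.disk(3))
    mask = ndimage.binary_fill_holes(mask)
    mask = morphology.binary_erosion(mask, morphology.disk(2))

    # edges from the original grayscale image help split merged parasites
    edges = feature.canny(gray.astype(float))
    mask = np.logical_and(mask, np.logical_not(edges))

    # remove small spurious regions and re-fill holes left by edge removal
    mask = morphology.remove_small_objects(mask, min_size=min_area)
    mask = ndimage.binary_fill_holes(mask)
    labels, _ = ndimage.label(mask)
    return labels
```

Each returned label corresponds to one candidate parasite body, which is the input expected by the downstream phenotyping step.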
5 Results and Comparison

To establish the effectiveness of the proposed method, it was compared against our previous implementation using EDISON [12] mean-shift and morphological segmentation [11]. It was also compared to JSEG [13], a color quantization and spatial segmentation technique. The Codebook model [14] provided a comparison against a motion-based segmentation approach. The Codebook model establishes a background codebook using a training set. This codebook is then used to segment an entire video based on motion. Lastly, the original Active Mask implementation was applied and compared. An image was selected and each method's result compared and runtime measured. The results are shown in Fig. 5.
Fig. 5. (a) The original 1384 x 1024 image prior to segmentation and the following segmentation methods: (b) mean-shift and morphology-based segmentation, (c) JSEG segmentation, (d) Codebook model segmentation at 25% scale of original image, (e) Active Mask segmentation at 50% scale, and (f) the proposed region-based morphological segmentation
The mean-shift and morphological segmentation provides decent results with a reasonable runtime of approximately 26.28 seconds. It has difficulties with the uneven illumination in the lower right-hand corner and tends to merge touching parasites. The JSEG segmentation suffers similar problems in addition to missing several parasites with a slower runtime at 104.80 seconds. The Codebook model produces poor results due to the limited movement captured in the image with a runtime after codebook creation of approximately 91.16 seconds (processing the image at 50% of the original scale). The Active Mask implementation provides reasonable results but looking at the identified regions in Fig. 6 shows the algorithm did not successfully split and merge regions. In addition, the image had to be reduced in size by 50% to avoid running out of memory during processing and the segmentation runtime was slowest at 888.86 seconds. Finally, the results of the proposed method show improved identification and splitting of regions with a reasonable runtime of approximately 15.96 seconds.
Fig. 6. The results of the Active Mask method highlighting merged and split regions using varying grayscale to identify unique regions
Table 1 provides a breakdown of the runtimes and accuracy for each method. Recall measures the ratio of pixels correctly identified as foreground to the total number of hand-segmented foreground pixels. Precision measures the ratio of pixels correctly identified as foreground to the total number of pixels identified as foreground during the segmentation process.

Table 1. Each method was used to segment a set of 5 images (1384 x 1024) and the accuracy and runtime recorded. The processing times were evaluated on an Intel(R) Core(TM)2 Duo CPU T9300 @ 2.50GHz with 4.00GB RAM. Codebook generation is the training period for the Codebook method. Threshold Est. is the time taken to establish the threshold from the first image in our proposed method. The threshold and Codebook are used to segment subsequent images with the approximate segmentation runtime as shown under Segmentation.

Method             Image Process         Scale   Avg. Runtime   Recall   Precision
Mean-Shift Based   Segmentation          100%    26.28s         92.25%   69.98%
JSEG               Segmentation          100%    104.80s        85.04%   87.51%
Codebook           Codebook Generation   50%     6246.35s       N/A      N/A
Codebook           Segmentation          50%     91.16s         31.83%   88.13%
Active Mask        Segmentation          50%     888.86s        98.85%   67.98%
Proposed Method    Threshold Est.        100%    128.19s        N/A      N/A
Proposed Method    Segmentation          100%    15.96s         89.54%   97.70%
A set of five images were hand segmented by us to compare against the results of the proposed method. These images serve as the ground truth with which we establish the accuracy of our method. The comparison against the ground truth is visualized in Fig. 7. Fig. 7 shows the background in black and the parasites in their grayscale representation. False positives and false negatives are captured using white and dark gray respectively. Over the set of five images each pixel was evaluated to establish percentages of false negatives and false positives. The region-based morphological segmentation resulted in 0.28% false positives and 1.39% false negatives over five images. The segmentation results correctly identified 98.33% of the pixels as either background or foreground.
Fig. 7. First image taken from a control sample illustrates the accuracy of the proposed segmentation algorithm. Areas in black represent the background, white regions represent false positives (areas designated foreground that are actually background), dark gray areas represent false negatives (areas designated background that are actually foreground), and the correctly segmented parasite bodies are shown in the original grayscale
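For reference, the pixel-wise accuracy figures reported above can be computed from a predicted mask and a hand-segmented ground-truth mask as in the following sketch; it simply implements the stated definitions of recall, precision, and the false positive/negative percentages and is not code from the paper.

```python
import numpy as np

def pixel_accuracy(pred, truth):
    """Pixel-wise recall, precision, and error rates for binary foreground
    masks (pred: segmentation result, truth: hand-segmented ground truth)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    total = pred.size
    return {
        "recall": tp / truth.sum(),          # correctly found foreground / all ground-truth foreground
        "precision": tp / pred.sum(),        # correctly found foreground / all predicted foreground
        "false_positive_rate": fp / total,   # e.g., 0.28% over the five test images
        "false_negative_rate": fn / total,   # e.g., 1.39% over the five test images
        "overall_accuracy": (total - fp - fn) / total,
    }
```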
6 Conclusion

The promise of the region-based morphological algorithm for segmentation of parasite microscope images is evident in the results shown here. The algorithm is able to efficiently and accurately segment while handling complex shapes and textures. The algorithm outperformed our original mean-shift based method, JSEG, the Codebook model, and the original Active Mask method in both segmentation accuracy and runtime. In addition to performance, our algorithm makes no assumptions about the data itself, making it more amenable to different data sets. Providing quality segmentation as part of a system that automates drug discovery for schistosomiasis has a worldwide impact. A freely available system can be used by researchers all over the world, which speeds drug discovery and in turn promotes better health.

Acknowledgements. The authors thank Conor Caffrey and Brian Suzuki for generating the data and many discussions on the biological aspects of the problem. This research was funded in part by the NIH through grant 1R01A1089896-01, NSF through grant IIS-0644418 (CAREER), and the California State University Program for Education and Research in Biotechnology (CSUPERB) and Sandler Center Joint Venture Grant.
References
1. CDC: Neglected Tropical Diseases, http://www.cdc.gov/parasites/ntd.html
2. Utzinger, J., et al.: From innovation to application: Social-ecological context, diagnostics, drugs and integrated control of schistosomiasis. Acta Trop. (2010), doi:10.1016/j.actatropica.2010.08.020
3. WHO: The World Health Report 1999. Making a Difference. World Health Organization, Geneva (1999) 4. CDC: Schistosomiasis, http://www.cdc.gov/parasites/schistosomiasis/ 5. Abdulla, M.H., et al.: Drug discovery for schistosomiasis: hit and lead compounds identified in a library of known drugs by medium-throughput phenotypic screening. PLos Negl. Trop. Dis. 3(7), e478 (2009) 6. Srinivasa, G., Fickus, M.C., Guo, Y., Linstedt, A.D., Kovacevic, J.: Active Mask Segmentation of Fluorescence Microscope Images. IEEE Transactions on Image Processing 18(8), 1817–1829 (2009) 7. Jones, T.R., Carpenter, A.E., Lamprecht, M.R., Moffat, J., Silver, S.J., Grenier, J.K., Castoreno, A.B., Eggert, U.S., Root, D.E., Golland, P., Sabatini, D.M.: Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning. Proc. Natl. Acad. Sci. USA 106, 1826–1831 (2009) 8. Hill, A.A., LaPan, P., Li, Y., Haney, S.: Impact of image segmentation on high-content screening data quality for SK-BR-3 cells. BMC Bioinformatics 8, 340 (2007) 9. Yang, X., Li, H., Zhou, X.: Nuclei Segmentation Using Marker-Controlled Watershed, Tracking Using Mean-Shift, and Kalman Filter in Time-Lapse Microscopy. IEEE Trans. on Circuits and Systems 53(11), 2405–2414 (2006) 10. Ray, N., Acton, S.T., Ley, K.: Tracking leukocytes in vivo with shape and size constrained active contours. IEEE Transactions on Medical Imaging 21, 1222–1235 (2002) 11. Singh, R., Pittas, M., Heskia, I., Xu, F., McKerrow, J., Caffrey, C.R.: Automated ImageBased Phenotypic Screening for High-Throughput Drug Discovery. In: 22nd IEEE International Symposium on Computer-Based Medical Systems, CBMS 2009, pp. 1–8 (August 2009) 12. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 603–619 (2002) 13. Deng, Y., Manjunath, B.S.: JSEG – Segmentation of Color-Texture Regions in Images and Video. UCSB Vision Research Lab (1999), http://vision.ece.ucsb.edu/segmentation/jseg/ 14. Kim, K., et al.: Real-time foreground-background segmentation using codebook model. Real-Time Imaging 11(3), 172–185 (2005)
Adaptive Coded Aperture Photography

Oliver Bimber¹, Haroon Qureshi¹, Anselm Grundhöfer², Max Grosse³, and Daniel Danch¹

¹ Johannes Kepler University Linz, {firstname.lastname}@jku.at
² Disney Research Zurich, [email protected]
³ Bauhaus-University Weimar, {max.grosse}@uni-weimar.de
Abstract. We show how the intrinsically performed JPEG compression of many digital still cameras leaves margin for deriving and applying image-adapted coded apertures that support retention of the most important frequencies after compression. These coded apertures, together with subsequently applied image processing, enable a higher light throughput than corresponding circular apertures, while preserving adjusted focus, depth of field, and bokeh. Higher light throughput leads to proportionally higher signal-to-noise ratios and reduced compression noise, or, alternatively, to lower shutter times. We explain how adaptive coded apertures can be computed quickly, how they can be applied in lenses by using binary spatial light modulators, and how a resulting coded bokeh can be transformed into a common radial one.
1 Introduction and Motivation
Many digital still cameras (compact cameras and interchangeable lens cameras) apply on-device JPEG compression. JPEG compression is lossy because it attenuates high image frequencies or even rounds them to zero. This is done by first transforming each 8 × 8 pixel image block to the frequency domain via discrete cosine transformation (DCT), and then quantizing the amplitudes of the frequency components by dividing them by component-specific constant values that are standardized in 8 × 8 quantization matrices. The compression ratio can be controlled by selecting a quality factor q which is used to derive the coefficients of the quantization matrices. Visible compression artifacts (called compression noise) commonly result from quantization errors because of too low quality factors – in particular, due to the presence of sensor noise. Therefore, sensor noise is usually filtered before compression to reduce such artifacts. The aperture diameter in camera lenses controls the depth of field and the light throughput. For common 3D scenes, narrow apertures support a large depth of field and a broad band of imaged frequencies, but suffer from low light throughput. Wide apertures are limited to a shallow depth of field with an energy shift towards low frequencies, but benefit from high light throughput.
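For illustration, the block-wise DCT quantization described above can be sketched as below. The luminance quantization table is the standard JPEG example table, and the scaling of that table by the quality factor q follows the common IJG convention; both are assumptions about a typical encoder, not details taken from any particular camera.

```python
import numpy as np
from scipy.fftpack import dct, idct

# Standard JPEG example luminance quantization matrix.
Q50 = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]], dtype=np.float64)

def jpeg_block_roundtrip(block, q=90):
    """Apply the lossy DCT/quantization round trip to one 8x8 pixel block;
    high-frequency coefficients are divided by large table entries and often
    round to zero (IJG-style quality scaling assumed)."""
    scale = 5000.0 / q if q < 50 else 200.0 - 2 * q
    Q = np.clip(np.floor((Q50 * scale + 50) / 100), 1, 255)
    coeffs = dct(dct(block - 128.0, axis=0, norm='ortho'), axis=1, norm='ortho')
    quantized = np.round(coeffs / Q)                 # lossy step
    return idct(idct(quantized * Q, axis=0, norm='ortho'),
                axis=1, norm='ortho') + 128.0        # reconstructed block
```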
If JPEG compression is applied intrinsically, then its attenuation of high image frequencies leaves margin for a wider aperture, since these frequencies need not be supported optically. A wider aperture results in a higher light throughput that is proportional either to a higher signal-to-noise ratio (SNR) or to shorter shutter times. It also reduces compression noise because the SNR is increased. Thus, we trade lower compression quality either for higher SNR or for shorter shutter times. Our basic idea is the following: We allow the photographer to adjust focus and depth of field when sampling (i.e., capturing) the scene through a regular (circular) aperture. Then the captured sample image ig is JPEG-compressed to ˆig with the selected quality factor (as would be done by the camera). Using ˆig , we compute an optimized coded aperture that supports the frequencies which remain after JPEG compression and additional frequency masking. This coded aperture is then applied in the camera lens by using a spatial light modulator (SLM), and the scene is re-captured. The resulting image ic is finally JPEGcompressed to ˆic . The computed coded aperture and the subsequently applied image processing ensure that the focus, the depth of field, and the bokeh (i.e., the blurred appearance of out-of-focus regions) contained in ˆig are preserved in ˆic and that ˆic benefits from the higher light throughput of the coded aperture. Sampling a scene with multiple shots is quite common in photography. Bracketing and passive autofocus sampling are classical examples. The two-shot approach proposed should therefore not influence settled habits in photography.
2 Related Work and Contribution
Coded aperture imaging has been applied in astronomy and medical imaging for many years. Recently, it has also been employed in the context of computational photography applications. Spatially coded static [1,2,3,4,5] or dynamic [6,7] apertures have been used, implemented either as binary [1,2,4,5,7], intensity [1], or color [3] masks. The main applications of coded aperture photography are post-exposure refocusing [1,6], defocus deblurring [1,2,5,4,7], depth reconstruction [2,3,6,5], matting [1,3], and light field acquisition [1,6,7]. Static binary masks have also been applied to compensate for projector defocus [8,9]. In [10], concentric ring-mirrors were used to split the aperture into multiple paths that are imaged on the four quadrants of the sensor for depth-of-field editing, synthetic refocusing, and depth-guided deconvolution. Aperture masks are optimized, for example, for given noise models [4], to maximize zero crossings in the Fourier domain [2,5], to maximize Fourier-magnitudes [1,4,7], or are computed from a desired set of optical transfer functions (OTFs) or point-spread functions (PSFs) at different focal depths [11]. An analysis of various masks used for depth estimation is provided in [12]. None of the existing techniques compute and apply aperture masks that are optimized for the actual image content. This requires adapting the masks dynamically to each individual image. Programmable coded apertures that utilize various types of SLMs are becoming increasingly popular. Liquid crystal arrays (LCAs) have been applied to realize dynamic binary
masks for light field acquisition [6], and LCoS panels have been used for light field acquisition and defocus deblurring [7]. Although these approaches allow dynamic exchange of the aperture masks, the individual sub-masks are still precomputed and applied without any knowledge of the image content. Temporally coded apertures without spatial coding (e.g., implemented with high-speed ferroelectric liquid crystal shutters [13] or with conventional electronic shutters [14,15]) have been used for coded exposure imaging to compensate for motion blur. Our approach requires spatially coded aperture masks and is not related to motion deblurring. Adaptive coded apertures have been introduced for enhanced projector defocus compensation [9]. Here, optimal aperture masks are computed in real time for each projected image, taking into account the image content and limitations of the human visual system. It has been shown that adaptive coded apertures always outperform regular (circular) apertures with equivalent f-numbers in terms of light throughput [9]. The main contribution of this paper is to present a first approach to adaptive coded aperture photography. Instead of pre-computing and applying aperture masks that are optimized for content-independent criteria such as noise models, Fourier-magnitudes, Fourier zero-crossings, or given OTFs/PSFs, we compute and apply aperture masks that are optimized for the frequencies which are actually contained in a captured sample image after applying cameraintrinsic JPEG compression and frequency masking. We intend neither to increase depth of field nor to acquire scene depth or light fields. Instead, our masks maximize the light throughput. Increased light throughput leads either to a higher SNR with less compression noise or to shorter shutter times. We additionally present a method for transforming the coded bokeh that results from photographing with a coded aperture into the radial bokeh that has been captured initially with the circular aperture. This is called bokeh transformation. Furthermore, we demonstrate how adaptive coded intensity masks can be approximated with binary SLMs using a combination of pulse width modulation and coded exposure imaging. The supplementary material provides additional details.
3 Proof-of-Concept Prototype
We used a Canon EOS 400D with a Canon EF 100 mm, f/2.8 USM macro lens. The original diaphragm blades of the lens had to be removed to allow insertion of a programmable LCA to realize the adaptive coded aperture. The LCA is an Electronic Assembly EA DOGM132W-5 display with a native resolution of 132x32, and 0.4x0.35 mm2 pixel size. A 3D-printed, opaque plastic cast allows a light-tight and precise integration of the LCA at the aperture plane of the lens. A USB-connected Atmel ATmega88 microcontroller is used to address a matrix of 7 × 7 binary bits. Each individual bit is composed of 3 × 3 LCA pixels. Our prototype is illustrated in figure 1. Since the mechanical diaphragm had to be removed, and to enable a fair assessment of the quality enhancement, regular (i.e., circular) aperture masks were also rasterized and displayed on the LCA
Fig. 1. Camera prototype. A programmable liquid crystal array is integrated at the aperture plane of a consumer compound lens. A mask pattern of 7 × 7 binary bits is addressed by a microcontroller.
for comparison with the coded aperture masks. Thus, the rasterized versions of the two smallest possible circular apertures we can display (with 2% and 10% aperture opening) correspond to a 1-bit square shape and a 5-bit cross shape, respectively (see figure 4 for examples). Although the EOS 400D supports a quality factor of q=90 for on-device JPEG compression, we applied JPEG compression ourselves (implemented after [16]) to consider additional quality factors in our evaluation. Furthermore, we carried out a subsequent de-blocking step (implemented after [17], using the default narrow quantization constraint of =0.3, as explained in [17]).
4 Adaptive Coded Aperture Computation
Our goal is to compute aperture masks that retain the most important frequencies while maximizing light throughput by dropping the less important higher frequencies that are, in any case, strongly attenuated by JPEG compression. Let Ig and Iˆg be the Fourier transforms of the uncompressed and compressed versions of ig respectively. We define frequencies (fx , fy ) to be important if their magnitudes before and after compression are similar, and, therefore, their magnitude ratio Iˆg (fx , fy )/Ig (fx , fy ) ≥ τ . The larger the threshold τ , the more frequencies are dropped and the higher the light throughput gained with the resulting coded aperture. However, choosing an overly large τ results in ringing artifacts after transforming the remaining frequencies back to the spatial domain. We found that τ =0.97 (i.e., a magnitude similarity of 97%) represents a good balance between light throughput gain and image quality. For this threshold, the sum of absolute RGB pixel-differences between the original compressed image ˆig and the remaining important frequencies after thresholding and transformation to the spatial domain was less than 1% for all image contents, camera settings, lighting conditions, and compression-quality factors with which we experimented. We then construct a binary frequency mask (m) of the same resolution as Iˆg (fx , fy ) and set all entries that correspond to retained frequencies to 1, while all
others are set to 0. Using m, we adopt the method explained in [9] to compute an intensity aperture pattern a by minimizing the variance of its Fourier transform for all important frequencies:

$MFa = e, \quad \min_a \|MFa - e\|_2^2$,    (1)
where M is the diagonal matrix containing the binary frequency mask values of m, F is the discrete Fourier transform matrix (i.e., the set of orthogonal Fourier basis functions in its columns), a is the unknown vector of the coded aperture pattern, and e is the vector of all ones. To minimize the variance of the Fourier transform of a for the retained frequencies while maximizing light throughput, this over-constrained system can be solved in a least-squares sense with the additional constraint to minimize $\|a\|_2^2$.¹ As described in [9], this can be solved quickly with the pseudo-inverse:

$a = (MF)^* e = F^* M^* e = F^* M e$,    (2)
where the conjugate-transpose pseudo-inverse matrix F* is constant and can be pre-computed. Thus, for each new image, a simple matrix-vector multiplication of F* with Me is sufficient for computing the optimal coded aperture pattern a. Negative values in a are clipped, and the result is scaled such that the maximum value is 1. Since the LCA used can only display binary aperture patterns, a must be binarized: given the minimum and maximum LCA transmittances, tmin and tmax, values ≥ tmin + (tmax − tmin)/2 are set to 1, others are set to 0. The resulting aperture shape is manually scaled to roughly match the depth of field in ig. Section 5 explains how remaining depth-of-field variations are removed. Results for various capturing situations (image content, compression quality, initial opening of regular aperture, focus, lighting) are presented in figure 2. Compared to regular circular apertures, adaptive coded apertures always achieve a gain in light throughput. Note that their ideal shape is often not round, in order to support optimal coverage of asymmetric spectra. This is not possible with simple circular aperture shapes. The gained light throughput is directly proportional to the coded aperture opening divided by the corresponding regular aperture opening. It is not appropriate to correlate f-numbers in this case, since the coded aperture masks are often irregular intensity patterns that cannot be described by a diameter. If we ignored other noise sources, such as dark noise and read noise, and considered shot noise only, then the gain in SNR would be proportional to the square root of the light throughput gain. Other multi-shot methods, such as averaging, are not an option if only two shots are available. Averaging two images with identical settings reduces noise only by a factor of $\sqrt{2}$ [18]. The examples in figure 2 show that we achieve significantly higher gain factors in all situations. Furthermore, compression noise is also reduced with adaptive coded apertures (especially for low quality factors, as shown in figure 2, top row). Alternatively to increasing light throughput, the shutter time can be decreased proportionally. More results and examples are provided in the supplementary material.

¹ This intrinsically maximizes the light throughput of the aperture: a small squared 2-norm of a (ai ≥ 0) also minimizes the variance of the normalized bit intensities in the spatial domain.
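A minimal sketch of this computation is given below: the frequency mask m is derived from the magnitude ratio between the compressed and uncompressed sample image, and equation (2) reduces to an inverse Fourier transform of that mask. The resizing to the 7 × 7 SLM resolution, the fftshift centering, and the transmittance values used for binarization are illustrative assumptions.

```python
import numpy as np
import cv2

def adaptive_aperture(i_g, i_g_hat, tau=0.97, bits=7, t_min=0.05, t_max=0.30):
    """Compute a binary coded-aperture pattern from the uncompressed sample
    image i_g and its JPEG-compressed version i_g_hat (2D float arrays).
    t_min/t_max stand in for the LCA transmittance range (assumed values)."""
    I_g = np.fft.fft2(i_g)
    I_g_hat = np.fft.fft2(i_g_hat)

    # Binary frequency mask m: keep frequencies whose magnitude ratio
    # survives compression, i.e. |I^_g| / |I_g| >= tau.
    m = (np.abs(I_g_hat) / (np.abs(I_g) + 1e-8)) >= tau

    # Equation (2): a = F*Me, i.e. the inverse DFT of the mask vector.
    a = np.real(np.fft.ifft2(m.astype(np.float64)))
    a = np.fft.fftshift(a)                 # center the pattern
    a = np.clip(a, 0.0, None)
    a /= a.max()                           # scale so that the maximum is 1

    # Downsample to the SLM bit resolution and binarize against the
    # mid-point of the transmittance range, as described above.
    a_bits = cv2.resize(a.astype(np.float32), (bits, bits),
                        interpolation=cv2.INTER_AREA)
    a_bits /= a_bits.max()
    return (a_bits >= t_min + (t_max - t_min) / 2).astype(np.uint8)
```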
5 Bokeh Transformation
An undesired difference between ig and ic (and consequently also between ˆig and ˆic) is the bokeh. The bokeh corresponds to the PSF of lens and aperture. Lenses with regular circular apertures result in a Gaussian PSF with a radial bokeh of defocused points. Coded apertures, however, lead to a bokeh that is associated with the coded aperture pattern and its PSF. To ensure that the bokeh in ˆic matches that in ˆig, we must transform it from the specific coded bokeh into a common radial one. The scale of the bokeh pattern (i.e., its size in the image) depends on the amount of defocus, which is unknown in our cases, since the scene depth is unknown. Note that this transformation also corrects for remaining depth-of-field variations between ig and ic. All results shown in figure 2 have been bokeh-transformed. Capturing an image through an aperture with given PSF can be considered as convolution. For our two cases (regular aperture and coded aperture), this would be (before compression):

$i_g = i \otimes g_s + \eta_g, \quad i_c = i \otimes c_{s'} + \eta_c$,    (3)
where i is a perfectly focused image (scene point), gs and cs' are respectively the PSF kernels of the regular and the coded apertures at scales s and s', and ηg and ηc represent the image noise. The convolution theorem allows a formulation in the frequency domain:

$I_g = I \cdot G_s + \zeta_g, \quad I_c = I \cdot C_{s'} + \zeta_c$.    (4)
In principle, our bokeh transformation can be carried out as follows:

$I_c' = \frac{I_c - \zeta_c}{C_{s'}} \cdot G_s + \zeta_g$,    (5)
where the division represents deconvolution with the coded PSF, and the multiplication represents convolution with the Gaussian PSF. In fact, the scales are entirely unknown since the scene depth is unknown, and equation 5 cannot be applied in a straightforward way. An easy solution is to test for all possible scale pairs (s and s') and to select that which leads to the best matching result when comparing the inverse Fourier transform of Ic' with ig. Note that the optimal scales may vary over the image because defocus is usually not uniform. To find the optimal scale pairs, we carry out the following procedure (see supplementary material for additional details):
Fig. 2. Example results of adaptive coded aperture photography. Top row: decreasing JPEG compression quality factor (q=90,70,50,30). Corresponding images are brightness-matched. Second row: increasing regular aperture opening (2%,10%,27%) – controlling initial light throughput and depth of field. Third row: different focus settings (red arrow indicates the plane in focus). Bottom row: different lighting conditions. The photographs in rows 2-4 are compressed with q=70; the graphs on the right plot the light throughput gain for all quality factors. For rows 3 and 4, the standard opening of the regular aperture was 10%. The aperture patterns applied (before and after binarization) are depicted. The circular shape of the regular aperture is rasterized for the 7 × 7 resolution of the LCA.
First, we register ic and ig using a homography derived from corresponding SIFT features as explained in [19], and match the brightness of ic with the brightness of ig. Then we carry out the bokeh-transformation (i.e., equation 5) N × N times for combinations of N different scales in s and s'. Instead of a simple frequency domain operation, as illustrated in equation 5, we apply iterative Richardson-Lucy deconvolution (with 8 iterations) to consider Poisson-distributed shot noise and Gaussian-distributed read noise. The standard deviation σ for read noise can be obtained from the sensor specs, given that ISO sensitivity (gain) and average operating temperature are known (in our case: σ=0.003 for the EOS 400D, ISO100, and room temperature). We also add shot noise and read noise after convolution. This leads to N × N bokeh-transformed sample images that are close approximations of ig at the correct scales.
Fig. 3. Bokeh transformation applied to images with different focus. The bokeh of the regular aperture (rasterized circular, with 10% aperture opening) and that of the coded aperture are clearly visible at defocused highlights (defocused reflections of the truck). Note the low contrast of the LCA. The bokeh of the coded aperture also leads to artifacts (ripples in the defocused, twinkling forearm of the bowman). After bokeh transformation (bottom row) the bokeh of the regular aperture (rasterized) can be approximated. Corresponding images are brightness-matched for better comparison. The aperture patterns applied (before and after binarization) are depicted.
To find the correct scales for each image patch of size M × M pixels, we simply find the bokeh-transformed sample with the smallest (average) absolute RGB difference in that patch when compared to the same patch in ig . This is repeated for all patches. We chose a patch size of M =3 to avoid noise influence. Note that, since there is no robust correlation between matched scales and scene depth (due to the scale ambiguity, which can be attributed to the limitations discussed above), depth from defocus would be unreliable (although this is not necessary for bokeh transformation). After determining the scale pairs for each image patch, we bokeh-transform ic again (patch by patch) – but this time with
the previously determined scales for each patch. The difference from the first bokeh transformation (which is only used to determine the correct scales) is, that we now omit the step of adding shot noise and read noise after convolution, since we do not intend to reduce artificially the final image quality. Note that adding noise after convolution is necessary for the first bokeh transformation step to match ig and ic as closely as possible in order to find the correct scales. The final bokeh-transformed patches are stitched together to a new image ic , which is JPEG-compressed to ˆic . Figure 3 shows examples of the bokeh transformation. Note that the coded bokeh is transformed into the bokeh of the rasterized circular aperture. If ig could be captured with the original diaphragm of the lens, the reconstructed bokeh would be smoothly radial (see supplementary material). Note again that this is only a limitation of our hardware prototype, as mechanical constraints in the lens housing prevented us from using the original diaphragm and the LCA at the same time.
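Ignoring the iterative Richardson-Lucy step and the explicit noise handling, the core of equation (5) for a single, known scale pair can be sketched with simple frequency-domain operations; the Wiener-style regularization of the deconvolution and the assumption of origin-anchored PSF kernels are simplifications introduced here.

```python
import numpy as np

def bokeh_transform(i_c, coded_psf, gaussian_psf, reg=1e-3):
    """Single-patch, single-scale sketch of equation (5): deconvolve the
    captured image with the coded-aperture PSF, then convolve with the
    circular-aperture (Gaussian) PSF.  Noise terms are omitted and a small
    regularizer `reg` stabilizes the division."""
    h, w = i_c.shape
    C = np.fft.fft2(coded_psf, s=(h, w))       # coded PSF at scale s'
    G = np.fft.fft2(gaussian_psf, s=(h, w))    # Gaussian PSF at scale s
    I_c = np.fft.fft2(i_c)

    deconv = I_c * np.conj(C) / (np.abs(C) ** 2 + reg)   # I_c / C_s'
    return np.real(np.fft.ifft2(deconv * G))             # ... * G_s
```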
6 Intensity Masks
In principle, our adaptive coded apertures are intensity masks. They are only binarized because the applied LCA is binary. This is a limitation of our prototype, as a grayscale LCA with a large pixel footprint (to avoid diffraction-induced artifacts) would have to be custom-manufactured, whereas adequate binary LCAs are off-the-shelf products. Binarization, however, always leads to quantization errors. Although proof-of-concept results can be achieved with our prototype, a much more efficient solution (also with respect to contrast, switching times, and light throughput) would be to replace the LCA with a digital-micro-mirror array (DMA), such as a DMD chip. Switching times of typical DMAs are about 5 μs. Since they natively produce binary patterns, intensities are commonly generated via pulse width modulation (PWM) that is enabled by their fast switching times. Below, we demonstrate that PWM in combination with coded exposure imaging can also be applied to generate intensity masks for adaptive coded apertures. Instead of applying a single aperture pattern during the entire exposure time t, we segment t into n temporal slots $t = \sum_i t_i$, possibly of different durations, and apply different binary aperture patterns in each slot. With s = [t1, ..., tn], we can compute a binary pulse series b(x, y) for each aperture bit at position x, y with intensity a(x, y) by solving

$a(x, y) = \frac{s \cdot b(x, y)^T}{t}$    (6)
for b(x, y). Here, b(x, y) is the vector of binary pulses [b1 , ..., bn ]x,y that, when activated in the corresponding slots [t1 ...tn ], reproduce the desired intensity a(x, y) with a precision that depends on n (n adjusts the desired tonal resolution of the intensity aperture mask). We solve for b(x, y) by sequentially summing the contributions ti /t of selected slots (traversing from the longest to the shortest) while seeking a solution for
Fig. 4. Intensity masks with pulse width modulation and coded exposure imaging. Top row: images captured at exposure slots s = [3.2s, 1.6s, 0.8s, 0.4s, 1/5s, 1/10s, 1/20s, 1/40s] with corresponding binary pulse patterns (bi ). Bottom row: image captured through regular aperture (2% aperture opening), images captured with adaptive coded aperture (before and after bokeh transformation), and close-ups. The bokeh transformation was carried out with the intensity mask pattern.
which this sum approximates a(x, y) with minimum error. For each selected slot, bi is set to 1, while it is set to 0 for unselected slots. Note that this is the same basic principle as applied for PWM in digital light processing (DLP) devices. The image that is integrated during the exposure time t (i.e., for the duration of all exposure slots with the corresponding binary pulse patterns) approximates the image that would be captured with the intensity mask during the same exposure time. Figure 4 presents an example realized with our LCA prototype. The shortest possible exposure time of t=6.4 s is limited by the fastest switching time of the LCA (supporting 1/40 s for the shortest exposure slot in our example), and the desired tonal resolution (n=8 in our example). For the much faster DMAs, however, a minimum exposure time of 1.28 ms is possible (assuming a minimum exposure slot of 5 μs) with the same tonal resolution and the same binary temporal segmentation. Note that because LCA and sensor are not synchronized in our prototype, the contribution of each exposure slot was captured in individual images which were integrated by summing them digitally instead of integrating them directly by the sensor. While this requires a linearized sensor response, the final result can be gamma-corrected to restore the camera’s non-linear transfer function. Because of the successively decreasing SNR in each exposure slot, however, a direct integration by the sensor leads to a higher SNR and should therefore be preferred over digital integration.
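One simple greedy strategy for solving equation (6) per aperture bit is sketched below: traverse the slots from longest to shortest and switch a slot on whenever its contribution does not overshoot the target intensity. The slot durations are those listed in the caption of figure 4; the tolerance constant is an assumption.

```python
import numpy as np

def pulse_series(a_xy, slots):
    """Greedy selection of binary pulses b(x, y) for one aperture bit so that
    (s . b^T) / t approximates the desired intensity a(x, y), cf. equation (6).
    `slots` are the slot durations [t1, ..., tn], longest first."""
    t = float(np.sum(slots))
    b = np.zeros(len(slots), dtype=np.uint8)
    accumulated = 0.0
    for i, ti in enumerate(slots):          # traverse longest to shortest
        if accumulated + ti / t <= a_xy + 1e-9:
            b[i] = 1
            accumulated += ti / t
    return b, accumulated                   # accumulated ~ a(x, y)

# Example with the eight exposure slots used in figure 4:
slots = [3.2, 1.6, 0.8, 0.4, 1/5, 1/10, 1/20, 1/40]
b, approx = pulse_series(0.6, slots)
```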
7 Limitations and Outlook
In this paper, we have shown how to trade compression quality for signal-to-noise ratio or for shutter time by means of adaptive coded apertures. This is beneficial if (i) image compression by camera hardware is obligatory, or (ii) it is known
at capture time that the image will be compressed later (e.g., for web content). Even a little compression can, depending on the image content, already lead to a substantial gain in light throughput (a gain factor of 3.6 in our experiments). Alongside compression, downsampling would be another (alternative or complementary) option if the full resolution of the sensor is not required for a particular application. In this case, we would trade resolution for signal-to-noise ratio or for shutter time. We will explore this in the future. For evaluation, we captured and computed a total of 160 images of different scenes with different lighting conditions, focus and depth-of-field settings, and quality factors. Figure 2 illustrates representative examples. Our experiments showed, that adaptive coded apertures are most efficient for large depth-of-field photographs of general 3D scenes because JPEG compression is efficient for the high frequency content imaged by narrow apertures and because the light throughput is initially low. The light throughput gain increases with decreasing quality factor. This behavior is widely invariant to focus settings and lighting conditions unless they strongly influence the image frequencies (e.g., for under- or over-exposed images, or for strong out-of-focus situations). Adaptive coded apertures are even robust for dynamic scene changes or slight movements of the camera during sampling and capturing, since the spectra of ig and ic will normally not differ significantly. They can be computed quickly when implemented in hardware (e.g., within 13 ms for a 1024x1024 image resolution using our CUDA implementation on an NVIDIA GeForce 8800 Ultra), while the more time-consuming bokeh transformation can be carried out off-line and not during capturing. In principle, ic could also be captured with an appropriately widened circular aperture whose radius is derived from the remaining frequencies after JPEG compression. This would have the advantage that the original diaphragms of consumer lenses could be used directly without an additional SLM. However, as shown in [9], it would be less efficient, since adaptive coded apertures also optimally cover asymmetric spectra that do not follow a 1/f distribution. The main limitations of our proof-of-concept prototype are the low light transmittance (only 30% when completely transparent) and the low contrast (7:1) of the LCA employed, which makes it inapplicable in practice. It also introduces slight color shifts and interference effects on partially polarized light. However, this does not affect the advantage of adaptive coded apertures in general and can easily be improved by using reflective DMAs or LCoS panels as explained in [5]. A higher aperture resolution (tonal and spatial) would also produce more precise results. Furthermore, the SLM and the sensor should also be synchronized to allow a direct sensor integration of the coded exposure slots that enable intensity masks. Improving our hardware prototype will be part of our future research. Currently, the applied coded aperture pattern is scaled manually to roughly match the depth of field in ig , while remaining depth-of-field differences between ic and ig are removed by the bokeh transformation. Our future work also includes automating this scale estimation. Another interesting avenue for future research is to investigate temporally adaptive coded apertures (i.e., content-dependent exposure codes) for enhanced motion deblurring.
References 1. Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled Photography: Mask Enhanced Cameras for Heterodyned Light Fields and Coded Aperture Refocusing. ACM Trans. Graph (Siggraph) 26, 69 (2007) 2. Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and Depth from a Conventional Camera with a Coded Aperture. ACM Trans. Graph (Siggraph) 26, 70 (2007) 3. Bando, Y., Chen, B.Y., Nishita, T.: Extracting Depth and Matte using a ColorFiltered Aperture. ACM Trans. Graph (Siggraph Asia) 27, 1–9 (2008) 4. Zhou, C., Nayar, S.K.: What are Good Apertures for Defocus Deblurring? In: IEEE International Conference on Computational Photography (2009) 5. Zhou, C., Lin, S., Nayar, S.K.: Coded Aperture Pairs for Depth from Defocus. In: IEEE International Conference on Computer Vision, ICCV (2009) 6. Liang, C.K., Lin, T.H., Wong, B.Y., Liu, C., Chen, H.: Programmable Aperture Photography: Multiplexed Light Field Acquisition. ACM Trans. Graph (Siggraph) 27, 55:1–55:10 (2008) 7. Nagahara, H., Zhou, C., Watanabe, T., Ishiguro, H., Nayar, S.K.: Programmable aperture camera using lcos. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part VI. LNCS, vol. 6316, pp. 337–350. Springer, Heidelberg (2010) 8. Grosse, M., Bimber, O.: Coded Aperture Projection. In: IPT/EDT, pp. 1–4 (2008) 9. Grosse, M., Wetzstein, G., Grundh¨ ofer, A., Bimber, O.: Coded aperture projection. ACM Trans. Graph. 22, 22:1–22:12 (2010) 10. Green, P., Sun, W., Matusik, W., Durand, F.: Multi-aperture photography. ACM Transactions on Graphics (Proc. SIGGRAPH) 26 (2007) 11. Horstmeyer, R., Oh, S.B., Raskar, R.: Iterative aperture mask design in phase space using a rank constraint. Opt. Express 18, 22545–22555 (2010) 12. Levin, A.: Analyzing depth from coded aperture sets. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part I. LNCS, vol. 6311, pp. 214–227. Springer, Heidelberg (2010) 13. Raskar, R., Agrawal, A., Tumblin, J.: Coded exposure photography: motion deblurring using fluttered shutter. In: ACM SIGGRAPH 2006 Papers, SIGGRAPH 2006, pp. 795–804. ACM, New York (2006) 14. Agrawal, A., Xu, Y.: Coded exposure deblurring: Optimized codes for psf estimation and invertibility. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2066–2073 (2009) 15. Tai, Y.W., Kong, N., Lin, S., Shin, S.Y.: Coded exposure imaging for projective motion deblurring. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2408–2415 (2010) 16. Wallace, G.K.: The jpeg still picture compression standard. Commun. ACM 34, 30–44 (1991) 17. Zhai, G., Zhang, W., Yang, X., Lin, W., Xu, Y.: Efficient deblocking with coefficient regularization, shape-adaptive filtering, and quantization constraint. IEEE Transactions on Multimedia 10, 735–745 (2008) 18. Healey, G., Kondepudy, R.: Radiometric ccd camera calibration and noise estimation. IEEE Trans. Pattern Anal. Mach. Intell. 16, 267–276 (1994) 19. Hess, R.: An open-source siftlibrary. In: Proceedings of the International Conference on Multimedia, MM 2010, pp. 1493–1496. ACM, New York (2010)
Display Pixel Caching

Clemens Birklbauer¹, Max Grosse², Anselm Grundhöfer², Tianlun Liu¹, and Oliver Bimber¹

¹ Johannes Kepler University Linz, {firstname.lastname}@jku.at
² Bauhaus-University Weimar, {firstname.lastname}@uni-weimar.de

Abstract. We present a new video mode for television sets that we refer to as display pixel caching (DPC). It fills empty borders with spatially and temporally consistent information while preserving the original video format. Unlike related video modes, such as stretching, zooming, and video retargeting, DPC does not scale or stretch individual frames. Instead, it merges the motion information from many subsequent frames to generate screen-filling panoramas in a consistent manner. In contrast to state-of-the-art video mosaicing, DPC achieves real-time rates for high-resolution video content while processing more complex motion patterns fully automatically. We compare DPC to related video modes in the context of a user evaluation.
1 Introduction and Motivation
A variety of standard video modes that stretch or zoom lower resolution video content linearly to take full advantage of large screen sizes have been implemented in TV sets. When content and screen aspect ratios differ, format proportions may be compromised, video content may be clipped, or screen regions may remain unused. Newer techniques, such as video retargeting and video upsampling, rescale individual video frames and can potentially match them to the display resolution and aspect ratio. However, none of these methods can display simultaneously more than is contained in a single frame. With display pixel caching (DPC), we take a completely different approach. Instead of zooming, stretching, or retargeting individual frames, we merge the motion information from many subsequent frames to generate high-resolution panoramas in an ad-hoc and fully automatic manner (figure 1). Thus, we buffer pixels in border regions as long as they are visually reasonable. Hereafter, we refer to these regions as pixel cache (cache, in short). In contrast to conventional video mosaicing, however, the challenges to DPC are achieving real-time rates for high-resolution input content – and ensuring spatial and temporal consistency. To our knowledge, the concept of a DPC video mode is novel.
2 Related Work
Video retargeting approaches, such as [1,2,3,4], apply a non-linear rescaling of input footage to fit it into other format ratios. Various techniques offer interactive
Fig. 1. Results of DPC with leftward camera motion (a-d), downward motion and pre-filled cache (bottom border) (e,f), forward motion (g,h)
constraint editing during post-production to achieve high-quality results (e.g., [2]) or require processing the entire video cube globally (e.g., [4]). Online processing is therefore not possible in these cases. Methods that do support online processing achieve near real-time rates (e.g., 10fps for 720p [2]), but at moderate quality and robustness. In general, video retargeting rescales only the content of the individual frames and does not assemble high-resolution panoramas from multiple frames, as in DPC. Video upsampling , such as in [5], increases the spatial resolution of videos and is real-time capable, but does not adapt the aspect ratio and – again –
displays only the content that is present in individual frames. However, it can be applied as a preceding step to any other video mode (including DPC). Video completion and motion inpainting , such as [6,7], are video extensions of image inpainting. These methods can estimate motion in clipped video frame regions to guide a filling process that considers corresponding pixels in neighboring frames. They can, for example, support video stabilization [6] and are presently not real-time capable (e.g., 2fps for 720x486 in [6]). In contrast to common video completion and inpainting, DPC fills significantly larger frame regions in real time. An application of video completion to video extrapolation for filling large border regions in video frames iteratively was presented in [8]. To avoid visual artifacts, this method removes salient content and applies blur that increases towards the outer edges of the extrapolated regions. Therefore, these regions are only suitable for peripheral vision and are only applicable for displays that cover a large field of view and restrict eye movement, such as head-mounted displays. This method is not real-time capable. Video mosaicing techniques generate a mosaic image from a frame sequence by registering and blending the recorded video footage. Generally, such mosaicing techniques are optimized for both static scenes and controlled camera movements, such as pans, tilts, and translations. They can easily achieve realtime rates for moderate resolutions (e.g., 15fps for 640x480 in [9]). For unstructured camera motions, for arbitrary scenes that cause parallax effects at different depths, and for complex mixtures of global camera motion and local object motion, the motion flow in the images can vary substantially in different regions. While conventional video mosaicing fails in such a situation, this is the typical use case for DPC. Techniques that apply more advanced alignment and blending strategies (e.g., [10,11]) reduce local misalignments but are too expensive for real-time applications at high resolutions. An overview of the DPC video processing pipeline is illustrated in figure 2. The following sections explain its individual components. Additional implementation details can be found in the supplementary material.
3 Fast Image Warping
Most mosaicing techniques assume simple pan/tilt camera movements and static and distant scenes. In this case, frame-wise linear motion models (i.e., per-frame homographies) are efficient when registering subsequent frames into a panorama. Arbitrary video content, however, can contain spatially and temporally varying motions that result from parallax effects and local object movements. One homography per frame is not sufficient in these cases, as shown in figure 3b. In this section, we explain how we segment varying motion into clusters of motion layers that are transformed individually in a temporally and spatially consistent way. We show how this can be achieved in real time for high-resolution input videos and under the present memory and performance constraints of GPUs. To identify and geometrically align different motion layers, we process
Fig. 2. DPC video processing pipeline: Motion patterns of input video frames are analyzed and segmented into motion layers (1); different motion layers are warped and accumulated (2); empty cache regions are optionally filled by smooth extrapolation (3); if shot transitions are detected, the current cache undergoes the same transition as the original frames (4); the cache is finally displayed together with the original frame (5)
the motion flow field provided for P-frames of XVID MPEG4-encoded input streams with 4x4 pixel blocks for motion compensation and half-pixel motion vector accuracy.

3.1 Motion Analysis and Segmentation
Figure 3 outlines our approach to motion analysis and segmentation, which clusters the motion flow field into multiple layers of homogeneous motion regions in each frame f . Motion vectors in uniform texture regions are unstable for motion estimation. They are initially filtered out and stored in the vector set U f (1). All remaining vectors serve as input to our motion segmentation. Three conditions must be fulfilled before a new motion layer can be computed reliably (2): First, the number of existing layers must not exceed a predefined maximum (Tlc ). This ensures that the scene is not segmented into a large number of small and unstable motion clusters, as explained in [12]. Second, the number of input motion vectors must exceed a minimum (Tvc ) to avoid small clusters, since these are mostly noisy. Third, the estimated frame rate that is achieved when computing the next layer must be above a user-defined desired frame rate Tf r . With this condition, we ensure temporal consistency, since the processing time for segmentation is not constant for every frame and layer. The processing time for a new layer is predicted during run time from previous processing times required for different motion vector counts. Details on this prediction can be found in the supplementary material (section 1). If all conditions are met, RANSAC is applied to derive the 3x3 homography matrix Hlf for the new layer which contains the highest number of inliers when matching the input motion vectors with the motion vectors computed from Hlf
Fig. 3. Top: Flow diagram of motion segmentation. Bottom: sample segmentation (colors represent layer IDs) for forward camera motion causing parallax effects (a), the scene registered incorrectly with a single layer (b), and same frames registered correctly with multiple layers (c).
(3). Note that Hlf denotes the homography matrix representing the lth motion layer in the f th frame. Let us define the vector distance error as the distance between an original motion vector and its corresponding counterpart computed from Hlf . All motion vectors that match Hlf with a small vector distance error of less than Tgl are considered to be part of homogeneous global motion. They are assigned to the same layer and stored in the global layer map Gfl (4). Note that Gfl stores all homogeneous global motion vectors for each layer l in the f th frame. If the vector distance error of the remaining motion vectors is larger than Tlo , they are considered to be part of inhomogeneous local motion and are stored in the set of local motion vectors Lf (5). Motion vectors without any assignment in Gfl , Lf , or U f are forwarded to the next iteration for computing the next layer, just as explained above. If no further motion layer can be computed reliably since the conditions (2) can no longer be met, all unassigned motion vectors (i.e., the remaining motion vectors, Lf and U f ) are assigned to their best match of the existing motion layers (6). Thus, an unassigned motion vector that matches Hlf with the smallest vector distance is assigned to layer Gfl .
Fig. 4. Examples of an accumulation buffer and a history ring-buffer of size B = 3. Blue arrows indicate the traversal of the buffers for rendering the final composite at the current frame. Red arrows indicate the update of the buffers before a new frame is rendered. The green arrows indicate the regional motion of the vector field in each frame.
Finally, a mode filter re-assigns the layer assignment inside a 5x5 neighborhood region (7). This smooths noise at segment boundaries. The result of this process is a segmentation of the main global motion clusters Gfl with their best matching homography matrices Hlf. Local and noisy motions are assigned to the best fitting and most stable global motion clusters. Empirically, we found that the following configuration suits most 1080p video content: Tlc = 4...10, Tvc = 2%, Tgl = 1 pixel, and Tlo = 5 pixels.
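A condensed sketch of this layer loop is given below, using OpenCV's RANSAC homography estimation on the motion-vector block positions. The run-time frame-rate prediction, the uniform-texture filtering, and the separate local-motion threshold T_lo are omitted here, and the parameter values are illustrative.

```python
import numpy as np
import cv2

def segment_motion_layers(src_pts, dst_pts, max_layers=10,
                          min_fraction=0.02, t_global=1.0):
    """Cluster motion vectors (block centers src_pts -> dst_pts, Nx2 float32)
    into layers of homogeneous global motion, each described by a homography."""
    n_total = len(src_pts)
    unassigned = np.ones(n_total, dtype=bool)
    layers = []                                    # list of (H, member mask)
    while (len(layers) < max_layers
           and unassigned.sum() > max(4, min_fraction * n_total)):
        idx = np.where(unassigned)[0]
        H, inliers = cv2.findHomography(src_pts[idx], dst_pts[idx],
                                        cv2.RANSAC, t_global)
        if H is None or inliers is None or inliers.sum() < 4:
            break
        member = np.zeros(n_total, dtype=bool)
        member[idx[inliers.ravel() == 1]] = True   # homogeneous global motion
        layers.append((H, member))
        unassigned &= ~member
    # Remaining vectors: assign each to the layer whose homography predicts
    # its motion with the smallest distance error.
    for j in np.where(unassigned)[0]:
        p = np.array([src_pts[j][0], src_pts[j][1], 1.0])
        errors = []
        for H, _ in layers:
            q = H @ p
            errors.append(np.linalg.norm(q[:2] / q[2] - dst_pts[j]))
        if errors:
            layers[int(np.argmin(errors))][1][j] = True
    return layers
```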
3.2 Motion Accumulation and Warping
The homographies of the determined motion layers can subsequently be used for warping previous frames to fill blank screen regions appropriately. Note that the current frame always remains unmodified. It is displayed in the screen center – either unscaled (leading to 4 borders, as shown in figure 1g) or matched to two edges of the display frame while maintaining the original aspect ratio (leading to 2 borders, as illustrated in figures 1a,e). We store the current frame and b−1 preceding frames together with their pixelwise homography assignments (Hlf ) in a history ring-buffer of size B. Initially, we define a regular 2D grid in the resolution that equals the motion-vector resolution of the MPEG data. Then we traverse the history ring-buffer in frontto-back order. For each step (i = 0, ..., b−1), we texture-map frame f −i onto the current instance of the grid (for i = 0 the grid is un-warped), and then render and alpha-blend (average) the result into a composite image. Next, we warp the grid by multiplying each grid point with its corresponding homography matrix
(Hlf −i ). Since each grid point always belongs to a constant position in each frame of the history ring-buffer, Hlf −i can be looked up easily. Thus, the grid is warped successively by traversing the history ring-buffer. Figure 4 illustrates this for a history ring-buffer of size B = 3 (with our hardware, we are restricted to B ≤ 65, depending on video and display resolutions). It is possible that the size of the history ring-buffer is exceeded, because more than B frames remain valid in the pixel cache. If this is the case (i.e., b = B), the oldest frame (f −B −1) is moved from the history ring-buffer to an accumulation buffer (A). Each entry in A stores a pixel’s color and texture coordinates to the position within its original (unwarped) frame. In preparation for moving the oldest frame to A, A must itself always be warped in accordance with the motions in the newest frame that is moved into the history ring-buffer, as illustrated in figure 4. This is done by re-sampling A with a new regular 2D grid that is extended to the entire accumulation buffer. Every point of this grid is warped using the homography Hlf that is assigned to the position of the current frame which corresponds to the texture coordinates saved in A at this point. The entirely warped grid is texture-mapped with the content of A, and is rendered and alpha-blended into the composite image. Finally, the correctly warped oldest frame (f −B −1) is rendered on top of A before removing it from the history ring-buffer. This updates the accumulation buffer.
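A strongly simplified sketch of this compositing is shown below, with a single global homography per frame instead of the per-layer grid warping described above. The canvas size, the plain averaging of overlapping pixels, and the omission of the accumulation buffer are assumptions; in practice an additional translation into the larger cache canvas would be pre-multiplied, and the unmodified current frame would still be drawn on top of the result.

```python
import numpy as np
import cv2

def composite_cache(history, canvas_size):
    """Composite buffered frames into the pixel cache.  `history` is a list of
    (frame, H) pairs, newest first; H maps each frame into the coordinate
    system of the frame that follows it (identity for the newest frame)."""
    w, h = canvas_size
    composite = np.zeros((h, w, 3), np.float32)
    weight = np.zeros((h, w), np.float32)
    H_acc = np.eye(3)
    for frame, H in history:                    # traverse front to back
        H_acc = H_acc @ H                       # accumulate motion over time
        warped = cv2.warpPerspective(frame.astype(np.float32), H_acc, (w, h))
        mask = cv2.warpPerspective(np.ones(frame.shape[:2], np.float32),
                                   H_acc, (w, h))
        composite += warped * mask[..., None]   # alpha-blend (average)
        weight += mask
    weight[weight == 0] = 1.0
    return composite / weight[..., None]
```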
4 Consistent Cache Handling
While section 3 explained how the pixel cache is filled successively with motion data, this section presents techniques for initializing empty cache regions and for clearing the cache in a consistent way when shot transitions occur.

4.1 Cache Initialization
Without motion, no additional pixels are added to the cache. Furthermore, the border edges in the cache are generally neither straight nor screen-aligned because they display locally varying motion that results, for instance, from parallax (figure 5a). We implemented three alternative initialization options: We can initialize unfilled portions of the cache with a smooth color transition by extrapolating the current cache edges spatially, as illustrated in figure 5b. The cache edges are thereby “smeared” iteratively towards the screen edges. This does not leave any blank borders. However, in cases of little motion, or during the first frames after shot transitions, the majority of the cache will be blurred. Details on this extrapolation can be found in the supplementary material (section 2). Instead of displaying the entire cache (including the fully initialized but possibly largely blurred regions), we can optionally clip it to screen-aligned edges in such a way that the cache regions between the clipping edges and the screenedges remain blank. One possibility is to place the clipping edges progressively at the outermost extent of the original cache content. This initializes only the
Fig. 5. Cache initialization options: no initialization (a), full initialization (b), progressive (c) and conservative clipping (d). The dashed lines indicate the two possible clipping edges. The solid lines show the edges of the original frame. The samples show a leftward camera motion.
regions between original cache content and clipping edges (figure 5c). Another possibility is to place the clipping edges conservatively at the inner limit of the original cache content (i.e., at the outermost extents, where the pixels along the clipping edges are originally cached and not initialized). This does not require any initialization, but may cut some of the valid cache content (figure 5d). The clipping edge, of course, moves over time, as can be seen in figures 1a-d for conservative clipping. These three initialization options, together with the untreated DPC results and other video modes, were tested in the course of a user evaluation (section 5.2). Note that we apply a temporal smoothing filter to avoid distracting flickering within the initialized cache regions that results from rapid intensity and color changes at cache edges over multiple frames. Details on this filter are presented in the supplementary material (section 3).

4.2 Cache Clearance and Shot Detection
While section 4.1 described how the cache can be initialized, this section explains how to clear the entire cache consistently when a shot transition is detected. In [13], a method for estimating the information transport between two consecutive frames (f, f + 1) is described. This is achieved by computing a 2D RGB transition histogram between corresponding pixel pairs in these frames and then deriving the mutual information, I(f, f + 1), from these histograms. This method is widely applied for detecting hard cuts between two consecutive frames in video sequences. We extend this approach to consider a look-ahead of
n frames to enable the detection of hard cuts, fade-in-fade-out transitions, and cross-fades. A shot transition between frame f and frame f + n can be detected if

$ST(f, n, m) = 1 - \frac{\sum_{i=-m}^{0} I(f+i,\, f+n+i)}{(m+1)\, I(f+1,\, f+n+1)} > T_{in}$,    (1)

where m is the number of previous frames whose information transport is averaged to cancel noise, and Tin is the information transport threshold (n=10, m=4, and Tin=0.0412 are suitable for most video content). When a shot transition is detected, the cache is cleared smoothly between frames f and f + n. This is achieved by linearly blending the cache state at each intermediate frame f + j, 0 ≤ j < n of the old shot with the initialized cache of the new shot in frame f + n. This ensures that the cache content is faded in the same way as the original video content and that hard cuts, fade-in-fade-out transitions, and cross-fades are supported. Details on the applied cross-fading are presented in the supplementary material (section 4).
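A sketch of the information-transport computation and of the test in equation (1) as reconstructed above is given below; the number of histogram bins is an assumption, and only a single intensity channel is shown, whereas the paper builds the transition histogram over RGB.

```python
import numpy as np

def mutual_information(f1, f2, bins=32):
    """Mutual information between two grayscale frames, computed from their
    2D transition histogram (single channel for brevity)."""
    hist, _, _ = np.histogram2d(f1.ravel(), f2.ravel(),
                                bins=bins, range=[[0, 256], [0, 256]])
    pxy = hist / hist.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

def shot_transition(frames, f, n=10, m=4, t_in=0.0412):
    """Equation (1): compare the averaged information transport of previous
    frame pairs with the current pair to detect a transition between
    frame f and frame f + n."""
    num = sum(mutual_information(frames[f + i], frames[f + n + i])
              for i in range(-m, 1))
    den = (m + 1) * mutual_information(frames[f + 1], frames[f + n + 1])
    return 1.0 - num / den > t_in
```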
5 Results and Limitations
In this section, we present performance measures and the results of a user evaluation that compares DPC and its different cache initialization options with alternative video modes. DPC is a trade-off between performance and quality and has several limitations that are discussed below.
5.1 Performance
To achieve adequate load balancing, we decoded the videos via FFmpeg on the CPU while executing most of the DPC video processing pipeline in parallel on the GPU. We timed DPC on a 2.67 GHz QuadCore PC with 6GB RAM and an NVIDIA GTX 285 graphics board (1024MB VRAM). We achieved 50fps when showing PAL videos on a 720p screen, 26fps for 720p on 1080p, and 15fps for 1080p on 4x1080p (3840x2160, Quad HD).
5.2 User Evaluation
To determine how DPC performs compared to the alternative display modes, and to identify the optimal cache initialization option, we carried out a user evaluation with 59 subjects (31 male, 28 female; 15-67 years of age, with an average age of 30 years). We presented side-by-side pairs of video sequences in their original (unmodified) format and with one of the following modes: stretch, zoom, retargeting, and DPC with the four cache initialization options described in section 4.1. For each of the seven modes, two video samples were shown (one
in 4:3 and one in cinemascope format) on a 16:9 screen. All videos were looped to enforce scene transitions. For retargeting, we initially evaluated the automatic versions of the methods proposed in [1,2,3]. We chose the method in [3] with automatic face detection for the user evaluation since – although it is far from being real-time capable – it delivered the best qualitative results. The subjects were asked to compare each video mode with the original (unmodified) format and to indicate their preferences by scoring and, optionally, giving comments. Figure 6 presents the results averaged over all subjects and both video formats (4:3 and cinemascope). From the ratings and the additional comments that were provided by the subjects, we draw the following three conclusions: First, unmodified content with borders is preferred to video modes that stretch or clip content unnaturally to fill borders. The same would apply to retargeting in failure cases if, for instance, faces or people were unnaturally stretched. Second, non-straight and non-screen-aligned border edges are distracting and alien, and are therefore less preferred. Presumably, this results from our TV viewing habits, which have accustomed us to rectangular frames with clear edges. Third, large empty border regions that are initialized with a smooth color transition (i.e., blur) are distracting and should be clipped. The viewers would even prefer some cache content to be lost if this reduced blurred areas. Therefore, the conservative clipping approach was preferred most. The video extrapolation technique presented in [8] focuses on filling image regions that are mainly perceived in the periphery. Due to the limited resolution and accommodation capabilities in the non-foveal field, a strong blur within the filled regions can be tolerated. Our study reveals, however, that blur is not accepted if the visual focus can be at any position of the screen – which is generally the case for normal viewing distances. Our future work includes an extended user study that will investigate, for instance, video mode preferences linked to genre and for whole movies.
Fig. 6. User preferences for various video modes over the original (unmodified) content: The scores range from 1 (corresponding video mode preferred) to 6 (unmodified original preferred). The bar chart displays the average, the median, and the lower and upper quartiles.
5.3 Limitations
As soon as a pixel is moved into the pixel cache, it loses its original speed and motion behavior. Locally moving objects, for instance, freeze when they are cached because their true motion within the cache is unknown. We update them successively, based only on the motion that is contained in the current frame. Extrapolating the motion path of every single cached pixel based on its motion history will be a better (but also more time-consuming) option. For example, the extrapolation of cyclic motion may become possible with video texture techniques [14]. Exploiting real-time capable extrapolation techniques or cyclic and non-cyclic motion patterns will be part of our future work. Depth from motion techniques that are based on MPEG motion vectors, as already used for 2D to 3D conversion [15], may enable the approximation of coarse depth structures in real time to resolve segmentation and warping ambiguities. We will also explore these techniques. New objects entering from a frame edge that already extends into the cache appear inconsistent with the nearby cached region. This happens, for instance, when new objects enter the field of view while moving in the same direction as the camera but faster. Currently, clearing the entire cache as done for shot transitions is the only solution in these cases. However, we found these cases to be relatively rare in practice. Like all related techniques that rely on optical flow, DPC fails for motions with undefined or noisy MPEG motion vectors. However, our multi-layer warping provides better results in these cases than warping with single homographies per frame. Better compression techniques and run-time flow computations will be explored in the future. Alternatively, a conservative cache clearance strategy will avoid long caching times and accumulated registration errors.
6 Summary
The main aim of DPC is to support a novel form of viewing experience that clearly surpasses the frame-wise scaling or stretching of other video modes. As for other advanced video modes, such as retargeting, DPC leaves room for improvement. The evaluation of our first implementation, however, is promising. In this paper, we have presented DPC as a novel video mode, and made several technical contributions to real-time video processing (i.e., motion analysis and segmentation, motion accumulation and warping, cache initialization and clearance, and shot detection). The supplementary material provides additional implementation details. In the future, DPC may also find applications in hardware systems, such as TV set-top boxes, modern camcorders, digital cameras, and mobile phones. Most of these devices already support temporal caching of video content (e.g., the “time-warp” functionality of modern set-top boxes) or simple image stitching for panorama imaging. DPC may make it possible to go beyond this.
References
1. Chiang, C., Wang, S., Chen, Y., Lai, S.: Fast JND-based video carving with GPU acceleration for real-time video retargeting. IEEE Trans. Cir. and Sys. for Video Technol. 19, 1588–1597 (2009)
2. Krähenbühl, P., Lang, M., Hornung, A., Gross, M.: A system for retargeting of streaming video. In: SIGGRAPH Asia 2009: ACM SIGGRAPH Asia 2009 Papers, pp. 1–10. ACM, New York (2009)
3. Rubinstein, M., Shamir, A., Avidan, S.: Multi-operator media retargeting. ACM Trans. Graph. 28, 1–11 (2009)
4. Wang, Y.S., Lin, H.C., Sorkine, O., Lee, T.Y.: Motion-based video retargeting with optimized crop-and-warp. In: SIGGRAPH 2010: ACM SIGGRAPH 2010 Papers, pp. 1–9. ACM, New York (2010)
5. Freedman, G., Fattal, R.: Real-time GPU-based video upscaling from local self examples. In: ACM SIGGRAPH 2010 Talks (2010)
6. Matsushita, Y., Ofek, E., Ge, W., Tang, X., Shum, H.Y.: Full-frame video stabilization with motion inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1150–1163 (2006)
7. Liu, M., Chen, S., Liu, J., Tang, X.: Video completion via motion guided spatial-temporal global optimization. In: ACM International Conference on Multimedia (2009)
8. Avraham, T., Schechner, Y.: Ultrawide foveated video extrapolation. IEEE Selected Topics in Signal Processing 5, 321–334 (2011)
9. DiVerdi, S., Wither, J., Höllerer, T.: Envisor: Online environment map construction for mixed reality. In: IEEE Virtual Reality 2008, pp. 19–26 (2008)
10. Shum, H.Y., Szeliski, R.: Systems and experiment paper: Construction of panoramic image mosaics with global and local alignment. Int. J. Comput. Vision 36(2), 101–130 (2000)
11. Brown, M., Lowe, D.G.: Automatic panoramic image stitching using invariant features. International Journal of Computer Vision 74(1), 59–73 (2007)
12. Wang, J., Adelson, E.: Representing moving images with layers. IEEE Transactions on Image Processing 3, 625–638 (1994)
13. Cernekova, Z., Nikou, C., Pitas, I.: Shot detection in video sequences using entropy-based metrics. In: IEEE 2002 International Conference on Image Processing, pp. 421–424 (2002)
14. Agarwala, A., Zheng, K.C., Pal, C., Agrawala, M., Cohen, M., Curless, B., Salesin, D., Szeliski, R.: Panoramic video textures. In: SIGGRAPH 2005: ACM SIGGRAPH 2005 Papers, pp. 821–827 (2005)
15. Pourazad, M., Nasiopoulos, P., Ward, R.: An H.264-based scheme for 2D to 3D video conversion. IEEE Transactions on Consumer Electronics 55, 742–748 (2009)
Image Relighting by Analogy
Xiao Teng and Tat-Jen Cham
Centre for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University
Abstract. We propose and analyze an example-based framework for relighting images. In this framework, there are a number of images of reference objects captured under different illumination conditions. Given an input image of a new object captured under one of the previously observed illumination conditions, new images can be synthesized for the input object under all the other illumination conditions that are present in the reference images. The framework does not require any other prior knowledge of the reference and target objects, except that they share the same albedo. Although it helps if the reference objects have a shape similar to the target object, a sphere or ellipsoid, which provides plenty of local geometry samples, is sufficient to build up a look-up table, as this method solves the problem locally. Gradient domain methods are introduced to finally generate visually pleasing results. We demonstrate this framework on synthesized data and real images.
1 Introduction
Despite the use of high-end cameras and equipment, most photographs taken by amateurs are flat and dull compared to those taken by professionals. We believe inappropriate lighting is often responsible for poor quality photos. The best photographers are distinguished by their patience in waiting for perfect outdoor lighting conditions or their ability to design ideal illumination conditions in studios. Alternatively, one may try to manually modify or enhance the lighting in images through photo-editing software. However, even with high-end tools, tedious work is often necessary to generate visually-pleasing results. Synthesizing images under different illumination conditions remains an important but challenging problem in both the computer vision and graphics communities. There is substantial research in creating new methods to capture the complex light interactions existing in nature. However, most of these either rely on expensive custom-built equipment, or only focus on a class of objects, such as faces, to make the problem tractable with the proposed approaches. In many computer vision tasks such as object recognition, the large image variation due to illumination [1] is one of the major problems encountered. It is difficult to identify whether intensity changes are due to surface reflectance, or illumination, or both. For example, previous research on face recognition [2] reveals that the changes induced by illumination can easily be larger than the variations due to the change of identity. The illumination problem has received
considerable attention in the recognition literature, with most of the methods trying to discard variations due to lighting. Intuitively, it may be helpful if we can first relight test images according to the illumination under which the training images are captured. Image relighting potentially has many other practical applications. It will be valuable in areas as diverse as interior light design, video communication, augmented reality, and many other photo/video post-production tasks. Traditional graphics techniques for scene relighting require 3D geometry for rendering, while existing image-based relighting methods usually need a labor-intensive setup for every scene. The approach investigated in this paper takes only one image as input. Using a database of reference images which are pre-rendered under different illumination conditions, we propose new methods to synthesize the scene in the input image under alternative illumination conditions.
2 Related Work
The illumination recorded in images mainly depends on three basic elements: scene geometry, surface reflectance and light source. Many techniques explicitly measure or estimate these properties for relighting. The geometry-based solutions for generating new images based on traditional graphics techniques usually render the object under some artificial lighting conditions with a 3D representation of the object obtained from special equipment (e.g., a laser range camera), or estimated from 2D image data [3]. Other methods used simple geometric representations such as an ellipsoid [4], or aligned a generic 3D face model [5] to the input image, before compositing the newly rendered lighting onto the input image. Such methods used simplified approximations for extremely complex geometry, but are limited to object groups with prior knowledge like faces. Fuchs et al. [6] fitted a morphable model to face images in different poses and with different illumination conditions, which is then used to render new images under complex lighting conditions. Okabe et al. [7] assigned an approximate normal map over a photo with a pen-based user interface to enable relighting. The primary difficulty with the above approaches is that they all require some knowledge of 3D geometry. Image-based methods relight real scenes without a complex rendering process as in traditional geometry-based techniques, but require surface reflectance and illumination information to be extracted through multiple training images. These approaches use recorded images to render new images, do not require any explicit geometric representations, and the rendering is independent of scene complexity as image pixels are directly manipulated. Methods based on Basis Functions [1,8] were proposed to first decompose the luminous intensity distributions into a series of basis luminance functions, followed by relighting the image under a novel illumination by computing a weighted summation of the rendered images, which are pre-recorded under each basis luminance function. Reflectance Function-based methods [9,10] explicitly measure the reflectance properties at all visible parts of the scene for relighting.
Fig. 1. Framework overview: for every pixel of the input image B, we search for the best match by intensity and gradient in the example image A, which has the same illumination condition; once the best match is found, we look up its appearance in the example image Ap, which has the other illumination condition, and assign that value to the result Bp
In example-based relighting, Shashua and Riklin-Raviv [11] introduced the Quotient Image for relighting of objects in the same class. When given three images of each object in the same class, and another image of a new object in the same class, novel images of the object under new illumination conditions can be synthesized. Zhang et al. [12] introduced the Spherical Harmonic Basis Morphable Model for face relighting. From a single image, they showed it is possible to estimate the model parameters of face geometry, spherical harmonic bases and illumination coefficients, after which the face texture can be computed from spherical harmonic bases. Thereafter, a new single face image can be relit by an example face image under a novel illumination condition. Peers et al. [13] proposed a face relighting approach to transfer the desired illumination on the example subject to the new subject, using Quotient Images and image warping.
3 Framework Overview
These are some key considerations made in choosing our approach: we intend to avoid explicit 3D model acquisition as required in geometry-based methods; we also intend to avoid tedious laboratory setups as required for existing image-based methods such as those used in the movie industry. Some previous example-based relighting research exists; however, methods for general objects have not been well studied. Inspired by image analogies [14], we propose our method in this section, and it is illustrated in Fig. 1.
3.1 Preliminaries
Let l, l′ represent two different lighting conditions. Suppose we have an image IB of a scene B captured under illumination condition l, and we want to synthesize
image I′B of the same scene illuminated under lighting condition l′. Suppose we also have example images IA1, IA2, . . . of different scenes A1, A2, . . . but with the same lighting condition as in image IB, and corresponding images I′A1, I′A2, . . . of these scenes taken under illumination l′. Intuitively, if we have enough examples, we can find identical elements of IB in IA1, IA2, . . .; then the result I′B can be synthesized from the corresponding relighted elements in I′A1, I′A2, . . . The primary problems are: how many example pairs do we need, and how can we find the correct correspondence between the images under the same illumination condition? We will show that even one example pair can be useful if we are able to assume that there is a single directional light source, and the surfaces can be adequately modeled as spherical and Lambertian with uniform albedo. Formally, we establish the following equations:
I_A(u_a, v_a) = \lambda \rho_A N_A \cdot L, \qquad I'_A(u_a, v_a) = \lambda' \rho_A N_A \cdot L',
I_B(u_b, v_b) = \lambda \rho_B N_B \cdot L, \qquad I'_B(u_b, v_b) = \;?    (1)
where λ and ρ are the illumination intensity and reflectivity coefficient respectively, N is the unit vector of the surface normal, L is the unit vector in the lighting direction and (u, v) is a particular pixel position in image I. The subscripts A and B represent two different scenes, while the two lighting conditions are distinguished by the presence/absence of the prime symbol. The goal is to compute I′B given the three observable images IA, I′A and IB. The main challenge in this problem is that the parameters λ, ρ, N and L are all unknown for both scenes and both lighting conditions.
3.2 Similarity Matching
We will show how the similarities between two images under the same lighting condition can be captured. With the previous assumptions, we attempt to find correspondences between scene patches in images IA and IB by matching image intensities and gradients. Two pixels are considered matched if their feature vectors are Euclidean nearest neighbors to each other in a 3D feature space representing image intensity and gradient. Following the approach of Lee and Rosenfeld [15], the analysis is carried out in the illumination coordinate system.

Image Intensity. We will show that by matching image intensities, we can find the corresponding patches which have the same surface slant (the angle between the surface normal and the optical axis) in the two scenes. Formally, we are considering the case when we have found (u*_a, v*_a) such that IB(u_b, v_b) = IA(u*_a, v*_a). Under the assumption of a Lambertian surface:

I = \lambda \rho N \cdot L = \lambda \rho \cos\sigma    (2)

The surface slant σ can be calculated as:

\sigma = \cos^{-1}\!\left(\frac{I}{\lambda\rho}\right)    (3)
When we match the intensities, IA = IB, so IA/(λρ) = IB/(λρ), assuming λρ is a constant for the different objects. However, we also consider the possibility of matching normalized intensities IA/(λρA) = IB/(λρB), assuming the brightest pixel values in the two images are λρA and λρB respectively, and using these gray levels as normalization factors for the images of different objects. Hence:
\sigma_A = \cos^{-1}\!\left(\frac{I_A}{\lambda\rho_A}\right) = \cos^{-1}\!\left(\frac{I_B}{\lambda\rho_B}\right) = \sigma_B    (4)
As σA and σB ∈ [0, π/2], we implicitly match the surface slants in the illumination coordinate system.

Image Gradient. In this paper, we will assume that each surface patch on an object can be approximated by a spherical patch. Generally, we find that for convex patches, this approximation leads to reasonable results. Given the spherical assumption, we will show that by matching image gradients, the corresponding patches will have the same surface tilt (the angle between the projection of the surface normal on the plane perpendicular to the optical axis and one horizontal axis on that plane) in the two scenes. Formally, we are considering the case when we have found (u*_a, v*_a) such that ∇IB(u_b, v_b) = ∇IA(u*_a, v*_a). Then

z = \sqrt{R^2 - x^2 - y^2}, \qquad \frac{\partial z}{\partial x} = -\frac{x}{z}, \qquad \frac{\partial z}{\partial y} = -\frac{y}{z}    (5)

where R is the radius of the approximated spherical patch. The unit normal vector is:

N = \frac{1}{\sqrt{(\partial z/\partial x)^2 + (\partial z/\partial y)^2 + 1}} \begin{bmatrix} \partial z/\partial x \\ \partial z/\partial y \\ -1 \end{bmatrix} = \frac{z}{R} \begin{bmatrix} -x/z \\ -y/z \\ -1 \end{bmatrix} = \frac{1}{R} \begin{bmatrix} -x \\ -y \\ -z \end{bmatrix}    (6)
Let the illumination direction L be parameterized with slant and tilt (σL, τL). Then

I = \frac{\lambda\rho}{R}\left(-x \sin\sigma_L \cos\tau_L - y \sin\sigma_L \sin\tau_L - z \cos\sigma_L\right)    (7)
To transform the original coordinate system to the image plane, under the assumption of weak perspective projection, we apply:

u = f\,\frac{x}{z_0}, \qquad v = f\,\frac{y}{z_0}    (8)

where f is the focal length and z_0 is the constant average depth, which is large compared to the object's dimensions. The first derivatives of the image intensity on the image plane are:

\frac{\partial I}{\partial u} = \frac{z_0}{f}\,\frac{\lambda\rho}{R}\left(-\sin\sigma_L \cos\tau_L + \frac{x}{z}\cos\sigma_L\right), \qquad \frac{\partial I}{\partial v} = \frac{z_0}{f}\,\frac{\lambda\rho}{R}\left(-\sin\sigma_L \sin\tau_L + \frac{y}{z}\cos\sigma_L\right)    (9)
Subsequently, we will show that the surface tilt τ can be computed without the need to recover the unknown light parameters (σL, τL, λ), the other surface parameters ρ and R, the object distance z_0, or the focal length f. Let the transformation from the viewer coordinate system to the illumination coordinate system be:

\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{pmatrix} \cos\sigma_L \cos\tau_L & \cos\sigma_L \sin\tau_L & -\sin\sigma_L \\ -\sin\tau_L & \cos\tau_L & 0 \\ \sin\sigma_L \cos\tau_L & \sin\sigma_L \sin\tau_L & \cos\sigma_L \end{pmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix}    (10)

Let u', v' be the axes of the image plane in the illumination coordinate system; then the image gradient becomes:

\begin{bmatrix} \partial I/\partial u' \\ \partial I/\partial v' \end{bmatrix} = \frac{z_0}{f} \begin{pmatrix} \cos\sigma_L \cos\tau_L & \cos\sigma_L \sin\tau_L \\ -\sin\tau_L & \cos\tau_L \end{pmatrix} \begin{bmatrix} \partial I/\partial x \\ \partial I/\partial y \end{bmatrix}    (11)
(12)
where the surface normal N (nx , ny , nz ) in the illumination coordinate system is obtained by applying the same transformation as in (10) on the surface normal N (nx , ny , nz ) in the viewer coordinate system. By definition, the surface tilt τ in the illumination coordinate system is: τ = tan−1 (ny /nx ) = tan−1 (
∂I ∂I / ) ∂v ∂u
(13)
In this instance, because we can only measure the image gradients in the viewer coordinate system and we do not assume to know the illumination direction, we cannot compute the surface tilt directly. In terms of image gradient matching, if two surface points have same first image intensity derivatives:
∂IA ∂u ∂IA ∂v
=
∂IB ∂u ∂IB ∂v
(14)
as the illumination parameters are assumed to be the same, from (11), the image gradients in the illumination coordinate system are also the same: ∂IA ∂IB ∂u ∂u (15) ∂IA = ∂IB ∂v
Then
∂v
∂IA ∂IA ∂IB ∂IB / = tan−1 / = τB (16) ∂u ∂v ∂u ∂v which means the tilts of the two patches are the same in the illumination (and in fact any other) coordinate system, even if we are unable to compute them directly. τA = tan−1
In summary, two object patches with the same surface normal direction, which is determined by slant σ and tilt τ, are matched by matching image intensities and gradients. Note that the equations are derived in their own illumination coordinate systems. However, as the illumination conditions of IA and IB are the same, the geometric transformation from the illumination coordinates to the viewer coordinates is also the same for the two images. The equality of the surface normal directions is therefore invariant to the coordinate system used.
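As an illustration, this matching and the value transfer of the next section could be prototyped as below. This is a minimal sketch rather than the authors' implementation: the function name, the use of SciPy's kd-tree for the Euclidean nearest-neighbor search, and the choice to normalize only the intensity channel by the brightest pixel are our own assumptions; images are assumed to be grayscale floating-point arrays.

```python
import numpy as np
from scipy.spatial import cKDTree

def relight_by_analogy(I_A, I_Ap, I_B):
    """For each pixel of I_B, find its (intensity, gradient) nearest neighbor
    in I_A and copy the corresponding value from the relit example I_Ap."""
    def features(img):
        gy, gx = np.gradient(img.astype(np.float64))
        # normalize intensity by the brightest pixel (stand-in for lambda*rho)
        return np.stack([img.ravel() / img.max(), gx.ravel(), gy.ravel()], axis=1)

    tree = cKDTree(features(I_A))
    _, idx = tree.query(features(I_B))            # Euclidean nearest neighbors
    return I_Ap.ravel()[idx].reshape(I_B.shape)   # point-wise intensity transfer
```

The same index map can be reused to transfer gradients instead of intensities, which is the input to the Poisson reconstruction of section 3.4.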
3.3 Synthesis
By assigning the intensity value I′A(u*_a, v*_a) to I′B(u_b, v_b),

I'_B = \lambda' \rho_A N(\sigma_A, \tau_A) \cdot L'    (17)

By Eqs. (4) and (16), if the surface albedos are equal, ρA = ρB, it follows that:

I'_B = \lambda' \rho_B N(\sigma_B, \tau_B) \cdot L'    (18)

So we prove that the analogy-based approach leads to physics-based results.
3.4 Poisson Reconstruction
Using the direct similarity matching approach leads to noisy results for more complex objects. Part of this noise is due to pixel and gray-level quantization errors, as well as albedo variation. Additionally, we had also required that all patches on the object be approximated by a spherical patch. While this leads to reasonable results for purely convex objects, the inadequacy of the approximation for more complex objects becomes a source of error. However, if we focus on predominantly convex objects such as human faces, then the challenge is to find an approach to overcome the noise resulting from a minority of patches that are concave. The approach we explore is Poisson reconstruction [16]. Psychology-based research [17] on the human visual system shows that humans barely notice very gradual changes in image intensities; conversely, humans are very sensitive to small but sharp intensity changes that correspond to large gradients in the image. In the initial framework, the desired image I′B is reconstructed by directly copying intensities via the mapped lookup from I′A in a point-wise manner (called intensity transferring), without consideration for the resultant image gradients. Because gradients are first order derivatives and sensitive to noise, small errors in the intensities may lead to large gradient noise, to which humans are sensitive. Alternatively, one may consider a slightly modified approach in which the image gradients of I′A are copied to form a gradient field for I′B. An attempt may then be made to compute image I′B by integration (called gradient transferring), and since the gradients in I′B are now deliberately specified, we can expect there would be less visually-sensitive gradient noise. However, this gradient field for I′B, created via point-by-point lookup from I′A, is very unlikely to be a 2D-consistent gradient field that is integrable. A method to circumvent this
problem is to find an optimal image such that its derived (and hence integrable) gradient field is as close as possible to the target gradient field generated from the lookup via I′A. It turns out that a solution may be found by solving a Poisson equation. Details are provided below. The gradient of intensity I is defined by its two first partial derivatives:

(G_u, G_v) = \nabla I = \left(\frac{\partial I}{\partial u}, \frac{\partial I}{\partial v}\right)    (19)
where G_u and G_v are the gradients in the u and v directions respectively. We can generate the relit image via an indirect lookup operation I′B(u, v) = I′A(u*, v*). Similarly, a 2D gradient field G′B can also be obtained by assigning the corresponding gradient from G′A, formally:

G'_B(u, v) = G'_A(u^*, v^*)    (20)

where G′A is the gradient field of image I′A. Then we can reconstruct an image from this gradient field G via integration. However, as G has been constructed in a point-wise manner, in general it cannot be integrated back into a 2D scalar field that is the image. Hence the goal is to recover a 2D scalar function I such that the gradient of I, ∇I, best approximates G in a least-squares manner:

I^* = \operatorname*{argmin}_I \iint \|\nabla I - G\|^2 \, du\, dv    (21)
According to the Euler-Lagrange equation, the optimal I* must satisfy the Poisson equation:

\nabla^2 I = \nabla \cdot G    (22)

It can be rapidly solved by the Fourier cosine transform as:

I(u, v) = \operatorname{ICT}(\hat{I}_{p,q}) = \operatorname{ICT}\!\left(\frac{\operatorname{CT}\!\left(\frac{\partial G_u}{\partial u} + \frac{\partial G_v}{\partial v}\right)}{2\left(\cos\frac{\pi p}{W} + \cos\frac{\pi q}{H} - 2\right)}\right)    (23)

where CT and ICT are the cosine transform and inverse cosine transform respectively, applied to the image of width W and height H, and p, q are the spatial frequencies in the u and v directions respectively. We demonstrate that the gradient field computed via analogy can be used to reconstruct visually-pleasing results with the standard Poisson solution, as images Bp R in Fig. 3. However, the raw intensities computed via analogy can be added as a data term in the screened Poisson solution [18]. We can get a better solution by keeping the result close to both the raw intensities and the gradient field:

I^* = \operatorname*{argmin}_I \iint \lambda (I - D)^2 + \|\nabla I - G\|^2 \, du\, dv    (24)

where λ is the trade-off coefficient and D is the data function, in our case the raw intensities obtained via analogy. The optimal I* must satisfy:

\lambda I - \nabla^2 I = \lambda D - \nabla \cdot G    (25)
Similar to (23), the Fourier domain solution is:

I(u, v) = \operatorname{ICT}(\hat{I}_{p,q})    (26)

where

\hat{I}_{p,q} = \frac{\operatorname{CT}\!\left(\lambda D - \left(\frac{\partial G_u}{\partial u} + \frac{\partial G_v}{\partial v}\right)\right)}{\lambda - 2\left(\cos\frac{\pi p}{W} + \cos\frac{\pi q}{H} - 2\right)}    (27)
and let

\hat{I}_{0,0} = \hat{D}_{0,0}    (28)
The problem of the unknown offset in the standard Poisson solution is thereby resolved even with a very small positive λ; in our experiments, λ is usually set to 0.01.
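The screened Poisson solution of Eqs. (25)-(27) can be prototyped with a discrete cosine transform. The sketch below is a simplified illustration under our own assumptions: the function name, the backward-difference divergence, and the use of SciPy's DCT routines are not taken from the paper, and the finite-difference convention must match the one used to build the gradient field G.

```python
import numpy as np
from scipy.fft import dctn, idctn

def screened_poisson(D, Gu, Gv, lam=0.01):
    """Reconstruct I from data D and target gradients (Gu, Gv), Eq. (25)."""
    H, W = D.shape
    # divergence of G via backward differences (adjoint of a forward gradient)
    div = np.zeros_like(D, dtype=np.float64)
    div[:, 0] += Gu[:, 0]
    div[:, 1:] += Gu[:, 1:] - Gu[:, :-1]
    div[0, :] += Gv[0, :]
    div[1:, :] += Gv[1:, :] - Gv[:-1, :]
    rhs_hat = dctn(lam * D - div, norm='ortho')      # CT(lam*D - div G)
    p, q = np.meshgrid(np.arange(W), np.arange(H))
    denom = lam - 2.0 * (np.cos(np.pi * p / W) + np.cos(np.pi * q / H) - 2.0)
    return idctn(rhs_hat / denom, norm='ortho')      # Eq. (27)
```

For λ → 0 the denominator at (p, q) = (0, 0) vanishes, which is exactly the unknown-offset problem of the standard Poisson solution; a small positive λ pins the DC term to that of D, as in Eq. (28).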
4 Experiments
4.1 Synthetic Images
Our framework is initially demonstrated on a few simple shapes: a sphere, a cylinder and an ellipsoid in Fig. 2(a), (b) and (c) respectively, and then on a 3D face model in Fig. 2(d). In each experiment, we use the spheres as the analogy exemplars and convert one illumination condition to another through analogy-based relighting. The results show that the analogy-based approach can relight ideal objects well enough that the average difference between the physics-based lighting results and the example-based relighting results is under 5 intensity levels out of 255, with only very few noticeable artifacts. The errors become larger in the case of the 3D face model, as it contains patches far from spherical in shape.
4.2 Real Images
We also tested the framework on real images from the Yale Face Database B and the Extended Yale Face Database B [1]. As the real images contain texture and noise, surface patches with the same surface slants and tilts will not always be correctly matched, due to deviations from our framework, which assumes constant albedo. As can be seen from the image Bp in Fig. 3, the direct result is neither smooth nor visually pleasing. As mentioned previously in section 3.4, this is the result of noise due to concave patches, quantization errors and albedo variation. By applying the standard Poisson solution, we can generate a more realistic result, shown as image Bp R. The screened Poisson solution further improves the result: with respect to the real image Bp-real, the mean intensity error of image Bp SR is reduced even compared to the mean error of image Bp R with offset, which applies an optimal offset to image Bp R. We can see there are some differences between the synthesized results and the real images. However, the synthesized results can still be considered as reasonable renderings which reflect the change of the illumination condition. The final result Bp SR histeq to B is produced by making the histogram of image Bp SR match the histogram of image B to overcome unequal albedos.
Fig. 2. Relighting synthetic images with (a) sphere example: (b) sphere; (c) cylinder; (d) ellipsoid; (e) face model. Images A and B are under the same illumination condition. Image Ap is the reference image with the alternate illumination condition, under which image Bp is generated through the analogy-based approach, while image Bp-real is synthesized using a physics-based approach. The difference between Bp and Bp-real is shown on the right.
4.3 Computational Efficiency
As in many other computer vision algorithms, the most computationally expensive part of this method is the nearest neighbor search in high-dimensional space. With a minor loss in accuracy, approximate algorithms have been developed to provide large speedups. Recently, Muja and Lowe [19] set up a framework to automatically determine the best algorithm between multiple randomized kd-trees and a hierarchical k-means tree, as well as the best parameter values. Tree-based methods are efficient in our case, as they work well on relatively low-dimensional data and we only need the intensity and gradient of every single pixel as a feature vector.
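For the 3D feature space used here, an off-the-shelf kd-tree already gives a large speedup over brute-force search, and an approximate query trades a bounded error for additional speed. The snippet below is only an illustration of that trade-off with SciPy; it is not the FLANN-based configuration of [19], and the random feature arrays are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

feats_A = np.random.rand(150 * 150, 3)   # (intensity, du, dv) per pixel of I_A
feats_B = np.random.rand(150 * 150, 3)   # features of the input image I_B

tree = cKDTree(feats_A)
# eps > 0 allows approximate neighbors: the returned point is guaranteed to be
# within a factor (1 + eps) of the true nearest-neighbor distance
dist, idx = tree.query(feats_B, k=1, eps=0.1)
```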
Fig. 3. Relighting real images with Poisson solutions. From left to right, the second row shows the intensity transfer result; the standard Poisson result from gradient transfer (equivalent to set λ = 0 in screened Poisson solution); the standard Poisson result from gradient transfer with optimal offset w.r.t. real lighting result Bp -real; the screened Poisson result from intensity and gradient transfer (set λ = 0.01); and the screened Poisson result after it is histogram-matched to image B. The third row shows the differences between the results in the second row and the real lighting result. The numbers indicate the average intensity errors. More discussion can be found in section 4.2.
The current speed of our mixed Matlab and C implementation is around 1s for 150 × 150 images on a workstation with a 2.66GHz Intel Core2 Processor.
5 Conclusion and Future Work
As geometry-based relighting methods require a 3D representation of the scene and traditional image-based relighting approaches need a laboratory setup for each scene, we explored the possibility of an example-based relighting framework. Our initial investigation of the analogy framework with the assumptions of a Lambertian model, uniform albedo and spherical patches has shown it to be feasible, both theoretically and experimentally. Our approach finds the best reference patches from the example image for the target image. Both image intensities and image gradients have theoretically been shown to be useful for finding the correspondence in the context of analogy-based relighting. By applying Poisson-derived methods, problems in relighting real images are alleviated. While the initial approach is capable of handling predominantly convex surfaces, it would be helpful to find a relation between the relighting error and the deviation from a spherical patch for relighting more general scenes. We also plan to improve the framework to work in the presence of shadows and specular
highlights. Besides exploring physically correct solutions, perceptually appropriate relighting is of interest, too.
References
1. Georghiades, A.S., Belhumeur, P.N., Kriegman, D.J.: From few to many: Illumination cone models for face recognition under variable lighting and pose. TPAMI 23, 643–660 (2001)
2. Adini, Y., Moses, Y., Ullman, S.: Face recognition: The problem of compensating for changes in illumination direction. TPAMI 19, 721–732 (1997)
3. Pighin, F.H., Szeliski, R., Salesin, D.: Resynthesizing facial animation through 3D model-based tracking. In: ICCV, Kerkyra, Greece, pp. 143–150 (1999)
4. Basso, A., Graf, H.P., Gibbon, D.C., Cosatto, E., Liu, S.: Virtual light: digitally-generated lighting for video conferencing applications. In: ICIP, Thessaloniki, Greece, pp. 1085–1088 (2001)
5. Wen, Z., Liu, Z., Huang, T.S.: Face relighting with radiance environment maps. In: CVPR, Madison, WI, USA, vol. 2, pp. 158–165 (2003)
6. Fuchs, M., Blanz, V., Lensch, H.P.A., Seidel, H.P.: Reflectance from images: A model-based approach for human faces. TVCG 11, 296–305 (2005)
7. Okabe, M., Zeng, G., Matsushita, Y., Igarashi, T., Quan, L., Shum, H.Y.: Single-view relighting with normal map painting. In: Pacific Graphics, Taipei, Taiwan, pp. 27–34 (2006)
8. Fuchs, M., Blanz, V., Seidel, H.P.: Bayesian relighting. In: EGSR, Konstanz, Germany, pp. 157–164 (2005)
9. Debevec, P.E., Hawkins, T., Tchou, C., Duiker, H.P., Sarokin, W., Sagar, M.: Acquiring the reflectance field of a human face. In: SIGGRAPH, New Orleans, LA, USA, pp. 145–156 (2000)
10. Masselus, V., Peers, P., Dutré, P., Willems, Y.D.: Smooth reconstruction and compact representation of reflectance functions for image-based relighting. In: EGSR, Norrköping, Sweden, pp. 287–298 (2004)
11. Shashua, A., Riklin-Raviv, T.: The quotient image: Class-based re-rendering and recognition with varying illuminations. TPAMI 23, 129–139 (2001)
12. Zhang, L., Wang, S., Samaras, D.: Face synthesis and recognition from a single image under arbitrary unknown lighting using a spherical harmonic basis morphable model. In: CVPR, San Diego, CA, USA, vol. 2, pp. 209–216 (2005)
13. Peers, P., Tamura, N., Matusik, W., Debevec, P.E.: Post-production facial performance relighting using reflectance transfer. TOG 26, 52 (2007)
14. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.: Image analogies. In: SIGGRAPH, Los Angeles, CA, USA, pp. 327–340 (2001)
15. Lee, C.H., Rosenfeld, A.: Improved methods of estimating shape from shading using the light source coordinate system. Artificial Intelligence 26, 125–143 (1985)
16. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. TPAMI 10, 439–451 (1988)
17. Land, E.H., McCann, J.J.: Lightness and retinex theory. JOSA 61, 1–11 (1971)
18. Bhat, P., Curless, B., Cohen, M.F., Zitnick, C.L.: Fourier analysis of the 2D screened Poisson equation for gradient domain problems. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 114–128. Springer, Heidelberg (2008)
19. Muja, M., Lowe, D.G.: Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP, Lisbon, Portugal, pp. 331–340 (2009)
Generating EPI Representations of 4D Light Fields with a Single Lens Focused Plenoptic Camera
Sven Wanner, Janis Fehr, and Bernd Jähne
Heidelberg Collaboratory for Image Processing (HCI)
[email protected]

Abstract. In this paper, we present a novel method for the generation of Epipolar-Image (EPI) representations of 4D Light Fields (LF) from raw data captured by a single lens “Focused Plenoptic Camera”. Compared to other LF representations which are usually used in the context of computational photography with Plenoptic Cameras, the EPI representation is more suitable for image analysis tasks - providing direct access to scene geometry and reflectance properties. The generation of EPIs requires a set of “all in focus” (full depth of field) images from different views of a scene. Hence, the main contribution of this paper is a novel algorithm for the rendering of such images from a single raw image captured with a Focused Plenoptic Camera. The main advantage of the proposed approach over existing full depth of field methods is that it is able to cope with non-Lambertian reflectance in the scene.
1 Introduction
The Focused Plenoptic Camera [1,2] (also known as Plenoptic Camera 2.0) has recently drawn a lot of attention in the field of computational photography. Sampling the 4D Light Field (see eqn. 2) of a scene with a single shot through a single lens, the plenoptic hardware design allows one to recompute standard projective 2D images under variation of major optical parameters like focal length [3], aperture [1] or the point of view [3], as well as depth estimates [4]. With the availability of first commercial camera systems [5], plenoptic imaging has also become interesting for image analysis and computer vision tasks. Since Light Fields (see eqn. 1) contain full scene information, ranging from texture and 3D geometry to surface reflectance properties [6], Plenoptic Cameras provide a rich data source for many applications like 3D scene reconstruction, object recognition or surface inspection. A Light Field l [6] is a 7D ray-based representation of a synthetic or real world scene. For a spatial point (vx, vy, vz) ∈ R³ in a given scene, the time-dependent Light Field representation parameterizes all light rays with their angular direction (φx, φy) and their wavelength λ:

l(v_x, v_y, v_z, \phi_x, \phi_y, \lambda, t)    (1)
Fig. 1. EPI of a 4D LF. Top: 2D projection image from a single view point as given by the (vx, vy)-subspace; color lines indicate the locations of the (vx, φx) views depicted in the bottom: the local depth is bound to the slope of the EPI line structures, whereas the reflectance is given by the values on the line structure.

In practice, it is very hard to record a 7D Light Field (LF) of a real world scene. However, using Plenoptic Cameras or camera arrays [7], capturing a 4D sub-space of the Light Field becomes feasible. The following limitations have to be accepted: only static scenes can be captured with a single shot (neglecting t), the optical spectrum is reduced to a simple “gray-scale” representation (neglecting λ), and the Light Field is not recorded for all 3D positions of a scene - instead it is “observed” from a 2D camera plane outside the visual hull of the scene. Since Focused Plenoptic Cameras capture 4D LFs in the form of tiled 2D microimages (see section 2), most related publications in the area of computational photography (like [1]) use a tiled 2D representation of the 4D LF, where each tile contains local spatial information of the corresponding microlens (see section 2.2). While this representation is suitable for algorithms in the area of computational photography, e.g. computational refocusing, it does not provide easy access to other valuable information which is inherent within the 4D Light Field: namely the 3D structure (depth) and surface reflection properties of the scene. When it comes to image (scene) analysis and 3D reconstruction, the so-called Epipolar-Image (EPI) representation [8] appears to be more advantageous. The EPI representation parameterizes a 4D Light Field with the intensity of light rays which intersect the camera plane at positions (vx, vy) at the angles (φx, φy ∈ [0, π]):

l(v_x, v_y, \phi_x, \phi_y)    (2)
The main advantage of this formulation is that the EPI structure not only holds 2D projective views from different view points (see the 2D (vx, vy)-subspace in fig. 1), but also encodes the depth in the form of the slope of the line structures [6] in the 2D (φx, vx)- and (φy, vy)-subspaces. Additionally, the local surface reflectance is bound to the (φx, φy)-subspace [6]. Beyond the area of plenoptic imaging, there has already been a wide range of publications which take advantage of the EPI representation, e.g. for the analysis of camera properties [8], depth estimation [9], reconstruction of surface reflectance [10], object recognition [11] or image based rendering [6]. Additionally, the EPI representation still allows a direct application of operations common in computational photography, such as modification of focus and aperture [8].
Contribution: The contribution of this paper is twofold. First, our main contribution is the presentation of a novel algorithm for the robust computation of full depth of field views (see section 3) which are essential for the generation of EPIs from camera raw data. Second, to the best of our knowledge, we present the first approach for such a generation of EPIs from a Focused Plenoptic Camera, which allows the direct application of a wide range of computer vision algorithms to plenoptic imaging.
2 The Focused Plenoptic Camera
Plenoptic Cameras [12] follow the basic design principles of projective cameras, with the standard optical arrangement of a main lens, aperture and a 2D sensor. The only difference is an additional optical element: a microlens array positioned directly in front of the sensor. This allows the sampling of four Light Field dimensions (see eqn. 2) of a scene captured by the main lens, instead of projecting the LF onto two sensor dimensions like in a projective camera system. The position of the lens array directly affects the sampling properties of the Plenoptic Camera, which is always a trade-off between angular and spatial resolution. In the Focused Plenoptic Camera [1,2], the microlenses are positioned in such a way that their focal plane is set onto the image plane of the main lens (see figure 2). This leads to higher spatial but lower angular resolutions compared to other arrangements (like in [12]). Within this setup, each microlens can be considered as a single camera, projecting a small portion of the image plane onto the sensor - the so-called microimage. The field of view of each microlens depends on the local depth of the scene in its corresponding region of the image plane (see figures 2 and 3). This causes a multiple projection of scene elements which are out of focus of the main lens. We call these effects plenoptic artifacts (see figure 3).
2.1 Generating EPI Representations from Plenoptic Raw Data
Basically, the generation of an EPI representation from a sampled 4D Light Field is simple - at least compared to camera arrays [7], where the projective transformations of the views of the individual cameras have to be rectified and unified into one epipolar coordinate system requiring a precise calibration of all cameras. Due to the optical properties of the microlenses, with the image plane of the main lens defining the epipolar coordinate system, these projective transformations are reduced to simple translations [1], which are given by an offset (see section 2.4). Hence, one simply has to rearrange the viewpoint-dependent rendered views into the 4D EPI representation (see eqn. 2). However, the necessarily small depth of field of the microlenses (see section 2.3) causes other problems: For most algorithms, the EPI structure can only be effectively evaluated in areas with high-frequency textures - which of course is only possible for parts of a scene which are in focus. Additionally, plenoptic cameras suffer from imaging artifacts in out-of-focus areas. Hence, in order to
Fig. 2. 2D sketch of the optical setup of the Focused Plenoptic Camera. Left: Imaging of an object depends on the image plane. A scene point on the image plane (IP) of the main lens (ML) has its virtual image in the focal plane of the microlenses and is projected by a single microlens. Points which are not on the image plane of the ML are projected to several microlens images. Right: The resulting microlens images for the same setup.
generate EPIs which can be used to analyze the entire scene at once, we have to generate the EPIs from “all-in-focus” (i.e. full depth of field) views.

The EPI Generation Pipeline
1) View rendering: Rendering of full depth-of-field images (see section 2.3) for different view points (see section 2.4) of the scene.
2) View merging: Merging of the corresponding views (see section 4.1) of different lens types. This step is only necessary for cameras with several microlens types, like the camera used in our experiments [5].
3) View stacking: After the merging process, a single set of rendered views remains. These have to be arranged in a 4D volume according to their view angles, resulting in the EPI structure l(vx, vy, φx, φy); a sketch of this stacking step is given below.
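The stacking step itself is mostly bookkeeping. The following sketch arranges a set of merged views into a 4D NumPy array and shows how a 2D EPI slice is read out; the axis ordering, the function name, and the dictionary-based input are our own choices and not part of the original pipeline.

```python
import numpy as np

def stack_views(views):
    """Arrange rendered views into the 4D EPI volume l(v_x, v_y, phi_x, phi_y).
    views: dict mapping a view-angle index (ax, ay) -> 2D image of shape (h, w)."""
    nax = 1 + max(ax for ax, _ in views)
    nay = 1 + max(ay for _, ay in views)
    h, w = next(iter(views.values())).shape
    lf = np.zeros((nay, h, nax, w))              # axes: phi_y, v_y, phi_x, v_x
    for (ax, ay), img in views.items():
        lf[ay, :, ax, :] = img
    return lf

# A horizontal EPI as in figure 1: fix phi_y and v_y, keep (phi_x, v_x);
# the slope of its line structures encodes the local depth.
# epi = stack_views(views)[ay0, y0, :, :]
```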
2.2 Rendering Images from Raw Camera Data
The rendering process requires a one-time, scene-independent calibration, which extracts the positions of all microlens images as well as their diameter dML. In this paper, we use a commercially available camera (see section 4), which has a microlens array where the lenses are arranged in a hexagonal pattern. Due to this layout, we also use a hexagonal shape for the microimages and address them with coordinates (i, j). We define an image patch pij as a microimage or a subset of it. Projective views are rendered by tiling these patches together [1,2]. The center of a microimage (i, j), determined in the coordinate system given by the initial camera calibration process, is denoted by cij. The corresponding patch images are defined as ωij(δ, o), where δ denotes the size of the microimage patch pij(δ, o) in pixels and o is the offset of the microimage patch center from
Fig. 3. Left: Microimages and their centers cij are indicated as well as the resulting image patches pij. Right: Raw image from a Focused Plenoptic Camera. Frame I shows an area of the scene imaged at the image plane of the main lens. Frames II and III select areas off the main image plane, which leads to multiple occurrences (plenoptic artifacts) of scene details.
cij. We define ωij(δ, o) as an m×n matrix which is zero except for the positions of the pixels of the corresponding microimage patch pij(δ, o):

\omega_{ij}(\delta, o) = \begin{pmatrix} 0 & \dots & 0 \\ \vdots & p_{ij}(\delta, o) & \vdots \\ 0 & \dots & 0 \end{pmatrix}, \qquad \omega \in Mat(m \times n),

where m × n is the rendered image resolution and (i, j) is the index of a specific image patch, imaged from microlens (i, j) (see figure 3a). A projective view Ω(δ, o) of a scene is then rendered as:

\Omega(\delta, o) = \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} \omega_{ij}(\delta, o), \qquad \delta \in \mathbb{N},\; o \in \mathbb{N} \times \mathbb{N},    (3)
where (Nx, Ny) is the number of microlenses on the sensor in the x- and y-directions. The choice of the parameters δ and o directly controls the image plane and point of view of the rendered view.
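The tiling of Eq. (3) can be sketched as follows. This is an illustration only: it assumes a simple rectangular patch grid instead of the hexagonal layout of the actual camera, grayscale raw data, and calibrated microimage centers supplied as a dictionary; the function name and data layout are our own.

```python
import numpy as np

def render_view(raw, centers, delta, offset):
    """Tile microimage patches p_ij(delta, o) into a projective view, Eq. (3).
    raw:     plenoptic raw image (2D array)
    centers: dict mapping lens index (i, j) -> microimage center (cy, cx)
    delta:   patch size in pixels (selects the rendered image plane)
    offset:  (oy, ox) shift of the patch center inside each microimage
             (selects the rendered view point)"""
    r = delta // 2
    oy, ox = offset
    ni = 1 + max(i for i, _ in centers)
    nj = 1 + max(j for _, j in centers)
    view = np.zeros((ni * delta, nj * delta), dtype=raw.dtype)
    for (i, j), (cy, cx) in centers.items():
        y, x = int(round(cy + oy)), int(round(cx + ox))
        patch = raw[y - r:y - r + delta, x - r:x - r + delta]
        view[i * delta:(i + 1) * delta, j * delta:(j + 1) * delta] = patch
    return view
```

Rendering a constant output resolution for all δ, as described in section 2.3, would additionally rescale each patch to a fixed size before tiling.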
2.3 Variation of the Image Plane
Figure 4 illustrates the effect of the patch size δ on the focal length of the rendered image: Scene points on the image plane of the main lens are imaged without plenoptic artifacts (see figure 2), thus the full size of a microimage can
be used. Scene points off the main image plane create plenoptic artifacts; therefore the patch size δ has to be decreased to achieve artifact-free texture in the affected regions of the scene (see figure 4 left). We denote a rendered image using patch size δm by Ωm. As the resolution of Ω directly depends on the choice of δ, we enforce a constant output resolution for all patches by upscaling each pij(δ, o) via bilinear interpolation to a fixed size before tiling the patches together. A super-resolution approach like [13] for better rescaling results is also possible.
2.4 Variation of the View Point
Figure 4 illustrates how the view point of a rendered projection can be changed by the translation o of the patches pij (δ, o) within the microimages. Recalling that a patch is always a subset of a microimage with its size δ depending on the image plane, makes clear that the number of possible view points is directly bound by the choice of the image plane: i.e. there is a direct trade-off between angular and spatial resolution. Parts of the scene which have been captured in focus of the main lens can only be rendered from one point of view. Hence, the focal plane of the main lens is defining the epipolar coordinate system of the Focused Plenoptic Camera.
Fig. 4. Top: Rendering of different image planes and views, illustrated on the basis of the 1D scene depicted in figure 2. Bottom: Refocusing and changes of view point in two dimensions, illustrated on the basis of an arbitrary example.
3 Full Depth of Field
The computation of full depth of field images from a series of views with different image planes usually requires depth information of the given scene: [14] and [4] applied a depth estimation based on cross-correlation. The main disadvantage of this approach is that one would have to solve a major problem, namely the depth estimation for non-Lambertian scenes, in order to generate the EPI representation, which is intended to be used to solve that problem in the first place - a classic chicken-and-egg problem. To overcome this dilemma, we propose an alternative approach: As shown in section 2.3, we actually don't need to determine the depth explicitly - all we need are the correct patch sizes δm to ensure a continuous view texturing without plenoptic artifacts. We propose to find the best δm via a local minimization of the gradient magnitude at the patch borders over all possible focal images Ωm. Since the effective patch resolution changes with δm, we have to apply a low-pass filtering to ensure a fair comparison. In practice, this is achieved by downscaling each patch to the smallest size δmin, using bilinear interpolation. We denote the band-pass filtered focal images by Ω̄m. Assuming a set of patch sizes δ = [δ0, . . . , δm, . . . , δM], we render a set of border images using a Laplacian filter (see figure 5):
\Gamma = \left[\nabla^2\bar{\Omega}_0, \dots, \nabla^2\bar{\Omega}_m, \dots, \nabla^2\bar{\Omega}_M\right], \qquad \nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}    (4)
From Γ we determine the gradients for each hexagon patch by integrating along its borders b(pij), considering only gradient directions orthogonal to the edges of the patch (see figure 5). The norm of the gradients orthogonal to the border of each microimage patch pij and each image plane m is computed as:
Fig. 5. Three examples of the border image set (see Eq. 4) and of the gradient magnitude set (Eq. 5) with different patch sizes δ are depicted. The example in the center shows the correct focal length.
\Sigma(m, i, j) = \sum_{b(p_{ij})} \left\| n_b \cdot \nabla\Gamma_m \right\|,    (5)
Here nb denotes the normal vector of each hexagon border b (compare figure 3a). Further, we define the lens-specific image plane map z(i, j) as the minimizer of Σ over m for each microlens image (i, j):

z(i, j) = \operatorname*{argmin}_m \Sigma(m, i, j)    (6)
The image plane map z (i, j) has a resolution of (Ny , Nx ) (number of microlenses) and encodes the patch size value δij for each microimage (i, j). Using z (i, j), we render full depth of field views Ω (z). This approach works nicely for all textured regions of a scene. Evaluating the standard deviation of Σ for each (i, j) can serve to further improve z (i, j): microimages without (or very little) texture are characterized by a small standard deviation in Σ. We use a threshold to replace the affected z (i, j) with the maximum patch size δmax . This is a valid approach since the patch size (i.e. the focal length) does not matter for untextured regions. Additionally, we apply a gentle median filter to remove outliers from z (i, j).
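Once the border energies Σ(m, i, j) of Eq. (5) are available, the selection of the image plane map reduces to a per-lens argmin plus the texture fallback described above. The sketch below assumes the energies are precomputed into a NumPy array and that the candidate patch sizes are sorted in ascending order; the function name, the threshold handling, and the array layout are our own assumptions.

```python
import numpy as np

def image_plane_map(border_energy, delta_values, std_threshold):
    """Select a patch size per microlens (Eq. 6).
    border_energy: array of shape (M, Ny, Nx) holding Sigma(m, i, j)
    delta_values:  candidate patch sizes delta_0 ... delta_M (ascending)
    Returns an (Ny, Nx) map of patch sizes delta_ij."""
    z = np.argmin(border_energy, axis=0)                 # Eq. (6)
    # untextured microimages: Sigma varies little over m, so fall back to
    # the largest patch size (the image plane does not matter there)
    flat = border_energy.std(axis=0) < std_threshold
    z[flat] = len(delta_values) - 1
    # a gentle median filter on z could be applied here to remove outliers
    return np.asarray(delta_values)[z]
```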
4 Experiments
Setup: For the experimental evaluation, we use a commercially available Focused Plenoptic Camera (the R11 by the camera manufacturer Raytrix GmbH [5]). The camera captures raw images with a resolution of 10 Megapixels and is equipped with an array of roughly 11000 microlenses. The effective microimage diameter is 23 pixels. The array holds three types of lenses with different focal lengths, nested in a 3 × 65 × 57 hexagon layout, which leads to an effective maximum resolution of 1495 × 1311 pixels for rendered projective views at the focal length of the main lens. Due to this setup with different microlens types, we compute the full depth of field view for each lens type independently and then apply a merging algorithm.
4.1 Merging Views from Different Lens Types
The full depth-of-field views for each lens type have the same angular distribution (if the same offsets oq have been used), but are translated relative to each other. We neglect that these translations are not completely independent of the depth of the scene; due to the very small offset (baseline), these effects are in the order of sub-pixel fractions. The results shown in figures 6, 7 and 8 are merged by determining the relative shifts Tn via normalized cross-correlation and averaging over the views with the same offset:

\Omega_{merged}(z, o_q) = \frac{1}{3} \sum_{n=1}^{3} T_n \Omega_n(z, o_q), \qquad T_n \in \mathbb{N} \times \mathbb{N}
Due to the fact that each lens type has an individual focal length, the sharpness of the results can be improved by a weighted averaging depending on the optimal focal range of each lens type and the information from the focal maps z(i, j):

\Omega_{merged}(z, o_q) = \sum_{n=1}^{3} \alpha_n(z)\, T_n \Omega_n(z, o_q), \qquad \alpha_n \in \mathbb{R} \quad \text{and} \quad \sum_{n=1}^{3} \alpha_n = 1
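The unweighted merge can be prototyped as below. For brevity, plain FFT-based cross-correlation stands in for the normalized cross-correlation used in the paper, integer shifts are assumed, and the function name and data layout are our own; the weighted variant would only replace the final average by the α_n(z)-weighted sum.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def merge_lens_types(views):
    """Align the per-lens-type views to the first one and average them."""
    ref = np.asarray(views[0], dtype=np.float64)
    merged = np.zeros_like(ref)
    for v in views:
        v = np.asarray(v, dtype=np.float64)
        # circular cross-correlation; its peak gives the translation T_n
        corr = np.fft.ifft2(np.fft.fft2(ref) * np.conj(np.fft.fft2(v))).real
        peak = np.unravel_index(np.argmax(corr), corr.shape)
        t = [p - n if p > n // 2 else p for p, n in zip(peak, corr.shape)]
        merged += nd_shift(v, t, order=1)
    return merged / len(views)
```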
4.2 Results
A qualitative evaluation is shown in figures 6, 7 and 8: We compare the results of our proposed algorithms with the output of commercial software from the camera vendor, which computes the full depth of field projective views via an explicit depth estimation based on stereo matching on the camera raw data [5]. We present the raw output of both methods with only a little generic filtering. It should be noted that the results of the depth estimates are not directly comparable - the emphasis of our qualitative evaluation lies on the full depth of scene reconstruction shown in figure 6.B.

Figures 6.A, 7.A and 8.A: Estimation of the focal length vs. explicit depth estimation. The left side shows a typically dense image plane map z(i, j) (Eq. 6), computed with our algorithm. On the right, for comparison, a depth map resulting from sparse stereo-matching is shown. The comparison indicates that our algorithm yields dense high quality focal maps computed with relatively little effort, which also works for non-Lambertian and sparsely textured scenes.

Figure 6.B: Evaluation of the full depth of field rendering. The left side shows results of our algorithm. Results on the right are based on the stereo depth map from above, using generic filters to improve smoothness. We show sample parts of the scene exemplary for regions where our algorithm achieves better results than the stereo matching approach.

Figures 6.C, 7.B and 8.B: EPI sub-spaces: These examples show various views of 2D sub-spaces from the generated 4D Light Field EPI representations. The larger images show 2D projection images from a single view point as given by the (vx, vy)-subspace; vertical color lines indicate locations of the (vx, φx) views depicted at the bottom, and horizontal color lines indicate locations of the (vy, φy) views shown to the right: the depth of the scene is bound to the slope of the EPI line structures, whereas the reflectance is given by the values on the line structure. The results show that our approach correctly generates the typical EPI structures.

Software and Data: We provide an open-source implementation of the presented algorithms along with raw data and further results at: http://hci.iwr.uni-heidelberg.de/HCI/Research/LightField/
Fig. 6. Results (panels 6.A–6.C), as described in Section 4.2.
Fig. 7. Results (panels 7.A–7.B), as described in Section 4.2.
Fig. 8. Results (panels 8.A–8.B), as described in Section 4.2.
Acknowledgements. We would like to thank Raytrix GmbH for providing the raw data for the experimental evaluation.
References
1. Lumsdaine, A., Georgiev, T.: The focused plenoptic camera. In: Proc. IEEE ICCP, pp. 1–8 (2009)
2. Georgiev, T., Lumsdaine, A.: Focused plenoptic camera and rendering. Journal of Electronic Imaging 19, 021106 (2010)
3. Ng, R., Levoy, M., Brédif, M., Duval, G., Horowitz, M., Hanrahan, P.: Light field photography with a hand-held plenoptic camera. Computer Science Technical Report CSTR 2 (2005)
4. Bishop, T., Favaro, P.: Full-resolution depth map estimation from an aliased plenoptic light field. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part II. LNCS, vol. 6493, pp. 186–200. Springer, Heidelberg (2011)
5. Perwass, C.: The next generation of photography. White Paper, www.raytrix.de
6. Zhang, C., Chen, T.: Light Field Sampling. Morgan & Claypool Publishers, San Francisco (2005)
7. Wilburn, B., Joshi, N., Vaish, V., Talvala, E.V., Antunez, E., Barth, A., Adams, A., Horowitz, M., Levoy, M.: High performance imaging using large camera arrays. ACM Trans. Graph. 24, 765–776 (2005)
8. Levin, A., Freeman, W.T., Durand, F.: Understanding camera trade-offs through a bayesian analysis of light field projections. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 88–101. Springer, Heidelberg (2008)
9. Ziegler, R., Bucheli, S., Ahrenberg, L., Magnor, M., Gross, M.: A bidirectional light field-hologram transform. In: Computer Graphics Forum, vol. 26, pp. 435–446. Wiley Online Library, Chichester (2007)
10. Criminisi, A., Kang, S.B., Swaminathan, R., Szeliski, R., Anandan, P.: Extracting layers and analyzing their specular properties using epipolar-plane-image analysis. Comput. Vis. Image Underst. 97, 51–85 (2005)
11. Xu, G., Zhang, Z.: Epipolar geometry in stereo, motion and object recognition (1996)
12. Adelson, E.H., Wang, J.Y.A.: Single lens stereo with a plenoptic camera. IEEE Transactions on Pattern Analysis and Machine Intelligence 14, 99–106 (1992)
13. Georgiev, T., Lumsdaine, A.: Reducing plenoptic camera artifacts. In: Computer Graphics Forum. Wiley Online Library, Chichester (2010)
14. Bishop, T., Favaro, P.: Plenoptic depth estimation from multiple aliased views. In: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1622–1629. IEEE, Los Alamitos (2009)
MethMorph: Simulating Facial Deformation Due to Methamphetamine Usage Mahsa Kamali, Forrest N. Iandola, Hui Fang, and John C. Hart University of Illinois at Urbana-Champaign {mkamali2,iandola1,jch}@illinois.edu,
[email protected]
Abstract. We present MethMorph, a system for producing realistic simulations of how drug-free people would look if they used methamphetamine. Significant weight loss and facial lesions are common side effects of meth usage. MethMorph fully automates the process of thinning the face and applying lesions to healthy faces. We combine several recently-developed detection methods such as Viola-Jones based cascades and Lazy Snapping to localize facial features in healthy faces. We use the detected facial features in our method for thinning the face. We then synthesize a new facial texture, which contains lesions and major wrinkles. We apply this texture to the thinned face. We test MethMorph using a database of healthy faces, and we conclude that MethMorph produces realistic meth simulation images.
1 Introduction
Methamphetamine abuse contributed an economic burden of $23.4 billion to the United States in 2005 [1]. The meth epidemic has affected large portions of rural, suburban, and urban America [2]. In 2005, fifty percent of the Montana jail population was incarcerated for meth usage or other meth-related offenses [2]. The Meth Project is one of the world’s largest anti-meth organizations, and its efforts span eight U.S. states. The Meth Project deployed a large-scale marketing campaign designed to discourage meth use among teenagers in Montana [2]. This marketing campaign uses many images of methamphetamine abusers’ faces [3]. As shown in Figure 1, these images illustrate common side effects of meth, such as wrinkles, severe weight loss, and lesions. The Meth Project found that the use of such images in anti-meth marketing campaigns can deter teens from using methamphetamines. While teen methamphetamine usage remained relatively constant across the nation from 2005 to 2007, campaigns by The Meth Project helped to reduce meth usage by forty-five percent over the same time period in Montana [2]. “Meth will make you look different from normal” serves as one of the key messages of The Meth Project [2]. So far, The Meth Project has illustrated this message with photos of meth users’ faces. The leaders of The Meth Project expect that they could further reduce meth usage if it were possible to simulate what a
Currently with Google Research
Fig. 1. Representative before-and-after images of a meth user. Courtesy of Faces of Meth [4].
drug-free individual would look like if he or she became addicted to meth. Toward this goal, we have collaborated with The Meth Project to develop MethMorph, an automated system to produce realistic methamphetamine facial simulations. MethMorph takes images of faces as input and produces “MethMorphed” images as output. Developing an automated system to produce methamphetamine facial simulations presents three key challenges. The first of these challenges is to recognize the overall facial region, as well as the chin line, mouth, eyes, and nostrils in an image. Once the facial features are detected, the next challenge is to warp the shape of the face to depict wrinkles and significant weight loss. The final challenge is to depict lesions by changing the texture and color of the warped face. We address the challenges of facial feature detection, facial deformation, and facial texture modification in Sections 2, 3, and 4, respectively. In Section 6, we conclude that MethMorph produces realistic methamphetamine facial simulations.
2 Facial Feature Detection
Facial feature detection is crucial for enabling a realistic, automatic methamphetamine facial simulation. Within the facial region, we detect the location of the chin line in order to know what portion of the image to deform into a thinner, more wrinkled face. Chin line detection is particularly important for deformation, because it allows us to thin the cheeks without unintentionally thinning the neck. In addition, detecting the locations of the eyes and nostrils allows us to avoid placing lesions on the eyes and inside the nostrils. Figure 2 illustrates some of our facial feature detection goals.
Fig. 2. Pink dots on this face show correctly detected eye corners, lips, and chin line. We also detect the nostrils, though nostril detection is not shown in this figure.
2.1 Related Work in Facial Feature Detection
Viola and Jones introduced a face detection method that uses a cascade of simple classifiers [5] [6]. Given an image, the classifiers quickly reject image regions that do not even slightly resemble a face. The Viola-Jones method then explores remaining image regions with additional classifiers. After applying several classifiers, the method concludes where the faces, if any, lie in the image. Over the past decade, the computer vision community extended the Viola-Jones cascade to create numerous face and facial feature detection methods. Castrillón et al. survey and evaluate several of these Viola-Jones based methods for localizing faces and facial features in images in [7] and [8]. Many of these methods are integrated into the Open Computer Vision Library (OpenCV) [9]. The work by Castrillón et al. influenced our choices of methods for detecting the face, eyes, nose, and mouth in MethMorph.
2.2 Face Detection
The facial feature detection methods that we present in Sections 2.3 through 2.6 require the face to be detected accurately. Prior to detecting facial features, we first localize the face within an image. Since our facial feature detectors rely on the face being localized properly, it is crucial for our face detection scheme to perform well for a variety of faces.
There exist numerous Viola-Jones based cascades for facial region detection. A recent study by Castrillón et al. [7] found that two cascades called FA1¹ and FA2², developed by Lienhart et al. [10], offer greater accuracy than other publicly-available cascades in detecting the facial region. The cascades by Lienhart et al. [10] were among the first Viola-Jones based cascades to appear in OpenCV [9]. Given an image, we localize the facial region with the FA1 cascade developed by Lienhart et al. [10]. Our experimental results show that face detection succeeds for 100% of the twenty-seven images that we tested.
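For illustration, the sketch below shows how the FA1 cascade (haarcascade_frontalface_alt, see footnote 1) can be applied through OpenCV's CascadeClassifier; it is not the authors' code, and the file path, image name, and detection parameters are assumptions.

```python
import cv2

# The path to the FA1 cascade file is an assumption; it depends on the
# local OpenCV installation.
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_alt.xml")

image = cv2.imread("input_face.jpg")
gray = cv2.equalizeHist(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY))

# detectMultiScale returns (x, y, w, h) rectangles for candidate face regions;
# scaleFactor and minNeighbors are typical values, not tuned for MethMorph.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                      minNeighbors=5, minSize=(80, 80))

for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```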
2.3 Robust Chin Line Detection
Once we localize the facial area, we then extract the exact position of the chin line. For chin line detection, we use a method called Lazy Snapping [11]. Lazy Snapping, which combines GraphCuts and over-segmentation, has minimal dependence on ambiguous low-contrast edges. Thus, Lazy Snapping reduces the dependence on shadows for chin line detection.
2.4 Eye and Eyebrow Detection
We localize the eye region with a detection cascade developed by Bediz and Akar [12]. This detector is implemented as part of OpenCV [9]. Out of all publicly-available Viola-Jones cascade eye region detectors, Castrillón et al. found that the Bediz-Akar detector offers the greatest accuracy [7]. Within the eye region, we apply Canny edge detection [13] to locate the corners of the eyes. Our eye detection method succeeded in 100% of our twenty-seven input images. Figures 3a and 3b illustrate successful eye detection. Note that Figures 3a and 3b also show eyebrow detection, but eyebrow detection is not essential to our MethMorph system.
2.5 Lip Detection
We localize the mouth region with the ENCARA2 mouth detection cascade [14]. Castrillón et al. found that, out of all publicly-available Viola-Jones cascade mouth region detectors, ENCARA2 offers the greatest accuracy [7]. Within the mouth region, we localize the lips with a time-adaptive self-organizing map (SOM). A study by Kamali et al. showed that this time-adaptive SOM exhibits greater accuracy than Snakes [15] for localizing the boundaries of the lips [16]. This method succeeded in localizing the lips in all of our tests.
2.6 Nose and Nostril Detection
We localize the nose region with the ENCARA2 nose detection cascade [14]. Out of all publicly-available Viola-Jones cascade nose detectors, a recent study found that ENCARA2 offers the greatest accuracy [7].
¹ FA1 is a common abbreviation for haarcascade_frontalface_alt.
² FA2 is a common abbreviation for haarcascade_frontalface_alt2.
Fig. 3. Successful eye detection (panels (a) and (b)).
Fig. 4. Nostril detection examples: (a) properly-detected nostrils; (b) mustaches make nostril detection more difficult.
Within the nose region, we apply Snakes [15] to obtain a precise outline of the nostrils. The combination of the ENCARA2 cascade and Snakes successfully localized the nostrils in 93% of the twenty-seven faces on which we tested MethMorph. Figure 4a shows the typical accuracy of our nostril detection technique. In Figure 4b, a mustache adds additional clutter, which reduces the likelihood that Snakes converge to the nostrils.
3 Facial Deformation
Once we detect the face, eyes, chin line, nostrils, and lips, our next objective is to thin the face. To perform the facial thinning, we apply a method developed by Fang and Hart for deforming the surfaces of objects in images [17]. The Fang-Hart deformation method takes an image along with an existing curve in the image, curve1, and an additional curve, curve2. The method then deforms curve1 to the shape of curve2. In addition, the Fang-Hart deformation method adapts the texture from the original image to keep its original proportions, while reshaping the texture to fit onto the deformed image. To achieve this deformation and retexturing, the Fang-Hart method extends techniques such as GraphCuts, Poisson image editing [18], and Image Completion [19]. In the context of MethMorph, the chin line detected in Section 2.3 serves as curve1 in the Fang-Hart deformation. We compute curve2 as the average of the chin line, curve1, and a custom-made “thin face” template. Then, we apply the Fang-Hart deformation method to deform curve1 into the shape of curve2. The result is a thinned face that is characteristic of meth users. We find that, for all the images that we tested, the Fang-Hart method succeeds in producing a realistically thinned face.
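To make the construction of the deformation target concrete, the following is a minimal sketch (not the authors' code) of how curve2 could be formed as the average of the detected chin line and a "thin face" template, assuming point-wise corresponding curves; the Fang-Hart deformation itself is treated as a black box.

```python
import numpy as np

def make_target_chin(curve1, thin_template):
    """Average the detected chin line (curve1) with a 'thin face' template
    to obtain the deformation target curve2.  Both inputs are assumed to be
    N x 2 arrays of corresponding (x, y) points in image coordinates."""
    curve1 = np.asarray(curve1, dtype=float)
    template = np.asarray(thin_template, dtype=float)
    assert curve1.shape == template.shape, "curves must be point-wise aligned"
    return 0.5 * (curve1 + template)

# curve2 = make_target_chin(chin_line, thin_face_template) would then be passed,
# together with curve1 and the input image, to the Fang-Hart method [17].
```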
4 Lesion Simulation
Given a deformed image in which facial feature locations are known, the final step is to replace the healthy skin texture with a texture that has the lesions that are characteristic of meth users. MethMorph’s lesion simulation method is based on Textureshop [20], a technology developed and patented [21] by the University of Illinois for retexturing surfaces depicted in a single raw photograph. What differentiates Textureshop from other retexturing approaches is that Textureshop avoids the reconstruction of a 3D model of the shape depicted in a photograph, which for single raw uncalibrated photographs is a notoriously error-prone process. It instead recovers only the orientations of the depicted surfaces and uses these orientations to distort a newly applied texture so it appears to follow the undulations of the depicted surface. While Microsoft and Adobe have already expressed interest in licensing the patented Textureshop technology for their own software, the University of Illinois is allowing MethMorph to serve as its premiere application debut. In the general case, Textureshop works as follows. First, use a reflectance model to compute the normals of surfaces in an image. Next, group the surface into patches that have similar surface normals.
Fig. 5. Facial lesion texture template
Fig. 6. MethMorph applied to a healthy face: (a) healthy face; (b) MethMorphed face.
Then, extract and store the textures from these surfaces. Finally, apply a new, custom synthesized texture to surfaces (facial skin, for example) in an image. For one or more objects in the resulting image, a new, synthetic texture follows the underlying objects. For MethMorph, Textureshop uses Poisson image editing [18] to synthesize a new “skin with lesions” texture based on the original skin texture and our custom-made lesion texture template, which is shown in Figure 5. The result is a face that has realistic lesions and also retains its original skin tone. Note that we must avoid placing lesions on the eyes or inside the nostrils. To achieve this, we use the eyes and nostrils that we detected in Sections 2.4 and 2.6. Also, since the facial skin color often does not match the skin color of the lips, we also separately blend the original lip texture with the lesion texture.
Fig. 7. MethMorph applied to a healthy face: (a) healthy face; (b) MethMorphed face.
To differentiate between the facial skin and the lips, we use the lip detection results from Section 2.5.
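As a rough illustration of this blending step, the sketch below uses OpenCV's seamlessClone, an implementation of Poisson image editing, to transfer the lesion texture onto the skin region while preserving the original skin tone. It is a stand-in for the Textureshop-based pipeline described above, not the authors' implementation; the file names and the mask construction are assumptions.

```python
import cv2
import numpy as np

# Assumed inputs: the thinned face, the lesion texture template (Fig. 5), and a
# binary skin mask (same size as the face) that excludes the eyes, nostrils and
# lips, built from the detections of Sections 2.4-2.6.
face = cv2.imread("thinned_face.png")
lesions = cv2.imread("lesion_template.png")
skin_mask = cv2.imread("skin_mask.png", cv2.IMREAD_GRAYSCALE)

# Resize the lesion texture to the face and binarize the mask.
lesions = cv2.resize(lesions, (face.shape[1], face.shape[0]))
mask = cv2.threshold(skin_mask, 127, 255, cv2.THRESH_BINARY)[1]

# seamlessClone places the masked source region so that the center of its
# bounding box lands at `center` in the destination image.
ys, xs = np.nonzero(mask)
center = (int((xs.min() + xs.max()) // 2), int((ys.min() + ys.max()) // 2))

# MIXED_CLONE keeps the strongest gradients of source and destination, which
# transfers lesion detail while retaining the underlying skin appearance.
output = cv2.seamlessClone(lesions, face, mask, center, cv2.MIXED_CLONE)
cv2.imwrite("methmorphed_face.png", output)
```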
5 Results
Figures 6b and 7b illustrate the success of MethMorph in producing realistic methamphetamine facial simulations. The success of MethMorph relies on accurate facial feature detection. Given the high success rates of the facial feature detection methods presented in Section 2, it is no surprise that MethMorph offers a high end-to-end success rate.
6 Conclusions
MethMorph applies several recent innovations in computer vision and graphics. We found that Viola-Jones based detectors are sufficient for localizing facial feature regions. The application of Snakes to the nose region offers a high success rate for precisely detecting the nostrils. The Fang-Hart deformation method succeeds in thinning the face in the manner that a meth user would experience. The Textureshop retexturing technology enables the application of lesions to the skin. In conclusion, MethMorph succeeds in simulating the visual appearance of meth addiction. Our work also demonstrates the effectiveness of the Fang-Hart deformation method and Textureshop in a real-world application.
MethMorph is slated to play a significant role in The Meth Project’s campaign against meth usage in the United States. Thus, MethMorph offers significant value to society while also demonstrating the effectiveness of recent computer vision and computer graphics methods. Acknowledgements. This work was supported by The Siebel Foundation and The Meth Project.
References
1. Nicosia, N., Pacula, R.L., Kilmer, B., Lundberg, R., Chiesa, J.: The Economic Cost of Methamphetamine Use in the United States, 2005. RAND Corporation (2009)
2. Siebel, T.M., Mange, S.A.: The Montana Meth Project: ’unselling’ a dangerous drug. Stanford Law and Policy Review 20, 405–416 (2009)
3. The Meth Project (2011), http://notevenonce.com
4. Faces of Meth (2011), http://facesofmeth.us/main.htm
5. Viola, P.A., Jones, M.J.: Robust real-time face detection. In: International Conference on Computer Vision (ICCV), p. 747 (2001)
6. Viola, P.A., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision 57, 137–154 (2004)
7. Castrillón, M., Déniz, O., Hernández, D., Lorenzo, J.: A comparison of face and facial feature detectors based on the Viola-Jones general object detection framework. Machine Vision and Applications 22, 481–494 (2011)
8. Castrillón-Santana, M., Déniz-Suárez, O., Lorenzo-Navarro, L.A.C., Lorenzo-Navarro, J.: Face and facial feature detection evaluation. In: International Conference on Computer Vision Theory and Applications (VISAPP), pp. 167–172 (2008)
9. Face recognition using OpenCV (2011), http://opencv.willowgarage.com/wiki/FaceDetection
10. Lienhart, R., Kuranov, A., Pisarevsky, V.: Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 297–304. Springer, Heidelberg (2003)
11. Li, Y., Sun, J., Tang, C.K., Shum, H.Y.: Lazy snapping. ACM Trans. Graph. 23, 303–308 (2004)
12. Bediz, Y., Akar, G.B.: View point tracking for 3d display systems. In: European Signal Processing Conference, EUSIPCO (2005)
13. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8, 679–698 (1986)
14. Castrillón, M., Déniz, O., Guerra, C., Hernández, M.: Encara2: Real-time detection of multiple faces at different resolutions in video streams. Journal of Visual Communication and Image Representation 18, 130–140 (2007)
15. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
16. Kamali Moghaddam, M., Safabakhsh, R.: Tasom-based lip tracking using the color and geometry of the face. In: International Conference on Machine Learning and Applications, ICMLA (2005)
17. Fang, H., Hart, J.C.: Detail preserving shape deformation in image editing. ACM Trans. Graph. 26 (2007)
18. Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. ACM Trans. Graph. 22, 313–318 (2003)
19. Sun, J., Yuan, L., Jia, J., Shum, H.Y.: Image completion with structure propagation. ACM Trans. Graph. 24, 861–868 (2005)
20. Fang, H., Hart, J.C.: Textureshop: texture synthesis as a photograph editing tool. ACM Trans. Graph. 23, 354–359 (2004)
21. Fang, H., Hart, J.C.: Methods and systems for image modification. Patent US 7365744 (2008)
22. Kamali, M.: Tools for Scene Analysis and Transmission. PhD thesis, University of Illinois at Urbana-Champaign (2011)
Segmentation-Free, Area-Based Articulated Object Tracking Daniel Mohr and Gabriel Zachmann Clausthal University
Abstract. We propose a novel, model-based approach for articulated object detection and pose estimation that does not need any low-level feature extraction or foreground segmentation and thus eliminates this error-prone step. Our approach works directly on the input color image and is based on a new kind of divergence of the color distribution between an object hypothesis and its background. Consequently, we get a color distribution of the target object for free. We further propose a coarse-to-fine and hierarchical algorithm for fast object localization and pose estimation. Our approach works significantly better than segmentation-based approaches in cases where the segmentation is noisy or fails, e.g. scenes with skin-colored backgrounds or bad illumination that distorts the skin color. We also present results by applying our novel approach to markerless hand tracking.
1 Introduction
Today, tracking articulated objects (e.g. human bodies or hands) is of more interest than ever before. Consequently, robust detection and recognition of articulated objects in uncontrolled environments is an active research area and still a challenging task in computer vision. Applications can be found in more and more areas, such as games (e.g. the Kinect), gesture recognition as a next-generation “touchless” touchscreen, gesture recognition in mobile devices to improve application control, or rehabilitation, to mention only some of them. When utilizing the human hand as an input device (as opposed to the whole human body), cameras of very high resolution are mandatory. Furthermore, the hand has about 26 DOF. Thus, smart algorithms and a lot of computation power are needed to handle the large configuration space. Consequently, one has to find a good compromise between accuracy and computation time. On the one hand, the approach has to be robust enough, so we want to eliminate as many error sources as possible (like edge detection or segmentation). On the other hand, the approach still needs to be computationally very efficient. Many tracking systems utilize temporal coherence in order to save computation time: only the close neighborhood of the object’s position and pose from the previous frame is scanned for updating position and pose. But this has the serious disadvantage that it often leads to drifting and, typically, after some hundred frames, the object is lost.
Fig. 1. We turn the classical approach upside-down and first test different shape hypotheses and then check the color distribution while classical approaches first estimate the color distribution (which is error-prone because it is not known well for real images) and then check the shapes
In order to achieve better robustness, we propose to do tracking by per-frame detection. Most approaches need some kind of low-level feature extraction as a preprocessing step, e.g. edge detection or foreground segmentation. There exists a large body of previous work to reduce the disadvantages of such a feature extraction. Our novel method works, as an overview, as follows. For each possible pose and orientation of the model, we precompute its fore- and background information. This is then transformed into an object descriptor, which we will refer to as a template in the following. It is crucial to use a very robust similarity measure between the templates and the input image. We propose a new similarity measure based on the color distribution divergence. Our similarity measure works directly on the color image. The basic idea is to compare the color distributions of the foreground and background regions for each hypothesis. The more divergent the color distributions are, the higher the probability that the hypothesis is observed. In a way, our novel approach can be viewed as turning the classical matching approach upside-down: instead of first forming a hypothesis about the color distribution of the object and then checking whether or not it fits the expected shape, we first form a hypothesis about the shape and then check whether or not the color distribution fits the expectation (illustrated in Fig. 1). Due to the large configuration space of the human hand, we need to speed up the localization and recognition. We propose a new coarse-to-fine approach where the step size depends on the template shapes and is computed offline during template generation. In addition, we combine a template hierarchy with this approach to further reduce the computation time from O(n) to O(log n), n = #templates. Basically, we coarsely scan the input image with the root node of the hierarchy and use the k best matches for further tree traversal. Our main contributions are: 1. A novel and robust similarity measure based on a new kind of color distribution divergence, which does not need any segmentation or other error-prone feature extraction. 2. Our approach can be trivially extended to include other features (e.g. depth values using ToF-cameras or the Kinect) because our approach is based on
comparing multivariate distributions between object foreground and background, and not any color specific properties. 3. A coarse-to-fine and hierarchical approach for fast object detection and recognition. Based on the template description, we estimate the smallest possible distance between two local maxima in the confidence map. We use this knowledge to determine the scan step size and combine it with a multi-hypothesis and hierarchical template matching.
2 Related Work
Most approaches for articulated object tracking use edge features or a foreground segmentation. These features are used to define a similarity measure between the target object and the observation (input image). Most approaches are model-based. Basically, for each parameter set (pose) of the object a template is generated and used for matching. Typical edge-based approaches to template matching use the chamfer distance [1],[2] or the Hausdorff distance [3]. Chamfer matching for tracking of articulated objects is, for example, used by [4], [5], [6],[7], [8], [9],[10] and [11], the Hausdorff distance by [12],[13] and [14]. The generalized Hausdorff distance is more robust to outliers. The computation of the chamfer distance can be accelerated by using the distance transform of the input image edges. Both chamfer and Hausdorff distance can be modified to take edge orientation into account [12][8][15]. The main problems of edge-based tracking are the edge responses in the input image. Either the approach needs binary edges or it works with the intensities themselves. In the first case, thresholds for binarization have to be chosen carefully and are not easy to determine. In the second case, intensity normalization has to be performed. Segmentation-based approaches apply color or background segmentation. The segmentation result is then compared to the object silhouette. [16,17] use the difference between the model silhouette and the segmented foreground area in the query image as similarity measure. A similarity measure is used by Kato et al. [9]. They use the differences between template foreground, segmentation, and intersection of template and segmentation. In [18], the non-overlapping area of the model and the segmented silhouettes is integrated into classical function optimization methods. In [19,20] a compact description of the hand model is generated. Vectors from the gravity center to sample points on the silhouette boundary, normalized by the square root of the silhouette area, are used as hand representation. During tracking, the same transformations are applied to the binary input image and the vector is compared to the database. A completely different approach is proposed by Zhou and Huang [21]. They use local features extracted from the silhouette boundary obtained by a binary foreground segmentation. Each silhouette is described by a set of feature points. The chamfer distance between the feature points is used as similarity measure. In [22] the skin-color likelihood is used. For further matching, new features, called likelihood edges, are generated by applying an edge operator to the likelihood ratio image. In [23,24,8], the skin-color likelihood map is directly compared
with hand silhouettes. The product of all skin probabilities at the silhouette foreground is multiplied with the product of all background probabilities in the template background. Similar to edge-based matching, segmentation-based approaches either need a binary segmentation, or directly work with the segmentation likelihood map. Thus, the disadvantages of these approaches are binarization errors or false segmentation, i.e. classifying background regions as foreground. Wang [25] uses a completely different segmentation-based approach by requiring users to wear colored gloves. Each of the colors corresponds to a specific part of the hand. A distance measure between two arbitrary hand poses, based on the color coding, is defined. In a preprocessing step, a large database, containing hand descriptors based on the color glove coding, is generated. During tracking, this database is compared to the hand observed in the input image using a Hausdorff-like distance between the centers of the color regions. The disadvantages of the approach are that a homogeneous background is needed and a special glove is necessary. We propose an approach that does not need features like edges or a segmentation. Our method directly works on the color input image. The basic idea is to estimate color distributions of the fore- and background. Here, the foreground/background regions are given by the template descriptor and the corresponding color values in the input image. In other words, our algorithm simultaneously performs shape matching and the target object color distribution estimation.
3 Matching Object Templates
In the following, we will explain our proposed similarity measure between the target object and the input image. A similarity measure, in general, is used to compute the probability that, at a given position and scale in the input image, the target object in a given pose is observed. Henceforth, we assume that for any object pose an object silhouette area is given, which is denoted as a template. The goal is to estimate the probability for the observation of a template at a specific scale in the image. For detection, this can be done at each position and scale in the image.
3.1 Color Divergence-Based Similarity Measure
Our similarity measure is based on the idea that the target object has a different color distribution than its surrounding background. On a very general level, this is similar to classical approaches based on image segmentation. However, in our approach, the only a priori knowledge is the set of template descriptors of the object shapes in each pose. We do not need a priori knowledge about the object color distribution. Given a template and a position in the input image, the pixels that correspond to the foreground and the background in the template, respectively, are determined.
Fig. 2. Example of the color distribution of a human hand and a background. The image is decomposed into the hand and the background. The first two images show the hand and the corresponding 3D color histogram. The last two images show the surrounding background and its color distribution. The color distributions of the hand and the background are quite different and can be used as similarity measure for hand shape matching.
These form a hypothesis about the object pose. If the object pose, represented by the template, is actually found at the given position, the foreground and background color distributions must be different. If a different, or no, shape is found there, the color distributions of foreground and background must overlap each other significantly. For illustration, Fig. 2 shows an example. Consequently, the dissimilarity between the two distributions can be used as a measure for template similarity. Obviously, the color distribution estimation has to be done as fast as possible. The Kullback-Leibler divergence is not feasible, because the computation of the histograms would take too long. A second disadvantage of a histogram-based representation is that there are possibly not enough color pixels to densely fill all histogram bins belonging to the inherent color distribution. Representing the color distributions by a normal distribution does not have these disadvantages. We use one multivariate Gaussian each to represent foreground and background, respectively. In our application to hand tracking, we observed that the approximation of the color distribution of the human hand by one Gaussian is sufficient in most cases. Of course, the background region should not be chosen too large, in order to obtain an appropriate approximation by one Gaussian. Assume the means, μ_fg, μ_bg, and the covariances of the foreground and background regions, Σ_fg, Σ_bg, given in color space. Then we use the following color distribution similarity:

\[ D = \frac{G(\mu_{bg} \mid \mu_{fg}, \Sigma_{fg}) + G(\mu_{fg} \mid \mu_{bg}, \Sigma_{bg})}{2} \qquad (1) \]

with the unnormalized Gaussian function \( G(x \mid \mu, \Sigma) = |(2\pi)^3 \Sigma|^{1/2}\, \mathcal{N}(x \mid \mu, \Sigma) \). Using the normal distribution itself in Eq. 1 would result in lower dissimilarity values for higher covariances while having the same separability of the distributions.
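The following minimal sketch (not the authors' implementation) computes D of Eq. (1) directly from the foreground and background pixel sets of a hypothesis; the small regularization term added to the covariances is an assumption for numerical stability.

```python
import numpy as np

def unnormalized_gaussian(x, mu, cov):
    """G(x | mu, cov) = exp(-0.5 (x - mu)^T cov^{-1} (x - mu)),
    i.e. the normal density without its normalization constant."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))

def color_divergence_similarity(fg_pixels, bg_pixels, eps=1e-6):
    """D of Eq. (1): each region is modeled by one multivariate Gaussian.
    fg_pixels, bg_pixels: (N, 3) arrays of color values.  A small D means
    well-separated fore- and background color distributions, i.e. a likely
    correct shape hypothesis."""
    mu_fg, mu_bg = fg_pixels.mean(axis=0), bg_pixels.mean(axis=0)
    cov_fg = np.cov(fg_pixels, rowvar=False) + eps * np.eye(3)
    cov_bg = np.cov(bg_pixels, rowvar=False) + eps * np.eye(3)
    return 0.5 * (unnormalized_gaussian(mu_bg, mu_fg, cov_fg) +
                  unnormalized_gaussian(mu_fg, mu_bg, cov_bg))
```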
3.2 Fast Color Distribution Estimation
To reduce the computation time, we build upon the approach proposed by [26] for the template representation. They approximate each template silhouette by a set of axis-aligned rectangles. Then, they utilize that, using the integral image, the sum over the whole foreground and background region can be computed in O(#rectangles). In the following, we will explain the computation of the mean and covariance, irrespective of the region being foreground or background in the template. The same calculations are, of course, used to compute either of the two. Before we can explain the computation of mean and covariance matrix, we need to make some definitions: I is a color input image; II the integral image of I; R = {R_i}_{i=1···n} a set of rectangles representing a template region; and II(R_i) = Σ_{x∈R_i} I(x) the sum of all pixels over the rectangle R_i in I. The mean μ can be trivially computed by

\[ \mu \propto \sum_{R_i \in R} II(R_i) \qquad (2) \]

The covariance matrix cannot be computed exactly using the rectangle representation because the off-diagonal entries cannot be computed using II. We could, of course, compute the integral image of I^2 with I^2(p) = I(p)I(p)^T. But this would need 6 additional integral images (6 and not 9, because I(p)I(p)^T is a symmetric matrix) and thus result in too many memory accesses and a high latency per frame. Therefore, we have decided to estimate the covariance matrix in the following way. We let each point inside a rectangle R_i be represented by the mean μ_i = II(R_i)/|R_i| of the rectangle. The covariance can now be estimated by

\[ \Sigma \propto \sum_{x \in R} x x^T - \mu\mu^T = \sum_{R_i \in R} \sum_{x \in R_i} x x^T - \mu\mu^T \approx \sum_{R_i \in R} |R_i| \cdot \mu_i \mu_i^T - \mu\mu^T \qquad (3) \]
To avoid obtaining too crude an approximation of the covariance matrix, we subdivide big rectangles and use the mean values of the subregions to compute the covariance matrix. We have tested two alternatives. The first one is a simple subdivision of the rectangle into rectangular blocks of equal size. The second method is an adaptive subdivision: we subdivide a rectangle successively until the covariance estimated by all subrectangles does not change significantly anymore. We expected the second method to yield better results, but it has the disadvantage that the covariance matrix fluctuates when slightly moving the template position in the input image. This disturbs the mode finding (i.e. detecting the most probable matching position in the input image) by our method described in Sec. 4. Hence, the simpler subdivision method works better for us. Please note that our algorithm could also take image segmentation results into account. More generally, we are not limited to any specific dimensionality of the input, i.e. we could easily incorporate other modalities such as depth values.
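To illustrate the rectangle-based estimation, here is a minimal sketch (not the authors' code) of Eqs. (2) and (3) using a summed-area table; the half-open rectangle convention is an assumption, and the subdivision refinement discussed above is omitted.

```python
import numpy as np

def integral_image(img):
    """Summed-area table of a color image (H x W x 3), per channel."""
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of pixels in the half-open rectangle [y0:y1, x0:x1] (4 lookups)."""
    s = ii[y1 - 1, x1 - 1].copy()
    if y0 > 0:
        s -= ii[y0 - 1, x1 - 1]
    if x0 > 0:
        s -= ii[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0:
        s += ii[y0 - 1, x0 - 1]
    return s

def region_mean_cov(ii, rects):
    """Mean (Eq. 2) and rectangle-wise covariance approximation (Eq. 3) of the
    colors in a template region given as rectangles (y0, x0, y1, x1)."""
    sums = [rect_sum(ii, *r) for r in rects]
    areas = [(r[2] - r[0]) * (r[3] - r[1]) for r in rects]
    n = float(sum(areas))
    mu = sum(sums) / n
    # Each rectangle contributes |R_i| * mu_i mu_i^T, with mu_i = sum_i / |R_i|.
    second_moment = sum(np.outer(s, s) / a for s, a in zip(sums, areas)) / n
    return mu, second_moment - np.outer(mu, mu)
```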
Algorithm 1. objectDetection( I, H, k )
Input: H = template hierarchy, I = input image, k = number of best hypotheses
Output: M = k best matches, each containing a target object position and pose
  coarsely scan I with root(H), take the k best matches → match candidates C
  apply local optimization to each candidate ∈ C → new set C   // we use [27]
  while C not empty do
    Cnew ← ∅
    foreach c ∈ C do
      if c.template is leaf in H then
        M ∪ {c} → M
      else   // note: c.template is an inner node in H
        Cnew ∪ { (c.pos, templ) | templ ∈ children of c.template } → Cnew
    apply local optimization to Cnew → C
    k best matches of C → C
  k best matches of M → M
4 Coarse-to-Fine and Hierarchical Object Detection
So far, we have discussed how to efficiently compute the probability that an object in a specific pose is observed at a position in the input image by a similarity measure between object template and input image. In this section, we will describe how to use this similarity measure to detect the target object position, size, and pose. To be able to do this in real-time, we save a large number of similarity computations by combining fast local optimization with a template hierarchy (which reduces the time complexity from O(n) to O(log n)) as follows. We match the template that represents the root node in the template hierarchy with the input image with a large, template-dependent step size. For the k best matches in the image, we use a hill climbing method to find the local maximum in the likelihood map, i.e. the position in the input image that matches best to the template. Next, we replace the root template by the templates of its child nodes and perform a local optimization again. Again, we keep the k best child nodes. We apply hill climbing again and so on, until we reach the leaves of the template hierarchy. Finally, the best match obtained by comparing a leaf template determines the final object pose and position in the input image. We build on the approach proposed by [26] to construct the hierarchy with the following improvement: in each inner node, we only cover a region by rectangles if it is foreground/background for all, and not only most, of the templates. This ensures that only regions that correspond to foreground/background for all templates in the node are used to compute the fore-/background color distributions. In addition, to ensure that the intersecting fore-/background area at each inner node (excluding the intersection of all ancestor nodes) is not empty, we have to allow a dynamic number of child nodes. Consider, for example, a template set that is split into two subsets. But the solely intersecting area of each subset still is an intersecting area for all of the templates. We are not able to distinguish it from the parent node because the intersecting areas are the same.
Fig. 3. Illustration of our coarse-to-fine and hierarchical detection approach. First, we coarsely match the root node of the template hierarchy (a) to the input image (b). For the best k matches (here 4), we perform a function optimization to find the best matching image position (c). Next, we use these matches as an estimate for the positions (e) for the child nodes in the template hierarchy (d) and search for the local maxima again. When arriving at the leaves of the hierarchy, we use the best match as final hand pose estimate (f).
In those cases, we split the template into three subsets, or even more if necessary. It remains to estimate the scan step size such that no local maximum in the confidence map is missed. This can be done offline, and it depends only on the templates. Note that inner nodes in the hierarchy can be considered as templates, too, in that they describe the common properties of a set of templates. Therefore, the scan step size should be chosen not greater than the extent of the hill of a maximum in the likelihood map. This value can be determined by autocorrelating the template with itself. We do this not only in 2D image space, but also in scale space. Algorithm 1 shows the above two approaches combined, and Figure 3 illustrates the method.
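The sketch below (not the authors' code) illustrates how such a step size could be derived from the template silhouette: it measures how far the binary template can be shifted against itself before the normalized overlap, i.e. the central peak of the autocorrelation, drops below a threshold. The threshold value of 0.5 and the restriction to axis-aligned shifts are assumptions.

```python
import numpy as np

def scan_step_from_autocorrelation(mask, threshold=0.5):
    """mask: 2-D array, 1 inside the template silhouette, 0 outside.
    Returns the largest shift (in pixels) for which the normalized overlap of
    the template with a shifted copy of itself stays above `threshold`."""
    mask = mask.astype(np.float64)
    area = mask.sum()
    h, w = mask.shape

    def overlap(dy, dx):
        shifted = np.zeros_like(mask)
        shifted[dy:h, dx:w] = mask[:h - dy, :w - dx]   # shift down / right
        return (mask * shifted).sum() / area

    step = 1
    while step < min(h, w) and min(overlap(step, 0), overlap(0, step)) > threshold:
        step += 1
    return max(step - 1, 1)
```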
5 Results
We applied our approach to tracking of the human hand and evaluated it on several datasets. We captured video sequences with different hand movements and image backgrounds. Our approach is silhouette area-based and therefore best compared to segmentation-based approaches. With the human hand, skin color has proven to work well in many cases. Consequently, we compared our approach with a skin segmentation-based approach. We used three datasets, each with different skin segmentation quality. The first video sequence is a hand moving in front of a skin-colored background. Approaches based on skin color segmentation would completely fail under such conditions. The second dataset consists of a heterogeneous background.
Fig. 4. Each group shows the results for a specific input dataset. Each bar within each group shows the mean and standard deviation of the RMS error between the brute-force and our coarse-to-fine detection. An RMS error of 1 indicates the maximally possible error.
Several background regions of moderate size would be classified as skin, but in contrast to the first dataset, the hand silhouette often is clearly visible after skin segmentation. The third dataset contains almost no skin-colored background and, thus, is well suited for skin segmentation. For each dataset, we tested four different hand movements, all of which include translation and rotation in the image plane. The four hand gestures are: an open hand, an open hand with additionally abducting the fingers, an open hand with additionally flexing the fingers, and a pointing hand. Overall, we have 12 different configurations. Due to the lack of ground truth data, the quality is best evaluated by a human observer; computing an error measure (e.g. the RMS) between the results of two approaches does not make any sense. Therefore, we provide video sequences¹ taken under all the above setups. In the video, the skin segmentation-based approach proposed by [26] and our approach are shown. Both approaches are tried with a brute-force dense sampling and with our coarse-to-fine and hierarchical detection approach from Sec. 4. In the brute-force approach, we have chosen a scan step size of 12 pixels in x and y direction and 10 template scalings between 200 and 800 pixels. In all cases, the input image resolution is 1280 × 1024. In the brute-force approach, the whole template hierarchy (if any) is traversed at each position separately. This needs several minutes per frame. To achieve an acceptable detection rate, one can stop traversing the hierarchy if the matching probability is lower than a threshold τ. The risk with thresholding is that all matches could be below the threshold and the hand is not detected at all.
¹ http://www.youtube.com/watch?v=ZuyKcSqpkkE, http://cg.in.tu-clausthal.de/research/handtracking/videos/mohr_isvc2011.avi
We have chosen τ = 0.7, which works well for our datasets. Note that our coarse-to-fine detection approach does not need any threshold and consequently does not have this disadvantage. The brute-force approach will have, of course, slightly higher quality, but it will cost significantly more computation time, depending on the scan step size and the threshold τ. In order to examine the error of our coarse-to-fine hierarchical matching, we compared it to the brute-force dense sampling approach. Using both our novel method and the brute-force method, we determined the best match for each image in the video sequence. For ease of comparison, hand positions, orientations, and finger angles were normalized. These will be called configurations in the following. Then, we computed the RMS error between the two configurations over the whole video sequence. Figure 4 shows the results for each data set. Obviously, our method performs better in the “open hand” and “abducting finger” sequence. The reason is that “pointing hand” and “moving finger” templates have a smaller intersection area in the root node. This increases the chances that the tree traversal finds “good” matches for nodes close to the root in image areas where there is no hand at all. Consequently, fewer match candidates remain for the true hand position during hierarchy traversal. We also measured the average computation time for each dataset (10 frames per dataset). The computation time to detect and recognize the hand in the input image is about 3.5s for the brute-force approach. This, of course, is only achieved by using a manually optimized threshold τ, such that, for most positions in the input image, only the root node or a small part of the hierarchy is traversed. Our coarse-to-fine approach needs about 1.6s per frame (and no thresholding).
6 Conclusions
In this paper, we have presented a novel, robust detection and recognition approach for articulated objects. We propose a color distribution divergence-based similarity measure that does not need any error-prone feature extraction. Thus, our method can adapt much better to changing conditions, such as lighting, different skin color, etc. We have also presented a coarse-to-fine and hierarchical object detection approach using function optimization methods and multi-hypothesis tracking to reduce computation time with only a small loss in accuracy. In addition, there are no thresholds that need to be adjusted to a given condition. And finally, it is straightforward to incorporate other input modalities into our similarity and detection approach, such as range images or HDR images. In an application to hand tracking, we achieve good results in difficult setups, where, for example, skin segmentation approaches will completely fail. Compared to the brute-force detection approach (that uses thresholding during template tree traversal to significantly prune the tree), our approach is about 2.2 times faster and about as reliable as the brute-force approach. In the future, we want to extend the color distribution model of the object background for better handling of multi-colored backgrounds (e.g. by a Mixture of Gaussians).
Additionally, we want to replace the local optimization by a multi-grid approach. We are also currently working on an implementation of our method on a massively parallel architecture (GPU) to reduce the computation time.
References
1. Barrow, H.G., Tenenbaum, J.M., Bolles, R.C., Wolf, H.C.: Parametric correspondence and chamfer matching: Two new techniques for image matching. In: International Joint Conference on Artificial Intelligence (1977)
2. Borgefors, G.: Hierarchical chamfer matching: A parametric edge matching algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence (1988)
3. Huttenlocher, D., Klanderman, G.A., Rucklidge, W.J.: Comparing images using the hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence (1993)
4. Athitsos, V., Sclaroff, S.: 3d hand pose estimation by finding appearance-based matches in a large database of training views. In: IEEE Workshop on Cues in Communication (2001)
5. Athitsos, V., Sclaroff, S.: An appearance-based framework for 3d hand shape classification and camera viewpoint estimation. In: IEEE Conference on Automatic Face and Gesture Recognition (2002)
6. Athitsos, V., Alon, J., Sclaroff, S., Kollios, G.: Boostmap: A method for efficient approximate similarity rankings. In: IEEE Conference on Computer Vision and Pattern Recognition (2004)
7. Gavrila, D.M., Philomin, V.: Real-time object detection for smart vehicles. In: IEEE International Conference on Computer Vision (1999)
8. Sudderth, E.B., Mandel, M.I., Freeman, W.T., Willsky, A.S.: Visual hand tracking using nonparametric belief propagation. In: IEEE CVPR Workshop on Generative Model Based Vision, vol. 12, p. 189 (2004)
9. Kato, M., Chen, Y.W., Xu, G.: Articulated hand tracking by pca-ica approach. In: International Conference on Automatic Face and Gesture Recognition, pp. 329–334 (2006)
10. Toyama, K., Blake, A.: Probabilistic tracking with exemplars in a metric space. International Journal of Computer Vision (2002)
11. Lin, Z., Davis, L.S., Doermann, D., DeMenthon, D.: Hierarchical part-template matching for human detection and segmentation. In: IEEE International Conference on Computer Vision (2007)
12. Olson, C.F., Huttenlocher, D.P.: Automatic target recognition by matching oriented edge pixels. IEEE Transactions on Image Processing (1997)
13. Thayananthan, A., Navaratnam, R., Stenger, B., Torr, P., Cipolla, R.: Pose estimation and tracking using multivariate regression. Pattern Recognition Letters (2008)
14. Stenger, B., Thayananthan, A., Torr, P.H.S., Cipolla, R.: Hand pose estimation using hierarchical detection. In: International Workshop on Human-Computer Interaction (2004)
15. Shaknarovich, G., Viola, P., Darrell, T.: Fast pose estimation with parameter-sensitive hashing. In: IEEE International Conference on Computer Vision (2003)
16. Lin, J.Y., Wu, Y., Huang, T.S.: 3D model-based hand tracking using stochastic direct search method. In: International Conference on Automatic Face and Gesture Recognition, p. 693 (2004)
17. Wu, Y., Lin, J.Y., Huang, T.S.: Capturing natural hand articulation. In: International Conference on Computer Vision, vol. 2, pp. 426–432 (2001)
18. Ouhaddi, H., Horain, P.: 3D hand gesture tracking by model registration. In: Workshop on Synthetic-Natural Hybrid Coding and Three Dimensional Imaging, pp. 70–73 (1999)
19. Amai, A., Shimada, N., Shirai, Y.: 3-d hand posture recognition by training contour variation. In: IEEE Conference on Automatic Face and Gesture Recognition, pp. 895–900 (2004)
20. Shimada, N., Kimura, K., Shirai, Y.: Real-time 3-d hand posture estimation based on 2-d appearance retrieval using monocular camera. In: IEEE International Conference on Computer Vision, p. 23 (2001)
21. Zhou, H., Huang, T.: Okapi-chamfer matching for articulated object recognition. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1026–1033 (2005)
22. Zhou, H., Huang, T.: Tracking articulated hand motion with eigen dynamics analysis. In: IEEE International Conference on Computer Vision, vol. 2, pp. 1102–1109 (2003)
23. Stenger, B., Thayananthan, A., Torr, P.H.S., Cipolla, R.: Model-based hand tracking using a hierarchical bayesian filter. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1372–1384 (2006)
24. Stenger, B.D.R.: Model-based hand tracking using a hierarchical bayesian filter. Dissertation submitted to the University of Cambridge (2004)
25. Wang, R.Y., Popović, J.: Real-time hand-tracking with a color glove. ACM Transactions on Graphics 28 (2009)
26. Mohr, D., Zachmann, G.: Fast: Fast adaptive silhouette area based template matching. In: Proceedings of the British Machine Vision Conference, pp. 39.1–39.12. BMVA Press (2010), doi:10.5244/C.24.39
27. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Quasi-newton or variable metric methods in multidimensions. In: Numerical Recipes, The Art of Scientific Computing, pp. 521–526. Cambridge University Press, Cambridge (2007)
An Attempt to Segment Foreground in Dynamic Scenes Xiang Xiang* Key Lab of Intelligent Information Processing of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, China
[email protected] http://www.jdl.ac.cn/user/xxiang
Abstract. In general, human behavior analysis relies on a sequence of human segments, e.g. gait recognition aims to address human identification based on people's manners of walking, and thus relies on the segmented silhouettes. Background subtraction is the most widely used approach to segment foreground, while dynamic scenes make it difficult to work. In this paper, we propose to combine Mean-Shift-based tracking with adaptive scale and Graphcuts-based segmentation with label propagation. The average precision on a number of sequences is 0.82, and the average recall is 0.72. Besides, our method only requires weak user interaction and is computationally efficient. We compare our method with its variant without label propagation, as well as GrabCut. For the tracking module only, we compare Mean Shift with several state-of-the-art methods (i.e. OnlineBoost, SemiBoost, MILTrack, FragTrack).
1 Introduction
Human identification at a distance is still a challenging problem in computer vision, pattern recognition and biometrics. Different from face, iris, fingerprint or palm, gait can be detected and measured in low-resolution images. Most existing training data are collected indoors (see Tab. 1), where the background is nearly static. So background subtraction is the most widely used approach to segment foreground, while dynamic scenes outdoors make it difficult to work. Some methods improve the gait appearance feature to tolerate imperfect segments [1]. However, a more adaptive solution is still needed. Besides, foreground segmentation is beneficial to video editing, visual hull reconstruction and human behavior analysis, e.g. pose/gesture/action analysis. Here, for the purpose of establishing a gait dataset, the foreground is a pedestrian specified by a user of our application (see Fig. 1). In fact, there is no clear definition of foreground, which actually can be an arbitrary object of interest. Thus, this paper is about foreground segmentation, yet in dynamic scenes. That is why we incorporate tracking, while tracking itself is not our emphasis. The challenges mainly come from three aspects: 1) Dynamic scenes involve global motion, a multilayered and variegated background, scale variance, etc. 2) It is preferred but difficult to do segmentation with less user interaction. 3) Efficiency is also important.
* This work was mostly done while the author was an exchange student at Intelligent Media Lab, Institute of Scientific and Industrial Research, Osaka University.
Table 1. Primary gait datasets and their information (‘Bg’ is short for ‘Background’) Institute
Indoor
Outdoor
Views
Person Num
Segmentation
MIT UMD
Static No
No
Static
Static
Southampton U
Static Static Static Static No Static
Dynamic No Static Static Static No
24 Partial: 55; partial: 25 Partial:124; Partial: 20 115 25 20 122 6 >300
Bg subtraction Bg subtraction
NLPR, CAS
1 Partial: 2; partial: 4 Indoor: 11; outdoor: 3 1 6 3 2 1 25
CMU Gatech USF UCSD Osaka U
Partial static, partial dynamic
Bg subtraction Bg subtraction Bg subtraction Radar signals Bg subtraction Not done Bg sub., Graph cuts
Fig. 1. Our application for segmenting foreground in consecutive frames. In the initial frame, the user coarsely labels the foreground and neighboring background, which result in a Graph-cuts-based segmentation and an initialization for Mean-Shift-based tracking. This sequence’s scenario: a pedestrian, wearing dark-colored clothes, is walking away from a rotating camera on a road with a multi-level background including waving dark green trees (i.e. dynamic texture).
To address these problems, we propose to combine Mean-Shift-based tracking and Graph-cuts-based segmentation with weak user interaction, as displayed in Fig. 1. The average precision on a number of sequences is 0.82, and the average recall is 0.72. Besides, our method is very efficient, i.e. 26 fps on a common PC (e.g. with a Core i3 CPU) for a 360×240 video. The main reasons are: 1) our tracker (i.e. Mean Shift) has low complexity; 2) segmentation is constrained within a tracked bounding box region; 3) optimization via Max-flow/Min-cut is efficient as well. For tracking, we admit that Mean Shift [20] with iteration on color similarity is not robust enough to the distraction of background with similar color. However, note that this work is mainly aimed at establishing a gait dataset. Our outdoor scenarios for collecting training data are not totally in the wild. In general, the color feature is discriminative enough to distinguish a human from the background; also, occlusion seldom happens. Do not forget that the focus of this paper is segmentation. The handling of occlusion and background distraction is out of the scope of our current work. However, we do justify many design choices for tracking. In Sec. 4, we compare Mean Shift with state-of-the-art tracking methods on our data and find that the performances are about the same. Our sequences may not be very difficult, but they are what we need to process.
For segmentation, we employ interactive Graph cuts [21], which formulate binary segmentation as energy minimization over a graph. The optimization problem is solved by a Max-flow/Min-cut algorithm [2], which finds the cheapest subset of edges whose removal separates the seeds marking the inside of the foreground from those marking the background. However, because of its globally optimal nature, it is prone to capturing outlying areas similar to the object of interest [3]. Thus, a spatial constraint is required; in this paper, we constrain segmentation within a tracked bounding-box region. In the following, related work is first reviewed in Sec. 2. Our method is elaborated in Sec. 3. Experiments are presented in Sec. 4, with conclusions following in Sec. 5.
2 Related Work
In this paper, we specify segmentation as foreground segmentation (i.e. figure-ground segmentation) in motion. It is somewhat different from motion/video segmentation, which focuses more on segmenting different motion layers/regions. Within the scope of our discussion, the widely used background subtraction is efficient when the scenes to be modeled consist of static structures with limited perturbations [4,5]. Graph-based methods [6,7,8,9] actually achieve better performance. GMM [6,9], 3-D histogram [7] and KDE [10] models can all be used to model a trimap - foreground, background and unknown region. However, in dynamic scenes, a better solution is to use spatio-temporal volumes to preserve temporal coherence [11,12,13,14]; e.g. J. Niebles et al. utilize a top-down pictorial structure to extract the human volume [11], and then extend it by incorporating bottom-up appearance [12]. Another solution is to fuse single-frame segmentation with tracking [15,16]. The common idea is to combine various features under a random field framework and update them. In [15], A. Bugeau et al. utilize Mean Shift for clustering and minimize the energy under a MAP/MRF framework. The descriptor is formed with spatial, optical flow and photometric features to cluster consistent groups of points, which ensures robustness under illumination changes and fast global motion. In [16], X. Ren et al. integrate static cues and temporal coherence in a CRF to segment objects repeatedly. Brightness, color and texture are summarized in a Probability-of-Boundary superpixel map; appearance, scale and spatial cues are updated, which makes their approach adaptive to motion. For tracking, the recent state-of-the-art performances are generally given by a class of on-line boosting methods, e.g. OnlineBoost [17], SemiBoost [18], MILTrack [19], etc. The reason is that the more the appearance model is associated with variations in photometry and geometry, the more specific it is in representing a particular object, and the less likely the tracker is to get confused by occlusion or background clutter. Therefore, on-line adaptation of the appearance model allows the object representation to maintain specificity at each frame as well as overall generality to large variations.
3 Our Method Our method for foreground segmentation in videos consists of user interaction, bounding box tracking and segmentation. An overview of our workflow is as follows.
1. Interaction. (I1) In the initial frame, a user labels the foreground and its neighboring background. The foreground and background labels are denoted FG1 and BG1, respectively. (I2) We get a bounding box loosely containing BG1 using its outer limiting positions. Since the image patch within this bounding box contains pixels with a foreground label, pixels with a background label, and pixels without a label, it is named a 'trimap', short for 'triple map'. It consists of FG1, BG1 and an unknown region.
2. Tracking with adaptive scale. We set the trimap as the target and track it using Mean Shift with adaptive scale, thereby obtaining a sequence of trimaps T. Besides the position of the trimap, the labels within it are also updated, in a simple way explained below.
3. Segmentation with label propagation (for frame number t = 1, 2, ...). (S1) We obtain a segment St via Graph cuts with constraints from the labels FGt and BGt. (S2) We propagate the labels to the next frame based on a simple strategy:
FG_{t+1} = FG_t ∩ S_t,    BG_{t+1} = BG_t ∩ (T_{t+1} − S_t)   (1)
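To make the workflow concrete, the following is a minimal per-frame sketch of steps 2-3 above, assuming labels are stored as boolean pixel masks; `track_trimap` and `run_graph_cuts` are hypothetical stand-ins for the Mean-Shift tracker and the Graph-cuts solver described in the rest of this section.

```python
def propagate_labels(fg_prev, bg_prev, segment, trimap_next):
    """Label propagation of Eq. (1); labels are boolean pixel masks (NumPy arrays)."""
    fg_next = fg_prev & segment                  # FG_{t+1} = FG_t ∩ S_t
    bg_next = bg_prev & trimap_next & ~segment   # BG_{t+1} = BG_t ∩ (T_{t+1} − S_t)
    return fg_next, bg_next

def segment_sequence(frames, trimap, fg, bg, track_trimap, run_graph_cuts):
    """Per-frame loop: segment inside the tracked trimap, then propagate labels."""
    segments = []
    for t, frame in enumerate(frames):
        seg = run_graph_cuts(frame, trimap, fg, bg)          # (S1): segment S_t
        segments.append(seg)
        if t + 1 < len(frames):
            trimap = track_trimap(frames[t + 1], trimap)     # Mean-Shift step: T_{t+1}
            fg, bg = propagate_labels(fg, bg, seg, trimap)   # (S2): Eq. (1)
    return segments
```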
Note that this step could be replaced by more elaborate label propagation methods (e.g. using a graphical model [24]); however, our simple strategy also turns out to work well. In the following, we elaborate how we segment the foreground within a trimap via Graph cuts and how we track the trimap via Mean Shift.
3.1 Segmentation via Graph Cuts
The tracked bounding box with propagated labels is set as the trimap for Graph cuts. In general, with an image treated as a graph and each pixel as a node, graph-based methods can be represented as

E(L) = Σ_{p∈P} D_p(L_p) + Σ_{(p,q)∈N} V_{p,q}(L_p, L_q),   (2)

where E is the total energy over the graph; P is the set of all nodes p, and N is the set of all pairs of neighboring nodes; L = {L_p | p ∈ P} is a set of labels representing whether a node denotes foreground, background or unknown; D_p is a data penalty function, and V_{p,q} is an interaction potential. The first term is called the data term and the second the smoothness term [14]. To construct E, both terms need to be calculated; to minimize E, we use the Max-flow/Min-cut algorithm [2].
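As an illustration of Eq. (2), the sketch below builds the two kinds of terms for a color image patch: unary (data) costs from foreground/background color histograms and a contrast-weighted pairwise (smoothness) cost between neighbors. The histogram models, parameter values and the external max-flow solver are assumptions for illustration, not the exact implementation of this paper.

```python
import numpy as np

def data_terms(patch, fg_hist, bg_hist, bins=16, eps=1e-6):
    """Unary costs D_p(L_p): negative log-likelihood under FG/BG color histograms.
    patch: HxWx3 uint8; fg_hist/bg_hist: normalized histograms of shape (bins, bins, bins)."""
    idx = (patch.astype(np.int32) * bins) // 256           # per-channel bin index
    p_fg = fg_hist[idx[..., 0], idx[..., 1], idx[..., 2]]
    p_bg = bg_hist[idx[..., 0], idx[..., 1], idx[..., 2]]
    return -np.log(p_fg + eps), -np.log(p_bg + eps)         # cost of labeling FG / BG

def smoothness_terms(patch, beta=0.5, lam=10.0):
    """Pairwise costs V_{p,q}: large across similar pixels, small across strong edges."""
    img = patch.astype(np.float32)
    dx = np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=-1)   # horizontal neighbors
    dy = np.sum((img[1:, :] - img[:-1, :]) ** 2, axis=-1)   # vertical neighbors
    return lam * np.exp(-beta * dx), lam * np.exp(-beta * dy)

# The energy E(L) of Eq. (2) would then be minimized by a max-flow/min-cut solver
# (e.g. the Boykov-Kolmogorov algorithm [2]) over a graph with one node per pixel.
```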
Fig. 2. Detailed process for the frame 101. (a) Input Image. (b)/(c) Background/ Foreground data term. (d)/(e)/(f) X/ Y/ Time smoothness term. (g) Tracking result. (h) Trimap. (i) Graph cuts segmentation results. (j) Corresponding silhouette, for the convenience of visualization.
1. Calculation of Data Term and Smoothness Term
After the seeds of foreground and background are labeled on the initial image, we cast votes in Hough space. Then the weights of the foreground and background are updated, and their probability density functions follow Gaussian distributions. Next, the data terms are computed and updated based on the interpolated probability. At first, the probability density functions of foreground and background are calculated from the user's input. The input images are checked to ensure that they are 3-channel images, and the range of pixel values is the RGB color range. Following the usual histogram calculation, each sub-space of the RGB color space (the R-, G- and B-channels) is divided into k equal intervals, each of which is called a bin. Then votes are cast with a kernel function: 1) the voting target is set and voting proceeds; 2) the RGB value is assigned to a bin; 3) weights of the foreground and background are assigned. Afterwards, for each pixel in the image sequence, the values of the data terms are assigned by first getting the bin ID and interpolation position, then the interpolated probability, and finally the posterior. Subsequently, the data terms keep being updated together with the probability density functions of foreground and background. Example results are shown in Fig. 2 (b)(c). The smoothness term contributes to the foreground/background boundary decision based on edges; the connectivity can be 4 or 8. Results are shown in Fig. 2 (d)(e)(f).
2. Trimap Acquisition via Tracking
To initialize trimap tracking, we set the bounding box of the initial foreground region as the initial target in Mean Shift tracking, as displayed in Fig. 3 (g)(h). For each of the following frames, the histograms are obtained as follows:
1) The surrounding region of the bounding box is calculated for comparison: if some features of the target region and its surrounding region are similar, those features should be removed to rule out the background distraction; 2) Calculation and equalization of the target region’s color histogram. After that, the Mean Shift iterator is used to estimate the next location: 1) Try not to change the size of the bounding box (e.g. 100×60). The weights of new coordinates are calculated and then the histogram templates are updated. Next,
the similarity between the candidate model and the previous target model is calculated and compared with the previous value. If the current one is smaller, the candidate location is updated; otherwise, a better starting location is reset. 2) Try to change the scale of the bounding box and continue as in 1). We define an extra parameter (e.g. equal to 0.1) as the step size in scale space. When searching for the location of the object in a new frame, we crop out bounding boxes between one step larger and one step smaller in scale (e.g. height: 90~110; width: 54~66, with a fixed aspect ratio of 10:6). 3) Compare the result without scale change to that with scale change, and adopt the better result to update the similarity. Through the above processing, the tracked location of the bounding box is obtained; Fig. 3 (g) displays the tracking results. Then a rectangular region identical to the tracked bounding box is set as the trimap region (Fig. 3 (h)), which includes foreground seeds, background seeds and an unknown region.
3.2 Tracking via Mean Shift
For all pixels in the target region (in our case, the trimap) in the initial frame, computing the probability of each feature value in the feature space is called the description of the target model; for subsequent frames, computing it in the candidate region where the target may occur is called the description of the candidate model. An Epanechnikov function is selected as the kernel function. A similarity function measures the similarity between the target model in the initial frame and the candidate model in the second frame. Next, the Mean Shift vector, along which the target moves from the initial position, is obtained by maximizing the similarity function. Because of Mean Shift's convergence, as long as the Mean Shift vector is computed iteratively, the final position converges to a good approximation of the target's real position in the second frame. Similarly, we update the position for subsequent frames in this way. In the following, we explain this procedure in more detail.
3.2.1 Description of Target Model
Suppose the center point of the target region is x_0, the n pixels in the region are expressed as {x_i}_{i=1..n}, and the number of feature bins is m. Then, for the target model, the probability density of the feature value u = 1, ..., m is estimated as

q̂_u = C Σ_{i=1}^{n} k(‖(x_i − x_0)/h‖²) δ[b(x_i) − u],   (3)

where k(x) is the profile (contour function) of the kernel. Because of the impact of occlusion or background distraction, the pixels near the center of the target model are more reliable than those outside: k(x) assigns larger weights to the central pixels and smaller weights to pixels away from the center. The role of h in k(x) is to eliminate the influence of targets of different sizes, so the target described by the bounding box is normalized into a unit circle. δ(x) is the Delta function, and the role of δ[b(x_i) − u] is to judge whether the color value of pixel x_i in the target region belongs to the bin with index u: 1 denotes true, and 0 denotes false. C is a normalization constant that makes Σ_{u=1}^{m} q̂_u = 1, so

C = 1 / Σ_{i=1}^{n} k(‖(x_0 − x_i)/h‖²).   (4)
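A small sketch of the kernel-weighted color model of Eqs. (3)-(4), assuming an RGB patch quantized into m = bins³ color bins and the Epanechnikov profile k(x) = 1 − x for x ≤ 1 (the quantization scheme is an assumption for illustration):

```python
import numpy as np

def target_model(patch, bins=8):
    """Kernel-weighted color histogram q_hat of Eqs. (3)-(4).
    patch: HxWx3 uint8 region described by the bounding box."""
    h, w = patch.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # normalize pixel coordinates into a unit circle centred on the region centre
    r2 = (((ys - (h - 1) / 2.0) / (h / 2.0)) ** 2
          + ((xs - (w - 1) / 2.0) / (w / 2.0)) ** 2)
    k = np.maximum(1.0 - r2, 0.0)                   # Epanechnikov profile k(x) = 1 - x
    idx = (patch.astype(np.int32) * bins) // 256    # b(x_i): per-channel bin index
    flat = (idx[..., 0] * bins + idx[..., 1]) * bins + idx[..., 2]
    q = np.bincount(flat.ravel(), weights=k.ravel(), minlength=bins ** 3)
    return q / q.sum()                              # C makes the bins sum to 1
```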
3.2.2 Description of Candidate Model
A candidate region is a region that may contain the moving target in the second frame and in each subsequent one. Its central coordinate y is the center of the kernel function, and the pixels in the region are expressed as {x_i}_{i=1..n_h}. The description of the candidate region is called the candidate model. To describe it, we set the probability density of the feature value u = 1, ..., m as

p̂_u(y) = C_h Σ_{i=1}^{n_h} k(‖(y − x_i)/h‖²) δ[b(x_i) − u],   (5)

where C_h = 1 / Σ_{i=1}^{n_h} k(‖(y − x_i)/h‖²) is a normalization constant.
3.2.3 Similarity Function
The similarity function describes the degree of similarity between the target model and the candidate model; in ideal circumstances, the probability distributions of the two models are exactly the same. Here, we choose the Bhattacharyya coefficient as the similarity function, defined as

ρ̂(y) ≡ ρ(p̂(y), q̂) = Σ_{u=1}^{m} √(p̂_u(y) q̂_u).   (6)
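For two normalized m-bin histograms, the coefficient of Eq. (6) reduces to a one-liner (a minimal NumPy sketch):

```python
import numpy as np

def bhattacharyya(p_hat, q_hat):
    """Similarity of Eq. (6) between candidate and target color models."""
    return float(np.sum(np.sqrt(p_hat * q_hat)))
```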
Its value is between 0 and 1: the greater the value of ρ̂(y), the more similar the two models are. The candidate models of the different candidate regions in the current frame are computed, and the position where ρ̂(y) attains its maximum is the predicted position of the target in the current frame.
3.2.4 Target Location
To obtain the maximum of ρ̂(y), we set the initial position of the target center in the current frame to its position in the previous frame, denoted y_0. Starting from y_0, we search for the target position with the optimal match, while the real target center position is y. We calculate the candidate model and expand Eq. 6 around p̂_u(y_0) by a Taylor expansion, so the Bhattacharyya coefficient is approximated as

ρ(p̂(y), q̂) ≈ (1/2) Σ_{u=1}^{m} √(p̂_u(y_0) q̂_u) + (C_h/2) Σ_{i=1}^{n_h} w_i k(‖(y − x_i)/h‖²),   (7)

where w_i = Σ_{u=1}^{m} √(q̂_u / p̂_u(y_0)) δ[b(x_i) − u], and Σ_{i=1}^{n_h} w_i k(‖(y − x_i)/h‖²) = f_{n,K}.
Note that in Eq. 7 only the second term (i.e. f_{n,K}) changes with y. In fact, f_{n,K} is similar to a kernel density estimate; the only difference is that it includes the extra weights w_i. So, in order to maximize ρ(p̂(y), q̂), f_{n,K} should first be maximized. The Mean Shift vector is defined as the vector from the center y_0 of the candidate region to the target's real position y:

m(y) = y_1 − y_0 = [Σ_{i=1}^{n_h} x_i w_i g(‖(ŷ_0 − x_i)/h‖²)] / [Σ_{i=1}^{n_h} w_i g(‖(ŷ_0 − x_i)/h‖²)] − y_0,   (8)

where g(x) = −k′(x), m(y) is the vector of the target center from y_0 to y, and the Mean Shift vector starts at y_0. The search moves from y_0 towards where the color statistics change most between the two models, which is better than a greedy manner.
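Putting the pieces together, one relocation step following Eq. (8) could be sketched as below; `candidate_model` is assumed to return the kernel-weighted histogram of the patch centred at y_0 (as in Sect. 3.2.2), and since g = −k′ is constant on the Epanechnikov support, the update reduces to a weighted average of pixel coordinates. This is only an illustrative sketch under those assumptions.

```python
import numpy as np

def mean_shift_step(frame, y0, size, q_hat, candidate_model, bins=8):
    """One relocation step of Eq. (8) for a window of given size centred at y0 = (row, col)."""
    h, w = size
    top, left = int(y0[0]) - h // 2, int(y0[1]) - w // 2
    patch = frame[top:top + h, left:left + w]
    p_hat = candidate_model(patch, bins)                          # p_hat_u(y0), Eq. (5)
    idx = (patch.astype(np.int32) * bins) // 256
    flat = (idx[..., 0] * bins + idx[..., 1]) * bins + idx[..., 2]
    w_i = np.sqrt(q_hat[flat] / np.maximum(p_hat[flat], 1e-12))   # weights w_i of Eq. (7)
    ys, xs = np.mgrid[0:h, 0:w]
    r2 = (((ys - (h - 1) / 2.0) / (h / 2.0)) ** 2
          + ((xs - (w - 1) / 2.0) / (w / 2.0)) ** 2)
    g = (r2 <= 1.0).astype(float)        # g = -k' is constant on the Epanechnikov support
    wg = w_i * g
    y1 = np.array([np.average(ys + top, weights=wg),
                   np.average(xs + left, weights=wg)])            # ratio of sums in Eq. (8)
    return y1, y1 - np.asarray(y0, dtype=float)                   # new centre and shift m(y)
```

Iterating this step until the shift falls below a small threshold yields the tracked position for the frame.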
4 Experiments
We test our method on sequences from our gait dataset (see Fig. 3 (a)-(c), Fig. 4) and on challenging sequences from the web (see Fig. 5). For comparison (see Fig. 3 (b)-(f)), we modify GrabCut [22] with a similar label propagation strategy to process videos, and also test a variant of our method without label propagation (i.e. constant labels). We choose Mean Shift for tracking the trimap. It is very efficient and also robust in the case of our data, although it can be replaced by other methods. As shown in Fig. 3 (g)-(l), we compare it with state-of-the-art methods (OnlineBoost, SemiBoost, MILTrack, FragTrack [23]) as well as a particle filter and template matching. FragTrack, MILTrack and OnlineBoost are very robust, and Mean Shift is comparable with them. For evaluation, Tab. 2 presents the average F-measure score on all the sequences we tested, defined as the harmonic mean of Precision and Recall. As a reminder, Prec = TP / (TP + FP) and Rec = TP / (TP + FN), where TP = Ground Truth ∩ Segment, FP = Segment − TP and FN = Ground Truth − TP. In short, the larger the F-measure, the better the overall performance. For tracking, evaluation is itself a challenge: quantitative comparison (e.g. plotting the center location error versus frame number) sometimes fails to correctly capture tracker performance. Thus, here we only present a qualitative comparison (see Fig. 3 (a),(g)-(l)) and refer readers to the results on the full sequences released at http://www.jdl.ac.cn/user/xxiang/fgseg/ .
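For reference, the scores of Tab. 2 can be computed from binary masks as in the straightforward sketch below.

```python
import numpy as np

def precision_recall_f(segment, ground_truth):
    """Precision, Recall and F-measure from boolean masks (True = foreground)."""
    tp = np.logical_and(segment, ground_truth).sum()      # TP = GT ∩ Segment
    fp = np.logical_and(segment, ~ground_truth).sum()     # FP = Segment − TP
    fn = np.logical_and(~segment, ground_truth).sum()     # FN = GT − TP
    prec = tp / float(tp + fp) if tp + fp else 0.0
    rec = tp / float(tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```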
Fig. 3. Results of comparative experiments. (a)-(c): Our method’s results. (b)-(f): comparison of segmentation methods. (a),(g)-(l): comparison of tracking methods. Frame: 1,41,81,121,161. (a) Our tracking results (via Mean Shift, with color histogram). (b) Our segmentation results (via Graph cuts). (c) Silhouettes corresponding with (b). (d) Results of our method without label propagation. (e) Silhouettes corresponding with (d). (f) GrabCut’s results. (g) OnlineBoost’s tracking results (with Haar-like, HOG & LBP, same for (h)). (h) SemiBoost’s results (no detection result in frame 41,81,121,161). (i) MILTrack’s results (with Haar-like feature). (j) FragTrack’s results (part/fragment based representation). (k) Particle filter tracker’s results (yellow: 100 samples; red: predicted position of the target). (l) Template matching.
Fig. 4. Results on 2 of our sequences: Segments (contours are labeled in red) and silhouettes
Fig. 5. Results on 2 sequences from the web. Mean Shift is robust due to strong fg-bg contrast.

Table 2. Evaluation for our method, its variant and GrabCut. Prec and Rec have been averaged based on the raw result of each sequence. Ground Truth consists of manually labeled masks.

| Method               | Precision | Recall | F-score |
|----------------------|-----------|--------|---------|
| Our method           | 0.82      | 0.72   | 0.77    |
| Our method's variant | 0.68      | 0.51   | 0.58    |
| GrabCut              | 0.49      | 0.50   | 0.49    |
5 Conclusion
In this paper, we have presented a method to segment the foreground in dynamic scenes. The use of a bounding box makes it feasible to combine tracking, label propagation and segmentation. Future work will include, but not be limited to: 1) simultaneous tracking and segmentation (e.g. modeling a graph containing both appearance and motion cues); 2) label propagation via learning; 3) incorporating prior knowledge (e.g. shape, saliency).
Acknowledgements. The author would like to sincerely thank Yasushi Yagi and Yasushi Makihara for valuable guidance, Yasuhiro Mukaigawa, Junqiu Wang and Chunsheng Hua for helpful discussion, the FrontierLab@OsakaU program for research arrangements, JASSO for the scholarship, and Xilin Chen for his support.
References 1. Lee, L., Grimson, W.E.L.: Gait Analysis for Recognition and Classification. In: Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, pp. 148–155 (2002) 2. Boykov, Y., Kolmogorov, V.: An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision. IEEE Trans. PAMI 26(9) (2004) 3. Malcolm, J., Rathi, Y., Tannenbaum, A.: Multi-Object Tracking Through Clutter Using Graph Cuts. In: Proc. IEEE ICCV, pp. 1–5 (2007) 4. Piccardi, M.: Background Subtraction Techniques: A Review. In: Proc. IEEE Int. Conf. Systems, Man and Cybernetics, vol. 4, pp. 3099–3104 (2004) 5. Sun, J., Zhang, W., Tang, X., Shum, H.Y.: Background Cut. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part II. LNCS, vol. 3952, pp. 628–641. Springer, Heidelberg (2006) 6. Bray, M., Kohli, P., Torr, P.: Posecut: Simultaneous Segmentation and 3D Pose Estimation of Humans using Dynamic Graph-Cuts. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part II. LNCS, vol. 3952, pp. 642–655. Springer, Heidelberg (2006) 7. Criminisi, A., Cross, G., Blake, A., Kolmogorov, V.: Bilayer Segmentation of Live Video. In: Proc. IEEE CVPR, pp. 53–60 (2006) 8. Juan, O., Boykov, Y.: Active Graph Cuts. In: Proc. IEEE CVPR, vol. 1, pp. 1023–1029 (2006) 9. Li, Y., Sun, J., Shum, H.Y.: Video Object Cut and Paste. Proc. ACM SIGGRAPH 2005, ACM Trans. Graphics 24(3), 595–600 (2005) 10. Zhong, F., Qin, X., Peng, Q.: Transductive Segmentation of Live Video with NonStationary Background. In: Proc. IEEE CVPR, pp. 2189–2196 (2010) 11. Niebles, J., Han, B., Ferencz, A., Fei-Fei, L.: Extracting Moving People from Internet Videos. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 527–540. Springer, Heidelberg (2008) 12. Niebles, J., Han, B., Fei-Fei, L.: Efficient Extraction of Human Motion Volumes by Tracking. In: Proc. IEEE CVPR, pp. 655–662 (2010) 13. Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient Hierarchical Graph-Based Video Segmentation. In: Proc. IEEE CVPR, pp. 2141–2148 (2010) 14. Bai, X., Wang, J., Sapiro, G.: Dynamic Color Flow: A Motion-Adaptive Color Model for Object Segmentation in Video. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 617–630. Springer, Heidelberg (2010) 15. Bugeau, A., Perez, P.: Detection and Segmentation of Moving Objects in Highly Dynamic Scenes. In: Proc. IEEE CVPR (2007) 16. Ren, X., Malik, J.: Tracking as Repeated Figure/Ground Segmentation. In: CVPR (2007) 17. Grabner, H., Bischof, H.: On-line Boosting and Vision. In: Proc. CVPR, pp. 260–267 (2006) 18. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised On-Line Boosting for Robust Tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008) 19. Babenko, B., Yang, M.H., Belongie, S.: Visual Tracking with Online Multiple Instance Learning. In: Proc. IEEE CVPR, pp. 983–990 (2009) 20. Comaniciu, D., Ramesh, V., Meer, T.: Real-Time Tracking of Non-Rigid Objects Using Mean Shift. In: Proc. IEEE CVPR, vol. 2, pp. 142–149 (2000) 21. Boykov, Y., Jolly, M.: Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images. In: Proc. IEEE ICCV, vol. 1, pp. 105–112 (2001) 22. Rother, C., Kolmogorov, V., Blake, A.: “GrabCut”: Interactive Foreground Extraction using Iterated Graph Cuts. Proc. ACM SIGGRAPH 2004, ToG 23(3), 309–314 (2004) 23. 
Adam, A., Rivlin, E., Shimshoni, I.: Robust Fragments-based Tracking using the Integral Histogram. In: Proc. IEEE CVPR, vol. 1, pp. 798–805 (2006) 24. Badrinarayanan, V., Galasso, F., Cipolla, R.: Label Propagation in Video Sequences. In: Proc. IEEE CVPR, pp. 3265–3272 (2010)
From Saliency to Eye Gaze: Embodied Visual Selection for a Pan-Tilt-Based Robotic Head
Matei Mancas¹, Fiora Pirri², and Matia Pizzoli²
¹ University of Mons, Mons, Belgium
² Sapienza Università di Roma, Rome, Italy
Abstract. This paper introduces a model of gaze behavior suitable for robotic active vision. Built upon a saliency map taking into account motion saliency, the presented model estimates the dynamics of different eye movements, allowing to switch from fixational movements, to saccades and to smooth pursuit. We investigate the effect of the embodiment of attentive visual selection in a pan-tilt camera system. The constrained physical system is unable to follow the important fluctuations characterizing the maxima of a saliency map and a strategy is required to dynamically select what is worth attending and the behavior, fixation or target pursuing, to adopt. The main contributions of this work are a novel approach toward real time, motion-based saliency computation in video sequences, a dynamic model for gaze prediction from the saliency map, and the embodiment of the modeled dynamics to control active visual sensing.
1
Introduction
G. Bebis et al. (Eds.): ISVC 2011, Part I, LNCS 6938, pp. 135–146, 2011. © Springer-Verlag Berlin Heidelberg 2011
The question of determining where to look in front of a scene is among the most relevant when designing vision architectures. Real sensors and actuators are characterized by physical limitations, constraining the field of view and motion capabilities. Moreover, computational power should be preserved to focus on crucial aspects of the task at hand. Computational visual attention [1,2,3] is thus emerging as an appealing component for embodied vision architectures. To support active vision, control models of eye movements have been designed to compute the optimal sequence of visual sensor movements for achieving a given task [4], usually modeled as the problem of maximizing a certain measure of information or reward. In contrast, saliency models [5,6] provide a biologically consistent prediction of eye movements, in the sense of the frequency with which human beings attend regions in the observed scene. The past 25 years of investigations on eye movements have brought a more thorough understanding of how attention works with the oculomotor system to control sensory data collection and extract the interesting information from visually rich environments [7]. Eye movements are explained by the need to keep the interesting target in the fovea. Similarly, for a robotic system, many active vision problems take advantage of related capabilities: even for non-foveated cameras, tracking by an active vision system requires the target to be in the centre of the field of view in
order to minimize the chances of losing it or to extract stereo information; many robot platforms are endowed with omni-directional vision, coupled with active Pan-Tilt-Zoom or RGBD cameras, which resembles the human subdivision into peripheral and foveated vision. The present work bridges the gap between control and saliency models, as it combines a motion-based saliency map with a model of eye movements, based on a non-linear Bayesian state-space process, in order to predict the focus of attention accordingly. A motion-based saliency model provides us with a dynamic, near real-time saliency map. According to this, the next attended location and gaze behavior are selected and used to control the active vision system, thus implementing an embodied visual selection mechanism. The remainder of the paper is organized as follows: Section 2 describes the computation of the motion-based saliency model; the dynamic prediction of gaze movements is addressed in Section 3, while Section 4 reports the analysis of the experimental results. Finally, in Section 5, the conclusions are drawn.
2
Motion-Based Saliency Model
This section describes the main aspects of the motion-based saliency model, which ensures real-time rendition of salient motion and features. The algorithm is based on three main steps: motion feature extraction, spatio-temporal filtering, and rare motion extraction. The resulting rarity pattern, although computed in an instantaneous context within the same frame, provides information on a short time history induced by the spatio-temporal filtering. As a first step, features are extracted from the video frames. For the purpose of modeling eye motion dynamics, speed and motion direction were computed. The method can be generalized to any other dynamic or static physical feature such as acceleration, rotational motion, color mean, or color variance. The motion vector is computed using Farneback's algorithm for optical flow [8], which is quite fast compared to other techniques. The frame is divided into cells and features are extracted from non-overlapping cells for speed. The chosen cell size is 3 or 5 pixels wide in order to take small motions into account. The features are then discretized into 4 directions (north, south, west, east) and 5 speeds (very slow, slow, mean, fast, very fast). A spatio-temporal low-pass filter is designed to cope with the discretized feature channels, having 4 directions and 5 speeds. The filter first separates the space and time dimensions: frames are first spatially low-pass filtered; then a weighted sum is carried out over the time dimension by using a loop and a multiplication factor 0 ≤ β < 1. So, given a feature channel F, the filtered value in position (i, j) at time t is given by:

F̂(i, j, t) = α Σ_{n=1}^{N} β^n Σ_{h=−m/2}^{m/2} Σ_{k=−m/2}^{m/2} A(h, k) F(i − h, j − k, t − n),   (1)
where A is an m × m Gaussian smoothing kernel and α a normalization term. This process will tend to provide lower weight to those frames entering the
loop several times (the older ones) because of the β n in (1) which decreases as the loop iteration n increases. Our approach only takes into account frames from the past (not from the future). This approximation of a 3D convolution provides increasing spatial filtering through iterations and it is suitable for on line processing. The neighborhood of the filtering is obtained by changing the size of the spatial kernel and by modifying the β parameter for the temporal part. If β is closer to 0, the weight applied to the temporal mean will decrease very fast, so the temporal neighborhood will be reduced, while a β closer to 1 will let the temporal dimension be larger. Filtering is implemented at two different scales using m ∈ {3, 9} and β ∈ {0.9, 0.8}. After the filtering of each of the 9 feature channels (4 directions, 5 speeds), a histogram with 5 bins is computed for each resulting image and the selfinformation I(bi ) of the pixels for a given bin bi is computed as I(bi ) = − log(H(bi )/||B||).
(2)
Here H(bi ) is the value of the histogram H at the bin bi , indicating the frequency counts of a video volume, resulting from the 3D low-pass filtering, within the frame; B is the cardinality of the frame, namely, the size of the frame in pixels. H(bi )/B is simply the occurrence probability of a pixel of bin bi . Selfinformation is a pixel saliency index. The matrices containing self-information, thus the saliency of the pixels at the two scales, one for each different 3D filters, are added. Then self-information is summed at the different scales. Once a saliency map is computed for each feature channel, a maximum operator is applied to gather the 4 directions into a single saliency map and the 5 speeds into a second saliency map. Rare motion is salient. The two final maps specify the rarity of the statistics of a given video volume at two different scales for a given feature. The two final conspicuity maps represent the amount of bottomup attention due to speed and motion direction features. Figure 1 illustrates results on groups of moving objects with complex backgrounds. Further details can be found in [9]. To validate the predictiveness of the motion-based saliency model, we chose not to rely on any existing corpus of analyzed data and we collected evidence from gaze tracking experiments making use of a wearable device during natural observation tasks. The collected sequences of eye movements refer to the observation of a bus stop (Figure 2). The complete dataset, comprising 20000 frames collected at 30 fps and labeled with the corresponding Point of Regard from gaze
Fig. 1. Annotated frames and corresponding saliency maps from the UCSD dataset [6]. The contribution of the speed feature to saliency is indicated by the red color, while cyan indicates directions.
Fig. 2. Comparison of the dynamic saliency model (right) described in Section 2 against the static model (left) in [3] in terms of gaze prediction during a natural observation task, in the form of ROC analysis; The ground truth provided by gaze tracking is also shown
tracking, was divided into two sets, one used for deriving the dynamic model and one used for validation. The static saliency model [3] has been compared to the described one, which is dynamic as it takes motion cues into account in the computation of the saliency map. The ground truth is represented by the collected sequence of points of regard. Figure 2 summarizes the result in the form of a ROC analysis. Gazed locations corresponding to maxima in the saliency map are considered good predictions; those maxima that are not attended by gaze are false positives. Not surprisingly, the dynamic saliency model outperforms the prediction capability of the static model in the case of dynamic, natural stimuli. Still, it is important to note the high false positive rate, due to the substantial number of unattended maxima in the saliency map. The reason behind this high number of maxima resides in the integration of appearance and motion cues in the saliency model: different features compete to gain the focus of attention at each time instant. As a consequence, a way is needed to select the next attended region among those that are emerging as salient. In addition, a further choice is required to select the gaze behavior. The dynamic model that controls gazing is responsible for deciding what to look at, that is, which region among the saliency maxima, and how, that is, whether to perform fixations, smooth pursuit or saccades.
3
From Saliency Map to Gaze Prediction
Given a dynamic saliency map, the local maxima at each frame (at each time step t), give information about the possible gaze localization. The problem of gaze prediction is the following: the gaze location at time step t is observable only via the saliency map, estimating all points of interest for the gaze at t. Therefore a usually adopted solution for gaze location prediction is to define a gaze scan path as a path through the local maxima of the saliency map. However, within dynamic environments, the dynamic saliency maps register motion of different elements in the scene, like people, cars, flickering of several objects, clouds, and mainly ego motion; therefore several local maxima are returned by
the saliency map, see Figure 2 and the experiments made available by Itti at https://crcns.org/data-sets/eye. This implies that the likelihood of any of the selected scan-paths would be approximatively the same; furthermore the number of possible paths through the local maxima is exponential in the length of the path. Despite the fact that the saliency map takes care of motion and motion innovation, and accounts for the different gaze behaviors, not being a process itself, it cannot model perspicuously the dynamic of horizontal eye movements such as smooth pursuit, saccades and fixations, and the way these movements are constrained by the embodiment. Therefore it is necessary to reproduce the processes underlying the saliency maps and use the saliency maps as a collection of observations, as we shall specify in the following. A saccade is a fast eye movement that can reach peak velocities of 800◦/s. A saccadic eye motion has no feedback, meaning that the vision system is turned-off while the movement, and correction, via a further saccade, is achieved only when the target is reached. A saccade is a gaze motion from one target to another that can range from less than a degree to 45◦ . Because saccades follow an exponential function, peak velocity is reached early, and so far it cannot be fully simulated by a controlled mechanical motion, but under severe constraints. Smooth pursuit is a slow movement of the eye exhibited during tracking. Like saccade is a voluntary eye movement, but while saccades are elicited under different stimuli, smooth pursuit is elicited only under the stimulus of a moving target. Smooth pursuit reaches peak velocities of about 60◦ /s in order to keep the target position in the center of the fovea. As opposite to saccades, during smooth pursuit the vision is clear as it is foveated. Indeed, when the target is in the fovea, then the retinal image is no more in motion, this fact induces the ability of predicting the motion and thus to return to the target using the memorized stimulus. Finally, fixational eye movements are the result of micro saccades, ocular drifts and ocular microtremors related to the prevention of perceptual fading. Several studies have simulated smooth pursuit, likewise the sudden change induced by saccadic generation (see for example [10]), and microsaccades [11], but always using as observations the current eye position via eye tracking, that is, on the basis of motion identification via the eye. Other studies have modeled the mechanical structure of eye muscles (see for example [12]) to predict eye motion. Despite the interest of these studies for a single step eye motion prediction, these cannot be used to predict gaze localization for embodied systems. To estimate the gaze scan path we have to consider that the target to be tracked is precisely the gaze and not what the gaze is observing. The gaze, as target, is effectively a point, whose scan path is its projection on the 2D image. The process to be reproduced is based on the following information, at time step t: 1. The saliency map. 2. The updated eye position at time step t − 1. 3. The underlying gaze processes induced by the the horizontal eye movements. The upshot of the above discussion is: the gaze is observed only through the saliency map, and its scan path is induced by at least three types of motion,
namely saccade, smooth pursuit and fixation. Accordingly, there are at least three hidden real stochastic processes for the gaze, and each of these processes needs to be represented via a precise motion model. We note also that, at time step t, each local maximum of the saliency map is, in principle, the current state of some of the k^t possible scan paths started at time step t_0, with k the number of local maxima. An important technique for managing multiple interacting dynamic models relies on Markovian switching systems, also known in general as the Interacting Multiple Model (IMM) [13]. Markovian switching systems model a process that changes in time by providing different models for each of the underlying processes. The state of each process is estimated under a bank of r filters. Each filter serves to estimate the state of the process, and a mixing distribution establishes which process is effectively active at time t. For the gaze scan path, the state of each process, a 7-dimensional random variable, is estimated by a non-linear Bayes filter using the unscented Kalman transform; the transition between the three models has been estimated via a non-parametric mixture model, as described in Section 4. For each process underlying the gaze scan-path, the estimation of the state x_t involves the following two equations:

x_t = f(x_{t−1}, u_t),    y_{t−1} = h(x_{t−1}, w_t)   (3)
The above estimation requires a transformation problem which can be stated as follows: x ∼ N(μ_x, Σ_xx), and the statistics y are related to x by a non-linear function g, with x ∈ R^n, y ∈ R^m, g : R^n → R^m. The unscented Kalman filter [14] is an optimal filter approximating the filtering distribution of the state x, based on the unscented transform of the variable y. The idea of the unscented transform is to approximate the filtering distribution by approximating its Gaussian distribution [15]. Namely, a set of σ-points is chosen so that their sample mean and covariance are the estimated mean and covariance of the state x; then the non-linear function g is applied to the σ-points to yield the mean and variance of the sought distribution. The unscented transform is as follows. Let x̄ and Σ_xx be the mean and covariance of the state x of dimension n, approximated by 2n + 1 σ-points, whose matrix X is formed by 2n + 1 σ-vectors as follows:

X = [x̄ ⋯ x̄] + √(n + λ) [0   √Σ_xx   −√Σ_xx],   (4)

where λ = α²(n + k) − n is a scaling parameter, α determines the spread of the σ-points around x̄, k is a scaling parameter, β is used to incorporate prior knowledge of x, and the i-th column of √((n + λ)Σ_xx) is taken from the matrix square root (see [16]). The σ-vectors are associated with weights W_i^(m) and W_i^(c) defined as follows:

W_0^(m) = λ/(n + λ),   W_0^(c) = λ/(n + λ) + (1 − α² + β),   W_i^(m) = W_i^(c) = 1/(2(n + λ)).   (5)
The transformed points Y_i are obtained by applying the non-linear function g to the sigma points: Y_i = g(X_i), i = 0, ..., 2n, and the new mean and covariance are given as follows:

ȳ = Σ_{i=0}^{2n} W_i^(m) Y_i,   Σ_yy = Σ_{i=0}^{2n} W_i^(c) (Y_i − ȳ)(Y_i − ȳ)ᵀ   (6)
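A compact sketch of the unscented transform of Eqs. (4)-(6) is given below; the scaling parameters alpha, beta and kappa (the paper's k) are chosen here purely for illustration.

```python
import numpy as np

def unscented_transform(g, x_mean, P, alpha=1e-3, beta=2.0, kappa=0.0):
    """Propagate mean/covariance of x ~ N(x_mean, P) through a nonlinearity g."""
    n = x_mean.size
    lam = alpha ** 2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P)                 # matrix square root
    X = np.column_stack([x_mean] + [x_mean + S[:, i] for i in range(n)]
                        + [x_mean - S[:, i] for i in range(n)])       # Eq. (4)
    Wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    Wc = Wm.copy()
    Wm[0] = lam / (n + lam)                               # Eq. (5)
    Wc[0] = lam / (n + lam) + (1 - alpha ** 2 + beta)
    Y = np.column_stack([g(X[:, i]) for i in range(2 * n + 1)])
    y_mean = Y @ Wm                                       # Eq. (6), mean
    D = Y - y_mean[:, None]
    P_yy = (Wc * D) @ D.T                                 # Eq. (6), covariance
    return y_mean, P_yy
```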
Then the transformation process yielded by the filter consists of the following two steps of prediction and update:
1. Prediction: (a) compute the matrix of σ-points X_{t−1}; (b) propagate the σ-points using the dynamic model f of the specific gaze process, obtaining X̂_t; (c) compute the predicted mean and covariance of the state.
2. Update: (a) compute the matrix of σ-points X_t; (b) propagate the σ-points using the observation model h of the saliency map S_t, obtaining Ŷ_t; (c) compute the predicted mean and covariance of the observations and the cross-covariance of state and observations; (d) compute the filter gain, and the state mean and covariance conditional on the observations.
For the gaze scan path estimation, the state is a 7-dimensional variable x specified by the location of the gaze in the image, the velocity, the acceleration and the saliency value of the point on the saliency map:

x = (x, y, ẋ, ẏ, ẍ, ÿ, s_map)   (7)
The dynamic model specified by f is defined for each process as follows; for the characterization of f we consider position, velocity and acceleration along a single spatial dimension. For the saccade, the dynamic model is the Singer model [17]. In this model the target acceleration is correlated in time, with correlation given by σ² exp(−α|τ|), α > 0, where σ² is the variance of the saccade acceleration and α is the reciprocal of the saccadic time constant, thus depending on the milliseconds needed to accomplish a saccade by the embodied system. Therefore it can be determined by (α²_max/3)(1 + 3p_max − p_0), where α_max is the maximum rate of acceleration, attained with probability p_max, and p_0 is the probability of no acceleration. The discrete-time representation of the continuous model, see [17], is

x_t = Ψ(T, α) x_{t−1} + u_{t−1},   (8)

where T = 1/30 is the sampling rate, given by the framerate, u_t is the inhomogeneous driving input whose variance is derived in [17], and Ψ(T, α) is, under the saccadic process, the state transition matrix:

          | 1   T   (1/α²)(−1 + αT + e^{−αT}) |
Ψ(T, α) = | 0   1   (1/α)(1 − e^{−αT})        |   (9)
          | 0   0   e^{−αT}                   |

The dynamic model for the fixational motions (micro-saccades, ocular drifts and ocular microtremors) is taken to be a Gaussian random walk; therefore

x_t = φ x_{t−1} + σ_u u_{t−1} ∼ N(φ x_{t−1}, σ_u²)   (10)
Finally, the smooth pursuit is modeled by a Wiener-process velocity model, whose discrete-time representation is:

      | 1   ΔT   0 |
x_t = | 0   1    0 | x_{t−1} + u_{t−1}   (11)
      | 0   0    1 |

The measurement model returns only the updated location of the gaze, using the saliency map. To simplify the measurement, only the first n local maxima are used, ordered according to the saliency value; further, among these the local maximum having both the highest saliency value and the minimal speed and bearing from the predicted gaze location is chosen. Clearly this characterization of the measurement is adapted to the particular process. More specifically, let S^P = (x_1, ..., x_n)ᵀ be the vector of positions of the n selected local maxima, and let S^S = (s_1, ..., s_n)ᵀ be the vector of the saliency values associated with the chosen local maxima. Furthermore, let us denote by V_P = (Δx_1, ..., Δx_n)ᵀ the displacement of the local maxima in S^P, measured as the flight distance of the gaze to that position, and let Θ = (θ_1, ..., θ_n)ᵀ be the bearing of the local maxima given the predicted gaze location. For the saccade, the measurement is defined as follows:

y_t = C(δ V_P)_t + H x_k + N(0, q_t),   (12)

where δ and C are matrices of dimension n × n, whose elements δ_ij and c_ij are defined as:

δ_ij = δ_{i+1,j+1} = 1 if s_j = argmax_j(S^S), and 0 otherwise;
c_ij = c_{i+1,j+1} = 1 if v_j = argmin_k(V(Δx_t)) and θ_k = argmin_k(Θ(v_j)), and 0 otherwise.   (13)
Fig. 3. Left: the scan-path estimated with the IMM-UKF in cyan, the Ground Truth scan path in light green and in blue the scan path through the local maxima. Local maxima are the blue-yellow dots. Right: the Gaussian Mixture model and the continuous observation HMM used to estimate the transition kernel and the parameters of the IMM processes.
Here Δx_t is the n × 1 vector of the displacements, selected by δ, to the local maxima, v_j is the vector of displacements selected by c_{i,j}, and

H = | 1  0  0  0  0  0  0 |   (14)
    | 0  1  0  0  0  0  0 |

The model for the fixation simply maintains the state unchanged, and the model for the smooth pursuit inverts the ordering specified for the saccade: it first chooses the closest among the selected local maxima, then the angle and finally the saliency value. It is easy to see that the three models obey a parsimonious principle in optimizing the gaze's use of resources. Given the above models of the three processes, different transition kernels have been estimated, and are reported in Section 4. The IMM estimate is finally computed by estimating the mixing probabilities μ_t^{i|j} = π_ij μ_{t−1}^i / (Σ_i π_ij μ_{t−1}^i) at each estimation step; with prediction and update performed as previously described, the mean, covariance and likelihood of each process are estimated, and finally the combined estimate is computed as:

x̄_t = Σ_{i=1}^{k} μ_t^i x_t^i,   Σ_xx = Σ_{i=1}^{k} μ_t^i [Σ_t^i + (x^i − x̄)(x^i − x̄)ᵀ]   (15)

In Section 4 we discuss the experiments and the evaluation of the model with respect to gaze scan-paths recorded live in outdoor and dynamic scenes.
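The model-probability update and the combined estimate of Eq. (15) can be sketched as follows; `means`, `covs` and `likelihoods` are assumed to come from the bank of per-model UKFs and `T` is the estimated transition kernel (this is a generic IMM combination step, not the authors' exact code).

```python
import numpy as np

def imm_combine(means, covs, mu_prev, likelihoods, T):
    """Update model probabilities and form the combined IMM estimate (Eq. (15))."""
    c = T.T @ mu_prev                         # predicted model probabilities
    mu = likelihoods * c
    mu = mu / mu.sum()                        # posterior model probabilities mu_t^i
    x = sum(mu[i] * means[i] for i in range(len(means)))
    P = sum(mu[i] * (covs[i] + np.outer(means[i] - x, means[i] - x))
            for i in range(len(means)))
    return x, P, mu
```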
4
Experiments
Experiments have been performed at the different layers of the system. Experiments and comparisons with other approaches concerning the saliency map have been illustrated in Section 2. Here we are mainly concerned with the estimation of the parameters of the IMM via a continuous observations HMM, with mixtures of Gaussians observations, with the mean squared error of the IMM computed with respect to the ground truth, obtained by a wearable gaze tracking device, and finally with the results obtained with a pan-tilt. As test data we have collected the bus stop dataset (Figure 2), an outdoor scan-path comprising over 20 thousand gaze tracking frames, with the task of counting the people moving in the area. We have used 4 sets of data, using as features the velocity magnitude and the acceleration to estimate (via the Expectation Maximization algorithm) a continuous observation HMM with three states each accounting for observations sampled from a Gaussian mixture with three components. The estimated model with the reduced set of features is illustrated in Figures 3. Having set the dimension of the space to three models the best transition kernel, namely the one maximizing the posterior probability of the scan-path, turned out to be the following, likewise the means for the three processes: ⎡
         Sacc  Fix   SP
    Sacc | 0.10  0.45  0.45 |          x_sacc = (x, y, 1.63,  9.43,  1.46, 17.53, 0.23)
T = Fix  | 0.48  0.48  0.04 |   and    x_fix  = (x, y, 0.01, −0.09, −0.23, −0.13, 0.28)   (16)
    SP   | 0.57  0.05  0.38 |          x_SP   = (x, y, 0.04, −0.79,  1.18, −1.12, 2.3)
Fig. 4. The scan-paths of ten local maxima chosen deterministically at each step, with location determined only with respect to the x-coordinate, and compared with the IMM-UKF estimated one (in cyan) and the ground truth in light green
On the left the transition kernel, and on the right the estimated means of the three processes, for the gaze location we have left x, y, since it was set to the center of the image. The mean squared error has been computed both for the scan-path obtained by selecting the local maximum at each time step t and for the scan-path estimated via the IMM-UKF, with respect to the ground truth. The results are illustrated in Figure 5, while in Figure 4 the gaze location only with respect to the x-coordinate has been taken to illustrate that the local maxima can be hardly chosen, as they are exponential in the length of the scan-path and also only the best one can approximate the scan-path estimated by the switching processes IMM-UKF. Finally to effectively understand, in terms of observed objects, whether the estimated scan-path would be able to attend, at least partially, the elements in the scene effectively observed by the subject (from which the Ground Truth
Fig. 5. Mean squared error, between the estimated scan path and the ground truth, and between a randomly selected scan-path across all the local maxima
has been extracted), we have computed the number of pixels that have been in common between a region 10 × 10 around the gaze location, and the point estimated by the IMM-UKF. The outcome is illustrated in Figure 6.
Fig. 6. The number of pixel that fall into a region 10 × 10 centered in the gaze location as effectively obtained via the ground truth
Experiments for the embodied system were performed using a Directed Perception PTU D46 17 pan-tilt. The maximum angle which can be mechanically achieved is 9◦ , meaning that over this amplitude the delay would be unacceptable for a saccade. The pan-tilt is directly controlled by the position elicited by the Bayesian process.
5
Conclusion
We presented a dynamic system to control active vision based on computational attention. The underlying dynamic saliency model integrates motion cues in the computation of the saliency map and its prediction performance has been experimentally evaluated against static approaches using gaze tracking sequences as ground truth. An IMM-UKF takes care of selecting the next attended location among the multiple maxima which are generated by the saliency map, and decides which gaze behavior to implement. The model is trained on evidence collected from humans by means of gaze tracking experiments and provides smooth scan-paths among saliency maxima which are likely to be attended by a human observer, also exhibiting frequencies of transition between gaze behaviors that are consistent to the biological counterpart. The resultant attentive control is suitable for robot active vision and demonstrates that considerations related to the embodiment naturally lead to a gaze prediction mechanism which seems better correlated to real gaze than the classical use of the saliency map maxima.
References 1. Treisman, A., Gelade, G.: A feature-integration theory of attention. Cognitive Psychology 12, 97–136 (1980) 2. Tsotsos, J., Culhane, S., Wai, W., Lai, Y., Davis, N., Nuflo, F.: Modeling visual attention via selective tuning. Artifical Intelligence 78, 507–547 (1995) 3. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI 20, 1254–1259 (1998) 4. Butko, N., Movellan, J.: Infomax control of eye movements. IEEE Transactions on Autonomous Mental Development 2, 91–107 (2010) 5. Koch, C., Ullman, S.: Shifts in selective visual-attention: towards the underlying neural circuitry. Hum. Neurobiol. 4, 219–227 (1985) 6. Mahadevan, V., Vasconcelos, N.: Spatiotemporal saliency in dynamic scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 171–177 (2010) 7. Kowler, E.: Eye movements: The past 25years. Vision Research, 1–27 (2011) 8. Farneb¨ ack, G.: Two-frame motion estimation based on polynomial expansion. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 363–370. Springer, Heidelberg (2003) 9. Mancas, M., Riche, N., Leroy, J., Gosselin, B.: Abnormal motion selection in crowds using bottom-up saliency. In: Proc. of the ICIP (2011) 10. Sauter, D., Martin, B., Di Renzo, N., Vomscheid, C.: Analysis of eye tracking movements using innovations generated by a kalman filter. Medical and Biological Engineering and Comp. (1991) 11. Engbert, R., Kliegl, R.: Microsaccades uncover the orientation of covert attention. Vision Research 43, 1035–1045 (2003) 12. Komogortsev, O., Khan, J.I.: Eye movement prediction by kalman filter with integrated linear horizontal oculomotor plant mechanical model. In: ETRA, pp. 229– 236 (2008) 13. Blom, H., Bar-Shalom, Y.: The interactive multiple model algorithm for system with markovian switching coefficients. IEEE Trans. on Automatic Control 33, 780– 783 (1988) 14. Julier, S.J., Jeffrey, Uhlmann, K.: Unscented filtering and nonlinear estimation. Proceedings of the IEEE (2004) 15. Julier, S.J., Uhlmann, J.K.: A new extension of the kalman filter to nonlinear systems. In: Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, pp. 182–193 (1997) 16. Wan, E., van der Merwe, R.: The unscented kalman filter for nonlinear estimation. In: Proc. of the Symposium on Adaptive Systems for Signal Processing, Communication and Control (2000) 17. Singer, R.A.: Estimating optimal tracking filter performance for manned maneuvering targets. IEEE Transactions on Aerospace and Electrictronic Systems (1970)
Adaptive Two-Step Adjustable Partial Distortion Search Algorithm for Motion Estimation Yonghoon Kim, Dokyung Lee, and Jechang Jeong Department of Electronics and Computer Engineering, Hanyang University
[email protected],
[email protected],
[email protected]
Abstract. Motion estimation is widely used in video coding schemes because it enables the transmission and storage of video signals at a lower bit rate. The full search (FS) algorithm is the optimal method for motion estimation, but it suffers from high computational complexity. To reduce the complexity, various methods have been proposed. Recently, the two-step edge-based partial distortion search (TS-EPDS) algorithm was introduced; it is about 100 times faster than FS without PSNR degradation. In this paper, we propose an adaptive two-step adjustable partial distortion search algorithm that is 200 times faster than FS with a negligible PSNR decrease. The proposed algorithm is suitable for real-time implementation of high-quality digital video applications.
1
Introduction
Motion estimation is an essential process in video coding schemes, because it enables the transmission and storage of video signals at a lower bit rate. Block matching is the most popular technique, owing to its high efficiency in reducing the temporal redundancy between successive frames. This technique divides each frame into fixed N×N macroblocks, and the motion vector for each current-frame block is estimated by finding the minimum Sum of Absolute Differences (SAD) with respect to the previous frame. The full search (FS) algorithm is the fundamental block matching method, which examines every candidate block in the search window. It is normally considered optimal, but it suffers from a heavy complexity burden: in a video coding system, motion estimation takes 80% of the total complexity. In order to reduce this burden, various algorithms have been proposed, such as the three-step search (3SS) [1], diamond search (DS) [2], cross search algorithm (CSA) [3], 2-D logarithmic search (2DLOG) [4], new three-step search (N3SS) [5], and four-step search (4SS) [6] algorithms. These algorithms significantly reduce the complexity of motion estimation by matching only some predefined points within the search window. They are based on the assumption that the global minimum of the motion vector distribution is center-biased rather than uniformly distributed. However, this assumption is not always satisfied by real-world video sequences [7]. As a result, video quality degradation is produced by getting trapped in local minimum points.
G. Bebis et al. (Eds.): ISVC 2011, Part I, LNCS 6938, pp. 147–155, 2011. © Springer-Verlag Berlin Heidelberg 2011
The Alternating Subsampling Search Algorithm (ASSA) [8] reduces the number of pixels, instead of reducing the number of checking points. This algorithm is based on alternating subsampling patterns when calculating different location’s Block Distortion Measure (BDM). ASSA gives four times the computational reduction, with its mean square error (MSE) performance very close to that of FS. Apart from the subsampling algorithm, half way-stop techniques can also be used in order to recue computational complexity within the BDM calculation. A Partial Distance Search (PDS) [9] is used in the vector quantization encoding process and is one of the examples provided. The basic concept of PDS is as follows: the computational complexity is reduced when the measured Sum of Absolute Difference (SAD) calculation is prematurely terminated, which occurs when it finds that a partial SAD is greater than the minimum SAD, which was already obtained during the search. In order to improve the performance of the PDS, a sorting based approach called fast full search with sorting by gradient (FFSSG) is proposed [10]. Cheung and Po [11] proposed an improved version of the PDS method, called a Normalized Partial Distortion Search (NPDS) for early rejection of an impossible candidate motion vector. NPDS achieves an average computation reduction of 12 to 14 times, with only a small degradation of quality. Cheung and Po extended the NPDS to an adjustable PDS (APDS) [12], while searching by quality factor. Simulation results show that APDS reduces the computational complexity up to 38 times, with a relatively larger degradation in PSNR performance, when compared to a full search. Some other fast methods, based on PDS, are proposed, such as the enhanced normalized partial distortion search based on the block similarity [13], fast full search using adaptive matching scan strategy to reject impossible candidates early on [14], sorting partial distortion based motion estimation [15], and the new sorting-based partial distortion elimination algorithm [16]. The Dual Halfway Stop normalized PDS (DHS-NPDS) [17] is introduced, and the simulation results for DHS-NPDS show a high reduction of 50 times, but the video quality degradation is large when compared with a full search within high motion video sequences. More recently, Sarwer and Wu [18] introduce the Two-Step Edge Based Partial Distortion Search (TS-EPDS) algorithm, which achieves 90 times the reduction of computational complexity, with negligible PSNR degradation. This paper proposes an adaptive two-step APDS algorithm. The proposed algorithm, based on TS-EPDS, reduces the complexity by using early termination in the first step and adaptive search range of the second step search. The rest of this paper is organized as follows. The PDS and APDS algorithms are described in Section 2, and the proposed adaptive two-step APDS is presented in Section 3. Section 4 reports the experimental results on the proposed algorithm, and the conclusions are given in Section 5.
2 Adjustable Partial Distortion Search Algorithm

2.1 Partial Distortion Search Algorithm
The FS algorithm finds the best matching block using the Sum of Absolute Differences (SAD), which gives results similar to the mean square error. The SAD is defined as
SAD(x, y, mx, my) = Σ_{i=1..M} Σ_{j=1..M} |I_n(x+i, y+j) − I_{n−1}(x+i+mx, y+j+my)| ,    (1)
where I_n and I_{n−1} denote the pixel values in the current and previous frame, respectively, (x, y) represents the coordinates of the upper left corner pixel of the current block, and (mx, my) is the displacement relative to the current block located at (x, y).
Fig. 1. Computation order of sub-block position of 16 x 16 block
The partial distortion search (PDS) is terminated early when the accumulated partial SAD becomes greater than SAD_min. The PDS divides the macroblock into sixteen 4×4 sub-blocks, and the partial distortions are defined as

d_k = Σ_{i=0..3} Σ_{j=0..3} |I_n(x + 4i + s_k, y + 4j + t_k) − I_{n−1}(x + 4i + s_k + mx, y + 4j + t_k + my)| ,    (2)

D_k = Σ_{p=1..k} d_p ,    (3)
where d_k is the kth partial distortion and (s_k, t_k) denotes the pixel offset used for the kth partial distortion (see Table 1). During block matching, the accumulated partial matching error D_k is computed after every period and compared with the minimum SAD. If D_k is larger than the SAD minimum, the candidate is considered impossible and the rest of the block calculation is skipped. In order to reject improper candidates even earlier, the Normalized Partial Distortion Search (NPDS) uses normalization. Instead of using D_k, the normalized partial
distortion Dnorm is compared with the normalized minimum SAD. Dnorm is defined as follows:
D_norm = D_k × 16/k .    (4)
The probability of early rejection is greatly increased by adopting this normalization. As a result, the performance improves by about 10 to 12 times in terms of computation reduction; however, false rejections also occur, because a partial distortion cannot represent the various motions inside the block. Due to these erroneous rejections, the PSNR performance of the NPDS is not sufficient in comparison with that of FS.

Table 1. Offsets of the pth partial distortion from the upper left corner pixel of a sub-block
k        | 1     2     3     4     5     6     7     8
(sk, tk) | (0,0) (2,2) (2,0) (0,2) (1,1) (3,3) (3,1) (1,3)
k        | 9     10    11    12    13    14    15    16
(sk, tk) | (1,0) (3,2) (0,1) (2,3) (3,0) (1,2) (2,1) (0,3)

2.2 Adjustable Partial Distortion Search
The APDS algorithm is a combination of PDS and NPDS with a quality factor k. It decides whether the block matching of the current candidate continues or stops for the rest of the partial distortions. To achieve this adjustable control, the adjustable function f(n, k) is defined as
f(n, k) = (1 − k) · n + k · N²    (5)
When k = 0, f(n, k) = n and APDS gives the same performance as NPDS. When k = 1, f(n, k) = N² and APDS works like PDS, which provides the same quality as FS. Accordingly, moving k from 1 towards 0 makes APDS faster, but introduces some quality degradation. In this paper, k is set to 0.4, which shows the best trade-off between speed and PSNR, as described in [13]. Using the predefined f(n, k), APDS performs the normalized distortion comparison N²·D_k > f(n, k)·D_min, where D_k is the accumulated partial distortion and D_min is the SAD minimum. If N²·D_k of the current candidate is smaller than f(n, k)·D_min, the next partial distortion is computed and the normalized comparison is repeated until the last partial distortion. At the end of the block comparison, if the candidate block distortion is still smaller than f(n, k)·D_min, the candidate's SAD replaces the current D_min.
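The adjustable comparison can be sketched as follows. This is a minimal, illustrative Python sketch and not the authors' implementation; the dispersed pixel pattern follows the reading of Eq. (2) and Table 1, and the accumulated pixel count n = 16·k is an assumption.

```python
import numpy as np

# Offsets (s_k, t_k) of the k-th partial distortion inside each 4x4 sub-block (Table 1)
OFFSETS = [(0, 0), (2, 2), (2, 0), (0, 2), (1, 1), (3, 3), (3, 1), (1, 3),
           (1, 0), (3, 2), (0, 1), (2, 3), (3, 0), (1, 2), (2, 1), (0, 3)]

def apds_block_distortion(cur, ref, d_min, k=0.4, N=16):
    """Accumulate partial distortions of a 16x16 candidate block and stop as soon
    as N^2 * D_k > f(n, k) * D_min (the adjustable comparison of Eq. (5))."""
    D = 0
    for step, (s, t) in enumerate(OFFSETS, start=1):
        # k-th partial distortion d_k: one pixel from every 4x4 sub-block, Eq. (2)
        d = np.abs(cur[s::4, t::4].astype(np.int32) -
                   ref[s::4, t::4].astype(np.int32)).sum()
        D += d
        n = 16 * step                        # pixels compared so far (assumption)
        f = (1.0 - k) * n + k * N * N        # adjustable function f(n, k)
        if N * N * D > f * d_min:
            return None                      # candidate rejected early
    return D                                 # full SAD of a surviving candidate
```

A surviving candidate whose returned SAD is smaller than D_min would then become the new best match, exactly as described above.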
3 Proposed Algorithm
The proposed algorithm is based on TS-EPDS, which gives high performance in terms of both PSNR and complexity. In order to further reduce the computations, we propose an adaptive two-step APDS (ATS-APDS). The two-step algorithm is not optimal by itself, but it can be improved by using an early termination condition. It is composed of a first, rough search and a second, concentrated search. In the first step, ATS-APDS starts from the center of the search area and moves spirally outward along the search pattern shown in Fig. 2. The selected search point pattern is dense in the center area and sparse in the outer area, because the density of the center area, where small motions lie, directly affects the video quality.
Fig. 2. Selected search points for search range 8. For search range 16, the center pattern is unchanged and the outer pattern is simply expanded.
3.1 Early Termination
In the first step, the early termination parameter R is defined as:

R = L_SADmin          if |MVX| + |MVY| < 2
    L_SADmin >> 1     else if |MVX| + |MVY| < 4
    L_SADmin >> 2     otherwise                        (6)
where L_SADmin represents the SADmin of the collocated block in the previous frame, and MVX and MVY denote the motion vector components of that collocated block. Equation (6) is based on the assumption that a block with a very small motion vector has
a high correlation of SADmin between the previous frame and the current frame. To determine this correlation, the absolute sum of the motion vector (ASMV) is used. If the ASMV is lower than 2, the SADmin of the previous frame is used directly as R; if it is between 2 and 4, R is half of SADmin; otherwise, R is a quarter of the SADmin of the collocated block of the previous frame.
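A direct transcription of Eq. (6) might look as follows (a sketch; integer SAD values and the variable names are assumptions):

```python
def early_termination_threshold(l_sad_min, mvx, mvy):
    """Threshold R of Eq. (6): reuse SADmin of the collocated block of the
    previous frame, scaled down the more that block has moved."""
    asmv = abs(mvx) + abs(mvy)       # absolute sum of the motion vector (ASMV)
    if asmv < 2:
        return l_sad_min
    if asmv < 4:
        return l_sad_min >> 1        # half of SADmin
    return l_sad_min >> 2            # quarter of SADmin
```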
3.2 Search Range Adjustment
In the first step, two motion vectors are obtained: the best candidate motion vector, which has the minimum SAD, and the second best candidate motion vector. In order to further reduce the calculations, a search range adjustment is performed before starting the second step. To control the search range of the second step, we use motion vector prediction based on the left, upper, and upper left neighboring blocks. If the best motion vector of the first step is similar to the neighboring blocks, we change the search range D, which is defined as follows:

D = 1    if max(|cur_mv − u_mv|, |cur_mv − l_mv|, |cur_mv − ul_mv|) < 2
    2    otherwise                                                        (7)
If the best candidate motion vector is similar to the neighboring motion vectors, we consider it to be close to the true motion vector. Therefore, D is set to 1 and the second best motion vector is not used in the second-step search. If the best candidate motion vector is not similar to the neighboring motion vectors, D is 2 and we search around both the best and the second best candidate motion vectors. In the second step we also use APDS, and points that were already examined in the first step are not searched again. Because of the early termination, sometimes only one best motion vector is available. In this case, the same rule (7) is used, and if D is 1, the second motion vector does not need to be ignored because it does not exist. Fig. 3 shows the overall ATS-APDS algorithm. The early termination in the first step and the search range adjustment produce an improvement in complexity reduction.
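The range selection of Eq. (7) can be sketched as below; how the distance between two motion vectors is measured is an assumption (here: the maximum component difference).

```python
def second_step_search_range(cur_mv, u_mv, l_mv, ul_mv):
    """Search range D of Eq. (7): D = 1 if the best first-step motion vector is
    similar to the upper, left and upper-left neighbours, otherwise D = 2."""
    def dist(a, b):
        return max(abs(a[0] - b[0]), abs(a[1] - b[1]))
    if max(dist(cur_mv, u_mv), dist(cur_mv, l_mv), dist(cur_mv, ul_mv)) < 2:
        return 1     # trust the best vector only
    return 2         # also search around the second best candidate
```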
Fig. 3. Block diagram of the overall ATS-APDS algorithm
Table 2. Experimental results (8 CIF sequences; 90 frames for Football and 300 frames for the others). ATS-EPDS denotes TS-EPDS with search range adjustment. ATS-APDS_1 represents the ATS-APDS algorithm without early termination, and ATS-APDS_2 includes both the early termination and the search range adjustment algorithm.
Sequence   Metric    FS     PDS    NPDS    APDS   TS-EPDS  ATS-EPDS  ATS-APDS_1  ATS-APDS_2
football   PSNR      27.73  27.73  27.51   27.73  27.81    27.79     27.8        27.77
           Speed-up  1      1.68   10.82   9.66   19.03    19.35     32.66       33.11
foreman    PSNR      32.33  32.33  32.21   32.36  32.37    32.31     32.34       32.32
           Speed-up  1      3.17   12.35   15.29  51.29    58.7      77.96       85.91
coast      PSNR      30.68  30.68  30.58   30.69  30.71    30.7      30.68       30.68
           Speed-up  1      3.35   12.95   33.83  85.9     99.64     129.44      160.84
akiyo      PSNR      42.94  42.94  42.85   42.94  42.93    42.92     42.93       42.93
           Speed-up  1      9.35   13.30   52.8   187.59   363.03    253.86      504.39
mobile     PSNR      25.18  25.18  25      25.17  25.13    25.13     25.14       25.14
           Speed-up  1      3.76   13.07   36.85  95.62    105.39    156.62      211.94
mother     PSNR      40.54  40.54  40.44   40.53  40.52    40.5      40.52       40.51
           Speed-up  1      3.19   12.49   20.52  73.11    161.59    84.42       106.29
news       PSNR      36.91  36.91  36.69   36.89  36.85    36.83     36.85       36.83
           Speed-up  1      6.88   13.19   42.73  135.36   221.84    190.62      293.39
hall       PSNR      34.83  34.83  34.7    34.81  34.8     34.79     34.8        34.78
           Speed-up  1      3.03   12.89   22.46  57.87    73.11     102.92      148.43
average    PSNR      33.89  33.89  33.747  33.888 33.889   33.873    33.882      33.872
           ΔPSNR     -      0      -0.143  -0.002 -0.001   -0.017    -0.008      -0.018
           Speed-up  1      4.48   12.6    30.24  92.56    147.08    132.23      199.41

4 Experimental Results
The proposed algorithm is simulated using CIF (352×288) sequences with 300 frames each, except for Football (90 frames). In this experiment, the proposed algorithm is implemented in an independent codec that performs motion estimation and motion compensation only. The block size is 16×16 and the size of the search window is 33×33. We compare ATS-APDS against FS, PDS, NPDS, APDS, TS-EPDS, and ATS-EPDS. To evaluate the objective performance, the PSNR and the number of operations are used. The number of operations is defined as:

Operations = (add or sub) + abs + 8 × (mul or div) .    (8)
Multiplication and division operations are weighted by a factor of 8 because each requires more computation than an addition or subtraction. The speed-up factor is determined by dividing the number of operations of FS by the number of operations of the respective BMA. Table 2 shows the PSNR and the speed-up performance of the 8 algorithms. The PDS is a lossless method whose PSNR is identical to FS, but it is only 4.5 times faster than FS. NPDS is about 13 times faster, but there is some video quality degradation. APDS is 30 times and TS-EPDS 93 times faster than FS without a noticeable PSNR drop. For videos with small motion, such as Akiyo, News, and Mother & Daughter, the speed-up is even larger. When the search range adjustment is applied to TS-EPDS (ATS-EPDS), it is approximately 1.6 times faster than TS-EPDS with the same video quality. This shows that the search range adjustment performs well with a negligible PSNR drop. To analyze the efficiency of the early termination algorithm, we evaluate ATS-APDS_1 and
ATS-APDS_2. The ATS-APDS algorithm is almost the same as ATS-EPDS, except that EPDS is replaced with APDS. Although APDS is far faster than EPDS, ATS-EPDS is faster than ATS-APDS_1 because ATS-EPDS (and TS-EPDS) has its own early termination algorithm. ATS-APDS_2 is 1.5 times faster than ATS-APDS_1 with almost the same PSNR performance. From these results, the search range adjustment especially reduces the complexity for video sequences with small motion vectors, while the early termination algorithm improves the speed over all sequences.
5 Conclusion
In this paper, the ATS-APDS algorithm is proposed to reduce the complexity of motion estimation. To increase the early rejection rate of improper candidate motion vectors, the EPDS algorithm is replaced by APDS. To achieve a further reduction of complexity, an early termination algorithm and a search range adjustment are used, which effectively reduce the number of search points. Experimental results show that the proposed algorithm maintains almost the same PSNR performance as FS for various video sequences, while achieving an average complexity reduction of 199 times. Therefore, the ATS-APDS algorithm is suitable for real-time, high-quality video coding applications.

Acknowledgement. This work was supported by the Brain Korea 21 Project in 2011.
References
1. Li, R., Zeng, B., Liou, M.L.: A new three-step search algorithm for block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 4(4), 438–442 (1994)
2. Zhu, S., Ma, K.-K.: A new diamond search algorithm for fast block matching motion estimation. IEEE Trans. Image Processing 9(2), 287–293 (2000)
3. Ghanbari, M.: The cross-search algorithm for motion estimation. IEEE Trans. Commun. 38, 950–953 (1990)
4. Jain, J.R., Jain, A.K.: Displacement measurement and its application in inter frame image coding. IEEE Trans. Commun. COM-29(12), 1799–1808 (1981)
5. Li, R., Zeng, B., Liou, M.L.: A new three-step search algorithm for block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 4, 438–442 (1994)
6. Po, L.M., Ma, W.C.: A novel four step search algorithm for fast block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 6(3), 313–317 (1996)
7. Chow, K.H.K., Liou, M.L.: Generic motion search algorithm for video compression. IEEE Trans. Circuits Syst. Video Technol. 3, 440–445 (1993)
8. Liu, B., Zaccarin, A.: New fast algorithms for the estimation of block motion vectors. IEEE Trans. Circuits Syst. Video Technol. 3(2), 148–157 (1993)
9. Bei, C.D., Gray, R.M.: An improvement of the minimum distortion encoding algorithm for vector quantization. IEEE Trans. Commun. COM-33, 1132–1133 (1985)
10. Montrucchio, B., Quaglia, D.: New sorting based lossless motion estimation algorithms and a partial distortion elimination performance analysis. IEEE Trans. Circuits Syst. Video Technol. 15(2), 210–220 (2005)
11. Cheung, C.K., Po, L.M.: Normalized partial distortion algorithm for block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 10(3), 417–422 (2000)
12. Cheung, C.K., Po, L.M.: Adjustable partial distortion search algorithm for fast block motion estimation. IEEE Trans. Circuits Syst. Video Technol. 13(1), 100–110 (2003)
13. Hong, W.G., Oh, T.M.: Enhanced partial distortion search algorithm for block motion estimation. Electronics Letters, 1112–1113 (2003)
14. Kim, J.N., Byun, S.C., Kim, Y.H., Ahn, B.H.: Fast full search motion estimation algorithm using early detection of impossible candidate vectors. IEEE Trans. Signal Process. 50(9), 2355–2365 (2002)
15. Montrucchio, B., Quaglia, D.: New sorting based lossless motion estimation algorithms and a partial distortion elimination performance analysis. IEEE Trans. Circuits Syst. Video Technol. 15(2), 210–220 (2005)
16. Choi, C.R., Jeong, J.J.: New Sorting-Based Partial Distortion Elimination Algorithm for Fast Optimal Motion Estimation. IEEE Trans. Consumer Electron. 55(4), 2335–2340 (2009)
17. Sarwer, M.G., Jonathan Wu, Q.M.: Efficient Two Step Edge based Partial Distortion Search for Fast Block Motion Estimation. IEEE Trans. Consumer Electron. 55(4), 2154–2162 (2009)
Feature Trajectory Retrieval with Application to Accurate Structure and Motion Recovery Kai Cordes, Oliver Müller, Bodo Rosenhahn, and Jörn Ostermann Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover {cordes,omueller,rosenhahn,ostermann}@tnt.uni-hannover.de http://www.tnt.uni-hannover.de/
Abstract. Common techniques in structure from motion do not explicitly handle foreground occlusions and disocclusions, leading to several trajectories of a single 3D point. Hence, different discontinued trajectories induce a set of (more inaccurate) 3D points instead of a single 3D point, so that it is highly desirable to enforce long continuous trajectories which automatically bridge occlusions after a re-identification step. The solution proposed in this paper is to connect features in the current image to trajectories which discontinued earlier during the tracking. This is done using a correspondence analysis which is designed for wide baselines and an outlier elimination strategy using the epipolar geometry. The reference to the 3D object points can be used as a new constraint in the bundle adjustment. The feature localization is done using the SIFT detector extended by a Gaussian approximation of the gradient image signal. This technique provides the robustness of SIFT coupled with increased localization accuracy. Our results show that the reconstruction can be drastically improved and the drift is reduced, especially in sequences with occlusions resulting from foreground objects. In scenarios with large occlusions, the new approach leads to reliable and accurate results while a standard reference method fails.
1 Introduction
Camera motion estimation and simultaneous reconstruction of rigid scene geometry using image sequences is a key technique in many computer vision applications [1,2,3,4,5,6]. The basis for the estimation is the usage of corresponding features which arise from a 3D point being mapped to different camera image planes as shown in Fig. 1. By using a statistical error model which describes the errors in the position of the detected feature points, a Maximum Likelihood estimator can be formulated that simultaneously estimates the camera parameters and the 3D positions of feature points. This joint optimization is called bundle adjustment [7]. It minimizes the distances between the detected feature points and the reprojected 3D points. Most sequential approaches for structure and motion recovery determine corresponding features in consecutive frames. These correspondences are subsumed
to a trajectory. For small baselines between the cameras, feature tracking methods like KLT [8] are appropriate to obtain stable correspondences. For larger viewpoint changes, feature matching methods [9,10,11] have shown impressive performance in determining stable correspondences. The image signal surrounding the feature position is analyzed and distinctive characteristics of this region are extracted for a comparison. These characteristics are assembled to a vector which provides a distinctive representation of the feature. This vector, called descriptor [12], is used to establish correspondences by calculating a similarity measure (L2 distance) between the current descriptor and the feature descriptors of a second image. In standard structure and motion recovery approaches, a feature without a correspondence in the previous frame is regarded as a newly appearing object point. If the image feature has been temporarily occluded as shown in Fig. 1, the new object point and the object point that was generated before the occlusion adopt different 3D positions. As a consequence, errors accumulate and noticeable drift occurs. This problem arises from foreground occlusion as shown in Fig. 2, moving objects, repeated texture, image noise, motion blur, or because tracked points temporarily leave the camera's field of view. For the performance of the bundle adjustment, it is essential to assign reappearing feature points to the correct trajectories and object points. In case of a closed sequence, the drift can be reduced by enforcing the constraint between the cameras observing the same scene content (i.e. first and last camera view after a complete circuit) [13,14,15]. In [16], the drift is reduced by estimating the transformation between reconstructed 3D point clouds using RANSAC [17]. In [6], broken trajectories caused by occlusion are merged using a combination of localization and similarity constraints of the reprojected object
Fig. 1. The feature trajectories pj,k , pj,k+1 , . . . , pj,k+l are used for structure and motion estimation. Due to foreground occlusion, they discontinue and the corresponding scene content reappears in the images Ik+2 and Ik+3 , respectively. A real world example is shown in Fig. 2.
Fig. 2. Bellevue sequence: example frames 96, 101, 106, 111 with temporarily occluded scene content resulting from a foreground object. In standard structure from motion approaches, nearly all feature trajectories discontinue. New object points are generated for scene content after being occluded temporarily.
points in the images. These approaches are only applicable in a post processing step, i.e. after the last frame of the sequence is processed. Furthermore, the estimation has to be accurate before applying the merging. Otherwise it would be impossible to match the cameras, point clouds, or the reprojections of object points. A recent approach [18] directly uses the SIFT descriptor for establishing correspondences in sequential structure and motion recovery. An additional homography constraint for planar features is included to stabilize the tracking using a two-pass matching. Each of the above mentioned publications neglect to provide an appropriate accuracy evaluation of the reconstructed results. The commonly used reprojection error is not appropriate for the comparison of results with different numbers of constraints in the bundle adjustment. A more accurate solution with more constraints may have a higher reprojection error [16], because it is more likely to find a reconstruction solution for a less constrained system of equations. While the accuracy of the reconstruction increases by enforcing more correct scene relations in the bundle adjustment, the reprojection error increases because of the additional constraints. This important aspect of evaluation is accentuated in this paper. In our work, the broken trajectories are continued during the tracking by referring a reappearing feature to its last valid occurrence in a previous image. Instead of post-processed merging of different sets of reconstructed object points, our approach immediately continues the trajectory after the occlusion or disturbance of the track. In cases without frame to frame correspondences, the camera recovery can continue without a new initialization. Our feature track retrieval is also beneficial in tracking situations with noise, small occlusions, or repeated texture. The approach is applicable for a live broadcast scenario and may be used to extend any sequential structure and motion recovery method. The limited localization accuracy of SIFT is addressed by incorporating the feature localization technique presented in [19]. This approach improves the feature localization procedure of SIFT by assuming a Gaussian shape of the feature neighborhood. Hence, the combined approach provides the robustness of SIFT coupled with increased localization accuracy.
This paper provides the following contributions:
– a feature tracker which bridges occlusions and therefore generates long feature trajectories, yielding an improved 3D reconstruction
– a detailed analysis of the achieved accuracy of the new tracker using a highly accurate feature localization procedure
– several test scenes demonstrating the superior performance of the proposed combined method
In the following Section 2, the camera motion estimation is briefly explained. The feature localization technique is shown in Section 3. In Section 4, the approach of feature trajectory retrieval is presented. Section 5 shows experimental results using natural image sequences. In Section 6, the paper is concluded.
2 Structure and Motion Recovery
The goal of structure and motion recovery is the accurate estimation of the camera parameters and, simultaneously, of 3D object points of the observed scene [4]. The camera parameters of one camera are represented by the projection matrix Ak for each image Ik , k ∈ [1 : K] for a sequence of K image frames. The input data for standard structure and motion estimation consists of corresponding feature points pj,k , pj,k+1 in consecutive image frames. The accumulation of correspondences over several images is a trajectory tj := (pj,k , pj,k+1 , . . . , pj,k+l ), j ∈ [1 : J]. For each trajectory tj , a 3D object point Pj is reconstructed. The 3D-2D correspondence of object and feature point is related by: pj,k ∼ Ak Pj
(1)
where ∼ indicates that this is an equality up to scale. The reconstruction starts with an initialization from automatically selected keyframes [20,21]. After computing initial values for the current camera AK for frame K, the result is optimized by minimizing the bundle adjustment equation:

ε = Σ_{j=1..J} Σ_{k=1..K} d(p_{j,k}, A_k P_j)²    (2)
The value ε_r = sqrt(ε / (2JK)), which is often used for evaluation [14,15,19], is the reprojection error. Object points with short trajectories (length < 3 images) or large reprojection errors are discarded to increase the tracking stability. For the case of video with small displacements between two frames, feature tracking methods like KLT tend to produce fewer outliers than feature matching methods. Nevertheless, in this work feature correspondences are established using the SIFT descriptor [9]. It provides more flexibility for reconstructing from images with wide baseline cameras. This also leads to a better interpretability of the accuracy validation of the presented methods.
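As an illustration of the two quantities ε and ε_r, the following sketch evaluates them for a toy setup in which every object point is observed in every frame (real trajectories are shorter); the names and the normalisation by 2JK follow the definitions above.

```python
import numpy as np

def bundle_adjustment_error(points_2d, cams, points_3d):
    """points_2d[j][k]: detected feature p_{j,k} (2-vector); cams[k]: 3x4 projection
    matrix A_k; points_3d[j]: homogeneous 4-vector P_j. Returns (epsilon, epsilon_r)."""
    J, K = len(points_3d), len(cams)
    eps = 0.0
    for j in range(J):
        for k in range(K):
            x = cams[k] @ points_3d[j]
            x = x[:2] / x[2]                              # reprojected object point
            eps += float(np.sum((points_2d[j][k] - x) ** 2))
    return eps, np.sqrt(eps / (2 * J * K))                # Eq. (2) and epsilon_r
```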
In the standard structure and motion recovery, features that have no correspondence to the previous frame are considered as new object points. This can lead to significant drift even in short sequences when large occlusions occur. In this work, this problem is solved using the trajectory retrieval as explained in Section 4. This approach also leads to an increase in the length of the trajectories in general. The second problem is the limited localization accuracy of the SIFT detector, especially for coarse scales [12]. Hence, the localization accuracy is increased using a better approximation of the gradient signal as explained in the following Section 3.
3 Increased Localization Accuracy for SIFT
In [22], it is shown that the feature localization of the SIFT detector can be improved by modifying the gradient signal approximation procedure. The assumption that the image signal around a feature has Gaussian shape is incorporated. Approximately, this assumption can be transferred to the Difference of Gaussians pyramid which is used for the feature localization. Instead of interpolating with a 3D quadric as in the original SIFT detector [12], a regression with a Gaussian function is used. The approximation of the selected scale of the Difference of Gaussians pyramid is determined by:

G_p(x) = (v / |Σ|) · exp(−(1/2) (x − x0)ᵀ Σ⁻¹ (x − x0))    (3)

using the covariance matrix Σ = (a² b; b c²) and a peak value parameter v. The feature coordinate x0 = (x0, y0) provides increased localization accuracy. The optimal parameter vector p = (x0, y0, a, b, c, v) is computed using a regression analysis with Levenberg-Marquardt optimization. For details, the reader may refer to [19,22].
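A sketch of such a regression is given below, assuming SciPy's Levenberg–Marquardt least-squares solver; the patch size, the initialisation of (a, b, c, v) and the handling of a degenerate covariance are assumptions and not part of the original method.

```python
import numpy as np
from scipy.optimize import least_squares

def refine_keypoint_gaussian(dog_patch, x_init, y_init):
    """Fit the Gaussian model of Eq. (3) to a patch of the selected DoG scale
    around a SIFT keypoint and return the refined sub-pixel position (x0, y0)."""
    h, w = dog_patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    target = dog_patch.ravel().astype(float)

    def residuals(p):
        x0, y0, a, b, c, v = p
        cov = np.array([[a * a, b], [b, c * c]])
        det = np.linalg.det(cov)
        if det <= 1e-12:                                   # degenerate covariance
            return np.full(target.shape, 1e3)
        d = pts - np.array([x0, y0])
        md = np.einsum('ij,jk,ik->i', d, np.linalg.inv(cov), d)   # Mahalanobis term
        model = (v / det) * np.exp(-0.5 * md)
        return model - target

    p0 = np.array([x_init, y_init, 1.5, 0.0, 1.5, float(dog_patch.max())])
    res = least_squares(residuals, p0, method='lm')        # Levenberg-Marquardt
    return res.x[0], res.x[1]
```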
4 Feature Trajectory Retrieval
During frame to frame tracking, several trajectories discontinue due to occlusion, repeated texture, or noise in the image signal. By extending the feature comparison to more images, these discontinued trajectories are retrieved if they reappear. These additional constraints are used in the bundle adjustment. To guarantee robust feature matching, the SIFT descriptor [9] is used for the correspondence analysis. In the following, the proposed feature trajectory retrieval is explained using an appropriate memory (Section 4.1) and an outlier elimination method (Section 4.2).

4.1 Trajectory Memory
By using a trajectory memory, the common frame to frame comparison is extended with the correspondences between the current image frame K and frame
K − L, L = 2, . . . , Lmax . The memory stores each trajectory tj that discontinues and attaches its descriptor for a later retrieval. To guarantee that the stored trajectories correspond to stable object points, only trajectories of a length larger than Lmin images are considered. For the experiments in this paper, this stability constraint is set to Lmin = 4. Newly appearing features in the current image IK with no valid feature correspondence to the previous frame are compared to each trajectory tj in the memory using the attached SIFT descriptors. For the matching, the second nearest neighbor search is used [9]. The best match is a candidate to continue the previously discontinued trajectory tj as shown in Fig. 3. The candidate has to be verified by using an outlier elimination scheme, which is explained in Section 4.2. To reduce the computational time, the matching between the current feature and the trajectories is limited to Lmax past images. The parameter Lmax controls the size of the trajectory memory. A trajectory in the memory that is older than Lmax images is deleted. For the experiments in this paper, the memory size is set to Lmax = 50 images. To reject false matches, an outlier detection method is applied as explained in the next section.
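The memory described above can be sketched as follows; this is an illustrative Python sketch, and the nearest-neighbour ratio of 0.8 as well as the data layout are assumptions.

```python
import numpy as np

class TrajectoryMemory:
    """Minimal sketch of the trajectory memory: discontinued trajectories of
    length >= L_min are stored with their SIFT descriptor and matched against
    newly appearing features by the second-nearest-neighbour test; entries
    older than L_max frames are dropped."""

    def __init__(self, l_min=4, l_max=50, ratio=0.8):
        self.l_min, self.l_max, self.ratio = l_min, l_max, ratio
        self.entries = []                       # (last_frame, descriptor, trajectory_id)

    def store(self, frame, descriptor, trajectory_id, length):
        if length >= self.l_min:                # only stable trajectories are kept
            self.entries.append((frame, np.asarray(descriptor, float), trajectory_id))

    def prune(self, current_frame):
        self.entries = [e for e in self.entries
                        if current_frame - e[0] <= self.l_max]

    def retrieve(self, current_frame, descriptor):
        """Return the id of the best matching stored trajectory, or None."""
        self.prune(current_frame)
        if len(self.entries) < 2:
            return None
        d = np.array([np.linalg.norm(e[1] - descriptor) for e in self.entries])
        best, second = np.argsort(d)[:2]
        if d[best] < self.ratio * d[second]:    # second-nearest-neighbour test [9]
            return self.entries[best][2]
        return None
```

Matches found this way are only accepted after the epipolar-geometry-based outlier elimination of Section 4.2.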
Fig. 3. Schematic feature tracking situation for the proposed Feature Trajectory Retrieval. The newly detected feature pj,K has a valid corresponding trajectory in image IK−L . Our approach retrieves the corresponding feature trajectory tj .
4.2 Outlier Elimination
After determining candidates for the resumption of previously discontinued trajectories, they have to be verified using the epipolar constraint. Outliers are detected and removed from the correspondence set. If feature correspondences are found between image IK and image IK−L, they are evaluated using the RANSAC algorithm [17] together with the epipolar constraint using all valid feature correspondences between IK and IK−L. The epipolar constraint is defined by the fundamental matrix F [23]:

p_{j,k+1}ᵀ F p_{j,k} = 0  ∀j   and   det(F) = 0    (4)
Fig. 4. The Lift sequence (frames 19, 26, 33, 83) with temporarily occluded scene content resulting from a foreground object. In standard structure from motion approaches, all feature trajectories discontinue in frame 30. Thus, the tracking fails (top row). With the presented extensions, the tracking leads to accurate results as shown in the bottom row. Here, augmented objects are integrated in the sequence to demonstrate the accuracy of the results. The center row (SIFT with FTR) shows slight drift of the objects while the top row demonstrates a failure with the SIFT reference method.
To detect the outliers, the RANSAC approach is used evaluating the epipolar distance. The inliers can be referred to previously discontinued trajectories. The outliers remain in the memory for a possibly successful match in the following frames. Using the successfully matched correspondences increases the performance of the bundle adjustment for the sequential camera motion estimation, especially in cases when the number of frame to frame correspondences is low. The bundle adjustment equation (2) is extended with corresponding features before their occlusion. These constraints can be used immediately for the joint optimization in frame K by referring to the already reconstructed and stable object point Pj .
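The verification step can be sketched with OpenCV's RANSAC-based fundamental matrix estimation as a stand-in for the paper's own implementation; the pixel threshold and confidence are assumptions.

```python
import numpy as np
import cv2

def verify_retrieved_matches(pts_old, pts_cur):
    """Estimate F with RANSAC from all valid correspondences between I_{K-L} and
    I_K and keep only retrieved matches consistent with the epipolar constraint.
    pts_old, pts_cur: (N,2) arrays of corresponding points (N >= 8)."""
    pts_old = np.float32(pts_old)
    pts_cur = np.float32(pts_cur)
    F, mask = cv2.findFundamentalMat(pts_old, pts_cur, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None:
        return np.array([], dtype=int)
    inliers = mask.ravel().astype(bool)
    return np.flatnonzero(inliers)       # indices of trajectories to continue
```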
5 Experimental Results
The approaches presented in Section 3 (Gauss SIFT) and Section 4 (Feature Trajectory Retrieval - FTR) are validated using natural image sequences. Example images are shown in Fig. 2 for the Bellevue sequence and in Fig. 4 for the Lift sequence (kindly provided by Vicon: http://www.vicon.com/). In both cases, the standard structure and motion recovery is strongly influenced by occlusions resulting from foreground objects. The presented methods lead to
reliable and accurate results. The FTR provides many useful correspondences to previously discontinued trajectories. Thus, the generation of redundant and error-prone object points is avoided. For quantitative evaluation, the total number of object points, the object points visible in each frame, and the reprojection error are shown in Fig. 5 for the Bellevue sequence and in Fig. 6 for the Lift sequence. The points in time at which large occlusions occur are marked with ti. The total number of object points for the FTR is smaller than for the reference method, as shown in the top diagram of Fig. 5. This is due to the retrieval of trajectories with corresponding 3D object points. Nevertheless, the number of object points visible in each frame is higher for the FTR. Thus, more constraints following from larger trajectory lengths are used in the bundle adjustment. This leads to a more accurate reconstruction, although the reprojection error increases for FTR as shown in the bottom
Fig. 5. Bellevue reconstruction results, from top to bottom: total number of object points, object points visible in each frame, and reprojection error r (Root Mean Square Error, RMSE). Due to the Track Retrieval, our method (FTR) provides less object points in total, but more usable constraints in the bundle adjustment. Following from the additional constraints, r increases for FTR. Compared to SIFT, the new localization technique (Gauss SIFT) decreases r .
Fig. 6. Lift reconstruction results, from top to bottom: total number of object points, object points visible in each frame, and reprojection error r (Root Mean Square Error, RMSE). Again, our method (FTR) provides more usable constraints in the bundle adjustment. Until frame 30, the more constraints lead to an increase in r for FTR. The reference method fails in frame 30. Compared to SIFT, the new localization technique (Gauss SIFT) has lower r .
diagrams of Fig. 5 and Fig. 6, respectively. That the scene reconstruction using FTR provides better results is validated by integrating virtual objects into the scene, as shown in Fig. 4 for some example frames (the video can be downloaded at: http://www.tnt.uni-hannover.de/staff/cordes/). Due to an insufficient number of valid correspondences, the reference method fails for the Lift sequence in frame 30. Interestingly, the FTR leads to more object points per frame for the sequence parts without occlusion, too. We can infer that the method also increases the performance in the presence of noise and repeated texture, which is present in many scene parts of the Bellevue sequence (windows, grass texture).
The feature localization technique using the Gaussian regression function (Gauss SIFT with FTR) leads to better results in any case. Compared to the localization of SIFT (SIFT with FTR), it results in more object points, more object points per frame and, even for the higher amount of object points, a smaller reprojection error. Usually, the reprojection error increases when more constraints are added in the bundle adjustment, because it is more likely to find a reconstruction solution for a less constraint system of equations. Here, the results provide decreased reprojection error and more object points. The approaches are tested using the application scenario of integrating virtual objects into the image sequence. To validate the reconstruction accuracy, for the Lift sequence two static objects are integrated, which is shown in Fig. 4. Due to an accurately estimated camera path, the integration is convincing throughout the sequence using the combination of FTR and Gauss SIFT. The objects show a drift using the FTR and the standard SIFT feature localization technique. The structure and motion estimation fails in frame 30 for the reference method.
6 Conclusion
We present an improved sequential structure and motion recovery approach. Discontinued feature trajectories are retrieved using the distinctive SIFT descriptor and a correspondence analysis for nonconsecutive image frames. The retrieved correspondences between features in the current image and previously discontinued trajectories can be immediately used in the bundle adjustment of the current frame. The method leads to longer trajectories and avoids generating redundant and error-prone 3D object points. As a result, the bundle adjustment performance is increased. To compensate for the limited localization accuracy of the SIFT detector, an improved localization method is applied. It uses a Gaussian approximation of the image signal gradient instead of the interpolation with a 3D quadric used by the original SIFT. A problem using the reprojection error for comparing reconstructions from different numbers of scene constraints is revealed: while the reconstruction improves using more scene relations, the reprojection error increases because the additional constraints limit the possible solutions of the system of equations used for the bundle adjustment. The performance is validated using natural image sequences with temporal occlusions resulting from foreground objects. While the results of the reference method suffer from broken trajectories, the presented approach using the combination of highly-accurate feature localization and feature trajectory retrieval shows no drift. This is demonstrated by integrating augmented objects to the video. The presented extensions are applicable to any sequential structure and motion recovery algorithm.
References 1. Frahm, J.M., Pollefeys, M., Lazebnik, S., Gallup, D., Clipp, B., Raguram, R., Wu, C., Zach, C., Johnson, T.: Fast robust large-scale mapping from video and internet photo collections. Journal of Photogrammetry and Remote Sensing (ISPRS) 65, 538–549 (2010) 2. Hasler, N., Rosenhahn, B., Thorm¨ ahlen, T., Wand, M., Seidel, H.P.: Markerless motion capture with unsynchronized moving cameras. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2009) 3. van den Hengel, A., Dick, A., Thorm¨ ahlen, T., Ward, B., Torr, P.H.S.: Videotrace: rapid interactive scene modelling from video. In: ACM SIGGRAPH 2007 papers. SIGGRAPH 2007, vol. (86). ACM, New York (2007) 4. Pollefeys, M., Gool, L.V.V., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual modeling with a hand-held camera. International Journal of Computer Vision (IJCV) 59, 207–232 (2004) 5. Snavely, N., Seitz, S.M., Szeliski, R.: Modeling the world from internet photo collections. International Journal of Computer Vision (IJCV) 80, 189–210 (2008) 6. Thorm¨ ahlen, T., Hasler, N., Wand, M., Seidel, H.P.: Registration of sub-sequence and multi-camera reconstructions for camera motion estimation. Journal of Virtual Reality and Broadcasting 7 (2010) 7. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment - a modern synthesis. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) ICCV-WS 1999. LNCS, vol. 1883, pp. 298–372. Springer, Heidelberg (2000) 8. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 674–679 (1981) 9. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV) 60, 91–110 (2004) 10. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: British Machine Vision Conference (BMVC), vol. 1, pp. 384–393 (2002) 11. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 27, 1615–1630 (2005) 12. Brown, M., Lowe, D.G.: Invariant features from interest point groups. In: British Machine Vision Conference (BMVC), pp. 656–665 (2002) 13. Engels, C., Fraundorfer, F., Nist´er, D.: Integration of tracked and recognized features for locally and globally robust structure from motion. In: VISAPP (Workshop on Robot Perception), pp. 13–22 (2008) 14. Fitzgibbon, A.W., Zisserman, A.: Automatic camera recovery for closed or open image sequences. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1406, pp. 311–326. Springer, Heidelberg (1998) 15. Liu, J., Hubbold, R.: Automatic camera calibration and scene reconstruction with scale-invariant features. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A.V., Gopi, M., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4291, pp. 558–568. Springer, Heidelberg (2006) 16. Cornelis, K., Verbiest, F., Van Gool, L.: Drift detection and removal for sequential structure from motion algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 26, 1249–1259 (2004)
17. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography. Communications of the ACM 24, 381–395 (1981)
18. Zhang, G., Dong, Z., Jia, J., Wong, T.T., Bao, H.: Efficient non-consecutive feature tracking for structure-from-motion. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 422–435. Springer, Heidelberg (2010)
19. Cordes, K., Müller, O., Rosenhahn, B., Ostermann, J.: Bivariate feature localization for SIFT assuming a Gaussian feature shape. In: Bebis, G., Boyle, R.D., Parvin, B., Koracin, D., Chung, R., Hammoud, R.I., Hussain, M., Tan, K.H., Crawfis, R., Thalmann, D., Kao, D., Avila, L. (eds.) ISVC 2010. LNCS, vol. 6453, pp. 264–275. Springer, Heidelberg (2010)
20. Thormählen, T., Broszio, H., Weissenfeld, A.: Keyframe selection for camera motion and structure estimation from multiple views. In: Pajdla, T., Matas, J. (eds.) ECCV 2004, Part I. LNCS, vol. 3021, pp. 523–535. Springer, Heidelberg (2004)
21. Torr, P.H.S., Fitzgibbon, A.W., Zisserman, A.: The problem of degeneracy in structure and motion recovery from uncalibrated image sequences. International Journal of Computer Vision (IJCV) 32, 27–44 (1999)
22. Cordes, K., Müller, O., Rosenhahn, B., Ostermann, J.: Half-SIFT: High-accurate localized features for SIFT. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Workshop on Feature Detectors and Descriptors: The State Of The Art and Beyond, pp. 31–38 (2009)
23. Hartley, R.I., Zisserman, A.: Multiple View Geometry, 2nd edn. Cambridge University Press, Cambridge (2003)
Distortion Compensation for Movement Detection Based on Dense Optical Flow Josef Maier and Kristian Ambrosch AIT Austrian Institute of Technology, Donau-City-Strasse 1, 1220 Vienna, Austria {josef.maier.fl,kristian.ambrosch}@ait.ac.at
Abstract. This paper presents a method for detecting moving objects in two temporal succeeding images by calculating the fundamental matrix and the radial distortion and therefore, the distances from points to epipolar lines. In static scenes, these distances are a result of noise and/or the inaccuracy of the computed epipolar geometry and lens distortion. Hence, we are using these distances by applying an adaptive threshold to detect moving objects using views of a camera mounted on a Micro Unmanned Aerial Vehicle (UAV). Our approach uses a dense optical flow calculation and estimates the epipolar geometry and radial distortion. In addition, a dedicated approach of selecting point correspondences that suits dense optical flow computations and an optimization– algorithm that corrects the radial distortion parameter are introduced. Furthermore, the results on distorted ground truth datasets show a good accuracy which is outlined by the presentation of the performance on real–world scenes captured by an UAV.
1 Introduction
For the detection of objects moving on the ground in temporally succeeding images captured from UAVs, we propose a novel technique for the estimation of the radial distortion and epipolar geometry. There are various methods known for estimating the radial distortion, but only a few of them can be considered for the detection of moving objects in dense optical flow. Methods exist that model the distortion with the help of features in the image (lines, rectilinear elements, vanishing points, . . . ), e.g. [1]. Other techniques calibrate the camera offline (e.g. [2]). Unfortunately, feature-based approaches cannot be considered an optimum solution when dense optical flow data are available. Furthermore, an offline calibration would not suit a system that is considered for real-time movement detection. Here, we require a method that works online and suits point correspondence algorithms. Such methods were published among others by Fitzgibbon [3], Barreto and Daniilidis [4], and by Steele and Jaynes [5].
The research leading to these results was funded by the KIRAS security research program of the Austrian Ministry for Transport, Innovation and Technology (www.kiras.at).
In our work, we based our method on Fitzgibbon's approach, which was modified to fit our application. We are using Fitzgibbon's algorithm because we only need to estimate one parameter for radial distortion and because of its straightforward implementation, which is due to efficient numerical algorithms for solving Quadratic Eigenvalue Problems (QEPs) that are readily available. Moreover, Fitzgibbon's algorithm only needs 9 point correspondences while providing 4-6 possible solutions. As we are using dense optical flow data instead of point correspondences from feature-based matching, we introduce a dedicated approach of selecting point correspondences. This approach is optimized for the significance of image regions for distortion compensation and is independent of feature positions. Subsequent to the estimation of the fundamental matrix and the radial distortion parameter, we correct the parameter of radial distortion towards its correct value with an algorithm based on a Levenberg–Marquardt iteration [6]. Afterwards, the distances from points to epipolar lines are calculated. An adaptive threshold is applied to these distances, and distances above this threshold are marked as moving objects.
2 Optical Flow
For the correspondence search and computation of the Optical Flow we are using the Modified Census Transform (MCT) [7] on the intensity and gradient images. This algorithm was originally proposed by Ambrosch et al. [8] for the deployment in an embedded real–time stereo vision system. Furthermore, Puxbaum et al. [9] evaluated this area–based correspondence algorithm for Optical Flow, achieving highly accurate results. The Census Transform, originally proposed by Zabih and Woodfill [10], is a non–parametric algorithm that evaluates the pixels in a neighborhood region according to the comparison function ξ, where

ξ(i1, i2) = 1 if i1 > i2, and 0 if i1 ≤ i2,    (1)

and concatenates the resulting values, being either 1 or 0, to a bit-vector. In contrast to the original transform, the MCT compares the pixel values with the mean intensity value of the whole neighborhood region instead of using the center pixel's value. This way, the transform is more robust to saturated center pixels, showing significantly better results on gradient images [8]. For the matching costs, a sparse Hamming distance over the bit-vectors is computed. Here, only every 2nd bit in x and y direction is used for the computation, reducing the complexity of the calculation by a factor of 4. Then, these costs are aggregated in an additional computational step. This aggregation has the purpose of mean filtering the matching costs, highly reducing the noise in the resulting correspondence values. Unfortunately, this aggregation also leads to a high smoothness in the results, extending and distorting object boundaries. However, for an accurate computation of the distortion compensation, the reduction of noise is of essential importance as proposed in section 5.
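The MCT and the sparse Hamming distance can be sketched as follows; the window size is an assumption and image borders are only handled approximately (np.roll wraps around), so this is an illustration rather than the embedded implementation.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def modified_census_transform(img, radius=3):
    """MCT: compare every pixel of the neighbourhood against the mean intensity
    of the whole neighbourhood (Eq. (1) with i2 = mean) and concatenate the
    comparison bits to a bit vector per pixel."""
    img = img.astype(np.float32)
    mean = uniform_filter(img, size=2 * radius + 1)       # neighbourhood mean
    h, w = img.shape
    size = 2 * radius + 1
    bits = np.zeros((h, w, size * size), dtype=np.uint8)
    k = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
            bits[..., k] = (shifted > mean).astype(np.uint8)   # xi(i1, i2)
            k += 1
    return bits

def sparse_hamming(bits_a, bits_b, radius=3):
    """Sparse Hamming distance between two per-pixel bit vectors: only every
    2nd bit in x and y direction of the census window is compared."""
    size = 2 * radius + 1
    a = bits_a.reshape(size, size)[::2, ::2]
    b = bits_b.reshape(size, size)[::2, ::2]
    return int(np.count_nonzero(a != b))
```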
The subpixel refinement is performed using the method proposed by Shimizu et al. [11], where the matching costs are interpolated by fitting a parabola in vertical and horizontal direction with

d̂_sub = (c0 − c2) / (2·c0 − 4·c1 + 2·c2) .    (2)
For the reduction of outliers caused by wrong matches, we are further computing confidence values. Here, the difference between the absolute minimum in the matching costs and the second lowest minimum is computed and compared. If the confidence value is below a given threshold, the values are disregarded. In our work, we are always using a confidence threshold value of 10.
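The two steps can be transcribed directly; this is a small illustrative sketch, with c0, c1, c2 denoting the costs left of, at, and right of the best position.

```python
def subpixel_offset(c0, c1, c2):
    """Parabola fit of Eq. (2); returns the sub-pixel offset around the minimum."""
    denom = 2 * c0 - 4 * c1 + 2 * c2
    return 0.0 if denom == 0 else (c0 - c2) / denom

def is_confident(costs, threshold=10):
    """Keep a flow vector only if the best cost minimum is sufficiently lower
    than the second lowest minimum (confidence test with threshold 10)."""
    best, second = sorted(costs)[:2]
    return (second - best) >= threshold
```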
3 Distortion Compensation for Dense Optical Flow
To be able to detect a moving object and not an artifact that might be a result of distortion or noise, the fundamental matrix, one parameter for radial distortion, and an adaptive threshold have to be computed. Thus, we are using the well-known method of Fitzgibbon [3] to estimate one parameter for radial distortion and the epipolar geometry. This original method is briefly described in section 3.1. Section 3.2 describes a few changes to this original algorithm to better fit our application. These changes also include a special form of selecting point correspondences from the available dense optical flow data, introduced in section 3.3. Afterwards, we try to correct the radial distortion parameter, which is described in section 3.4.

3.1 Feature–Based Lens Distortion and Multiple View
A part of the proposed movement detection algorithm is based on Fitzgibbon's estimation of multiple view geometry and lens distortion [3], which is described in the following. This algorithm simultaneously calculates an approximate value λ for radial lens distortion and the fundamental matrix F, which are an input to the subsequent optimization–algorithm (see section 3.4). Fitzgibbon uses the division model

p = x / (1 + λ·‖x‖²)    (3)

for describing the approximated radial distortion, where p is an undistorted and x = (x, y)ᵀ a distorted (non-homogeneous) 2D point in the first image. This notation is consistent throughout the whole paper unless otherwise noted. Points in the second image are denoted as x′ and p′ and satisfy the same relation as the points in the first image, which can be expressed in condensed form as x = L(p).
(4)
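The division model and its inverse L can be sketched as follows; the choice of the real root closest to the undistorted radius and the degenerate-case handling are assumptions.

```python
import numpy as np

def undistort(x, lam):
    """Division model of Eq. (3): map a distorted point x (origin at the
    distortion centre, normalised coordinates) to the undistorted point p."""
    x = np.asarray(x, float)
    return x / (1.0 + lam * np.dot(x, x))

def distort(p, lam):
    """Inverse mapping x = L(p) of Eq. (4), via the radial relation
    r_u = r_d / (1 + lam * r_d^2)."""
    p = np.asarray(p, float)
    ru = np.linalg.norm(p)
    if ru == 0.0 or lam == 0.0:
        return p.copy()
    roots = np.roots([lam * ru, -1.0, ru])          # lam*ru*r_d^2 - r_d + ru = 0
    real = roots[np.abs(roots.imag) < 1e-12].real
    if real.size == 0:
        return p.copy()                              # no real solution (sketch fallback)
    rd = real[np.argmin(np.abs(real - ru))]
    return p * (rd / ru)
```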
In this discussion, the distortion center is assumed to be in the middle of the image. As the relationship between x and p depends on their distance from the
image center, the origin of the image points throughout this paper is the distortion center. This does not place any constraint on the position of the principal point. The distortion center and the principal point are generally different [12]. Moreover, a slight displacement of the distortion center from its real position does not strongly affect the correction [3,12]. In addition, all algorithms described in this paper are performed only with normalized points. Under distortion–free conditions, the fundamental matrix is computed from perfect (homogeneous) point correspondences p ↔ p′ by using the 8-point algorithm [6,13], which relies on the equation of the fundamental matrix:
(5)
As this is rarely the case, equation (5) is modified to include the distortion parameter λ: (x + λz )T F (x + λz) = 0 xT F x + λ(zT F x + xT F z) + λ2 zT F z = 0,
(6)
where z = (0, 0, x )T . Equation (6) can be solved by formulating it as a quadratic eigenvalue problem (QEP). Therefore, equation (6) can be rearranged by expanding everything out and gathering the resulting three row vectors into three matrices: 2
1 ] ·f + [ x x , x y , x , y x , y y , y , x , y , + λ [ 0 , 0 , x r2 , 0 , 0 , y r2 , xr2 , yr2 , r2 + r2 ] ·f + + λ2 [ 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , r2 r2 ] ·f = 0 (D1 + λD2 + λ2 D3 ) ·f = 0 ,
(7)
where r = x, r = x , and f is a vector extracted from F [3]. If the number of point correspondences is more than 9, the three matrices are not quadratic and every matrix of the overdetermined system has to be multiplicated on its left by D1T to be able to solve the QEP. In the presence of noise, the solution of the normal equations D1T D1 f = 0, D1T D2 f = 0, and D1T D3 f = 0 suffer from bias and mainly from variance. Additionally, this leads to 10 solutions in the general case and to 6 or fewer in practise [3,5]. Therefore, a robust estimation has to be used for real data. For this case, Fitzgibbon uses a feature–based matching algorithm and a RANSAC strategy [6] to eliminate incorrect correspondences and to estimate F and λ. To mark correspondences as inliers or outliers, the error is calculated for each correspondence and a good set of correspondences is selected by minimizing the following cost function: 2 2 L(ˆ p ) − x + L(ˆ min p) − x , (8) (x, x , λ) = T {pˆ ,ˆp |ˆp F pˆ =0} ˆ and p ˆ are estimates of the true points. Fitzgibbon approximates these where p points by first undistorting the noisy image coordinates (p = Li (x)), performing the Hartley–Sturm correction [6,14] in the second step, and distorting the corrected points (ˆ x = L (ˆ p)) in the last step.
172
3.2
J. Maier and K. Ambrosch
Modification of Fitzgibbon’s Algorithm for Area–Based Correspondence Search
Fitzgibbon’s algorithm performs quite well when nearly noise–less data is available. When only noisy data is available, however, the RANSAC algorithm for eliminating the incorrect correspondences can demand a high computational effort. One reason is the calculation of the reprojection error for every correspondence, particularly when the number of available correspondences is high. Therefore, a faster calculation of the reprojection error is preferable. This is achieved by using the optimal correction method of Kanatani for triangulation from two views [15,16] instead of the Hartley–Sturm correction [6,14]. This method is significantly faster (approx. 17 times — extracted from Figure 4 of [15]) than the Hartley–Sturm correction. Additionally to the effect that the algorithm performs faster, the subsequent optimization–algorithm (see section 3) is based on the reprojection error. Thus, the corrected point correspondences are not only used for the error calculation in the RANSAC algorithm. For calculating the corrected points and the reprojection error, every point correspondence is undistorted by equation (3) in the first place. Afterwards, a ˆ = 0 is calculated by ˆ T F p first approximation which may not strictly satisfy p ˜, F p ˜ Pk F p ˜ p ˆ=p ˜− p (9) ˜ , Pk F T p ˜ ˜ + F T p ˜ , Pk F p Fp ˜ ˜, F p ˜ Pk F T p p ˆ =p ˜ − p , (10) T ˜ , Pk F T p ˜ ˜ , Pk F p ˜ + F p Fp ˜ and p ˜ are undistorted (homogeneous) point correspondences and Pk ≡ where p diag(1, 1, 0) is a projection matrix. For a further improvement of the correspondences, an iterative scheme is used where the solution of ˆ, p ˜ − p ˆ ,p ˆ, F p ˆ + F p ˜ −p ˆ + FTp ˆ Pk F p ˆ p ˆ ˆ =p ˜− p (11) ˆ , Pk F T p ˆ ˆ , Pk F p ˆ + F T p Fp ˆ, p ˜ − p ˆ ˆ ,p ˆ, F p ˆ + F p ˜ −p ˆ + FTp ˆ Pk F T p p ˆ = p ˆ ˜ − p (12) ˆ , Pk F T p ˆ ˆ , Pk F p ˆ + F T p Fp is calculated during each iteration. After each iteration, the corrected point corˆˆ respondences are updated by the solutions of equations (11) and (12) (ˆ p←p ˆ ˆ ). This computation is repeated until the iterations converge. This ˆ ←p and p mostly occurs after three to four iterations. On this point it has to be mentioned that the real implementation of Kanatani’s correction method is optimized for numerical implementation as described in [15]. The results are the same. In contrast to Fitzgibbon’s algorithm, an area–based (cf. section 2) instead of a feature–based matching algorithm is used. This has the advantage that the area in the image from where the correspondences for estimating F and λ are selected can be chosen freely. Therefore, correspondences can be selected that are in most cases independent from each other (no more than 2 points lie on
Distortion Compensation for Movement Detection
(a)
173
(b)
Fig. 1. Selection of point correspondences. (a) Different image areas. (b) 500 selected point correspondences on a real image.
one line). This can lead to a similar solution for F and λ by using less point correspondences than for a quasi randomly selected point set (e.g. feature–based matching algorithm). Moreover, a region in the image can be chosen where the impact of radial distortion leads to a more accurate solution of Fitzgibbon’s algorithm (cf. section 3.3). Naturally, the RANSAC algorithm randomly selects point correspondences from the whole set [6]. Thus, there is a possibility that correspondences are randomly selected that are concentrated on a small region in the image. However, experimental tests have shown that in such cases and for the use with real data only a very inaccurate solution for F and λ can be calculated. To prevent such a configuration, the necessary point correspondences for the RANSAC algorithm are split up into at least 4 point sets with nearly the same number of correspondences that are randomly chosen from different areas of the image. Hence, the possibility that point correspondences are selected during the execution of the RANSAC algorithm that deliver a good approximation to the real F and λ is higher than for a normal random selection of correspondences. This leads in many cases to an improvement in computation time. 3.3
3.3 Point Sampling
As mentioned in section 3.2, the area from which the point correspondences are selected can be chosen freely thanks to the area-based matching method. Thus, the image is divided into 4 different areas (cf. Figure 1(a)), of which only 2 are used for selecting correspondences. Figure 1(a) shows 3 different areas, denoted W1, W2, and W3. For selecting point correspondences only the areas W1 and W2 are used. This is due to the fact that the effect of radial distortion in the midmost circular area is insignificant for normal perspective, non-catadioptric, and non-wide-angle cameras where the distortion center is in or near the middle of the image. Therefore, and as a result of numerical accuracy and noise, the calculation of the radial distortion would be very inaccurate if only point correspondences from this midmost area were selected. As a result, excluding points from this area can lead to a better solution for λ and to
a reduced execution time for the RANSAC algorithm. Experimental tests have shown that point correspondences selected from area W3 distort the solution of F and λ in most cases. This can be traced back to optical effects that arise near the borders of an aperture and with increasing distance from the optical axis of the lens system. Such effects may be, among others, diffraction, coma, and the growing influence of higher radial distortion terms (e.g. $\lambda_2 x^4$). In addition to the aforementioned area restriction, the points are selected in such a way that most of them are independent of each other (no more than 2 points lie on one line). Thus, points are selected that are located on circles, indicated by the dashed lines in Figure 1(a). The minimum distance from one point to the next is at least 10 pixels on the arc of the respective circle, cf. Figure 1(b). Moreover, the minimum distance from one circle to the next is also 10 pixels. As can be seen in Figure 1(b), on the right side of the outer circle a gap between selected point correspondences is present. This is due to the fact that before a point correspondence is selected, it is checked whether the distance of a point in the first image to its corresponding point in the second image is below a threshold. If not, the correspondence is rejected. Thus, the probability of selecting only correct correspondences is very high.
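The sampling scheme can be sketched as follows. The concrete radii delimiting W1 and W2, the displacement threshold, and the flow lookup are placeholders of ours that would come from the area-based matcher of section 2.

```python
import numpy as np

def sample_on_circles(center, r_min, r_max, ring_step=10.0, arc_step=10.0):
    """Candidate locations on concentric circles covering areas W1/W2,
    spaced roughly 10 pixels apart along each arc and between circles."""
    rings = []
    for r in np.arange(r_min, r_max, ring_step):
        n = max(1, int(2.0 * np.pi * r / arc_step))
        angles = 2.0 * np.pi * np.arange(n) / n
        xs = center[0] + r * np.cos(angles)
        ys = center[1] + r * np.sin(angles)
        rings.append(np.stack([xs, ys], axis=1))
    return np.concatenate(rings, axis=0)

def keep_plausible(points, flow_at, max_disp=20.0):
    """Reject a candidate if its matched point in the second image is too far
    away (illustrative threshold); flow_at(x, y) -> (dx, dy) is assumed."""
    disp = np.array([flow_at(x, y) for x, y in points])
    return points[np.hypot(disp[:, 0], disp[:, 1]) < max_disp]
```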
3.4 Optimizing the Pre-estimated Radial Distortion Parameter
After pre-estimation of the radial distortion parameter λ by the modified Fitzgibbon algorithm and the calculation of the corrected point correspondences $\hat{p}_i \leftrightarrow \hat{p}'_i$ (during RANSAC estimation), a subsequent algorithm tries to correct λ. The method is based on the Levenberg–Marquardt algorithm [6]. The corrected point correspondences are used during the RANSAC estimation for calculating the reprojection error. On this error, a threshold is used to mark inliers and outliers. For marking corrected point correspondences as inliers or outliers, the same error but a slightly higher threshold is used. These inliers $\hat{p}_i \leftrightarrow \hat{p}'_i$, their associated measured point correspondences $x_i \leftrightarrow x'_i$, and λ are the input to the optimization algorithm.

In the first step, $\hat{p}_i = (\hat{x}_{p_i}, \hat{y}_{p_i})$ and $\hat{p}'_i = (\hat{x}_{p'_i}, \hat{y}_{p'_i})$ are distorted. Afterwards, estimated measurement vectors $\hat{X}_i = (\hat{x}_i^T, \hat{x}_i'^T)^T$ and measurement vectors $X_i = (x_i^T, x_i'^T)^T$ are defined, where

$$\hat{X} = (\hat{X}_1^T, \hat{X}_2^T, \ldots, \hat{X}_n^T)^T \qquad (13)$$
$$X = (X_1^T, X_2^T, \ldots, X_n^T)^T. \qquad (14)$$
In the next step, an error vector $\epsilon = X - \hat{X}$ and the Jacobian matrix J are computed:

$$J = \left( \frac{\partial \hat{x}_1^T}{\partial \lambda}, \frac{\partial \hat{x}_1'^T}{\partial \lambda}, \ldots, \frac{\partial \hat{x}_n^T}{\partial \lambda}, \frac{\partial \hat{x}_n'^T}{\partial \lambda} \right)^T. \qquad (15)$$

To have a criterion for deciding which λ fits the data better, the new estimate $\hat{\lambda}$ or the old one, the reprojection error $\epsilon_t$ is calculated over all selected point
correspondences with

$$\epsilon_t = \sum_{i=1}^{n}\left(\left\|x_i - \hat{x}_i\right\|^2 + \left\|x'_i - \hat{x}'_i\right\|^2\right). \qquad (16)$$
On the basis of J and $\epsilon$,

$$A = J^T J \qquad (17)$$
$$B = J^T \epsilon \qquad (18)$$
can be calculated. The ongoing computation is performed iteratively, depending on the reprojection error $\hat{\epsilon}_t$ that is calculated using the new estimate $\hat{\lambda}$. In this iterative process, A is multiplied by a value $1 + \kappa$ that varies from iteration to iteration. The initial value of κ was chosen to be $10^{-3}$. Thus, the update value Δ for λ can be calculated with

$$\Delta = \left(A\,(1 + \kappa)\right)^{-1} B \qquad (19)$$
and λ can be updated:

$$\hat{\lambda} = \lambda + \Delta. \qquad (20)$$
To check whether this new distortion parameter $\hat{\lambda}$ fits the data better than the old one, the corrected point correspondences $\hat{p}_i \leftrightarrow \hat{p}'_i$ are distorted using $\hat{\lambda}$. Afterwards, the new overall error $\hat{\epsilon}_t$ is calculated with equation (16) and compared with $\epsilon_t$. If the new error is smaller, λ and $\epsilon_t$ are replaced by the new $\hat{\lambda}$ and $\hat{\epsilon}_t$, and κ is divided by 10, cf. equation (21). In addition, the whole process of optimizing λ is repeated with the updated λ, starting again with the computation of the Jacobian matrix in equation (15). If the new overall error $\hat{\epsilon}_t$ is larger than or equal to $\epsilon_t$, κ is multiplied by 10 (cf. equation (21)) and the update value Δ is calculated again, starting with equation (19):

$$(\lambda,\ \epsilon_t,\ \kappa) \leftarrow \begin{cases} (\hat{\lambda},\ \hat{\epsilon}_t,\ \kappa \cdot 10^{-1}) & \text{if } \epsilon_t > \hat{\epsilon}_t \\ (\lambda,\ \epsilon_t,\ \kappa \cdot 10) & \text{if } \epsilon_t \le \hat{\epsilon}_t \end{cases} \qquad (21)$$

This iteration is stopped if the update value Δ is very small compared to λ, or if the new overall error $\hat{\epsilon}_t$ is larger than or equal to the old one ($\epsilon_t$) a few times in succession. In most cases, the optimized λ fits the data better than the one calculated with the modified Fitzgibbon algorithm. An exception occurs if the pre-estimated λ is not even close to the correct value and/or the fundamental matrix does not describe the present epipolar geometry.
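A compact sketch of the refinement loop of equations (15)–(21) is given below, assuming λ is a scalar and that the distortion/projection model is supplied through two callbacks. These callbacks, and all numerical constants except the initial κ = 10⁻³, are placeholders of ours.

```python
import numpy as np

def refine_lambda(lam, reproj_error, jacobian,
                  kappa=1e-3, max_iters=100, max_fails=3, tol=1e-10):
    """Levenberg-Marquardt style refinement of the radial distortion
    parameter.  reproj_error(lam) -> (eps_t, eps): the scalar error of
    Eq. (16) and the residual vector X - X_hat; jacobian(lam) -> the
    stacked derivatives d x_hat / d lambda of Eq. (15) as a 1-D array."""
    eps_t, eps = reproj_error(lam)
    fails = 0
    for _ in range(max_iters):
        J = jacobian(lam)                    # Eq. (15)
        A = float(J @ J)                     # Eq. (17), scalar for scalar lambda
        B = float(J @ eps)                   # Eq. (18)
        delta = B / (A * (1.0 + kappa))      # Eq. (19)
        lam_hat = lam + delta                # Eq. (20)
        new_t, new_eps = reproj_error(lam_hat)
        if new_t < eps_t:                    # Eq. (21): accept, relax damping
            lam, eps_t, eps = lam_hat, new_t, new_eps
            kappa, fails = kappa / 10.0, 0
        else:                                # Eq. (21): reject, increase damping
            kappa, fails = kappa * 10.0, fails + 1
        if abs(delta) < tol * max(abs(lam), 1.0) or fails >= max_fails:
            break
    return lam
```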
4 Movement Detection Based on Dense Optical Flow
To be able to detect a moving object in two temporally consecutive images on the basis of the distance from points to epipolar lines, a threshold has to be calculated that is large enough not to identify noisy pixels as a moving object and small
enough to recognize slowly moving objects. Therefore, the threshold must be adaptive from image pair to image pair. For this reason, the aforementioned point distance is calculated for every undistorted ($\tilde{p} = L_i(x)$ and $\tilde{p}' = L_i(x')$) correspondence $\tilde{p}_i \leftrightarrow \tilde{p}'_i$ marked as inlier during the execution of the RANSAC algorithm, cf. equation (23). Prior to this calculation, the epipolar lines have to be computed (equation (22)). The distances are calculated for both the first (temporal) and the second image ($d_{r_i}$ and $d_{l_i}$) and are combined afterwards by equation (24):

$$l_i = F\tilde{p}_i \qquad (22)$$
$$d_{r_i} = \frac{\tilde{p}_i'^T\, l_i}{\sqrt{l_{1_i}^2 + l_{2_i}^2}} \quad \text{with } l_i = (l_{1_i}, l_{2_i}, 1)^T \qquad (23)$$
$$d_i = \sqrt{d_{r_i}^2 + d_{l_i}^2} \qquad (24)$$
From these distances $d_i$ the quartiles are calculated, and distances below the lower and above the upper quartile are rejected. From the remaining distances the mean value μ, which is the average of the arithmetic mean $\bar{d}$ and the median $\tilde{d}$, as well as the standard deviation s, are calculated. From μ and s the threshold is calculated. Experimental tests have shown that a good threshold value $TH_l$ is given by:

$$TH_l = \mu + 4s. \qquad (25)$$
For detecting a moving object in two temporally consecutive images, the distances from all available (undistorted) points in the image to their corresponding epipolar lines are calculated with equations (22) to (24). Then, in practice, two thresholds are applied to these distances: a distance must be above $TH_l$ and below $TH_u = 3 \cdot TH_l$. In this case, the related pixel is marked as a moving object.
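The distance computation and the adaptive thresholding can be written compactly as in the sketch below. It assumes undistorted homogeneous points stacked row-wise; the NumPy interface is our choice, while the quartile-based trimming and the two thresholds follow the description above.

```python
import numpy as np

def epipolar_distances(F, pts1, pts2):
    """Symmetric point-to-epipolar-line distances of Eqs. (22)-(24) for
    undistorted homogeneous points pts1, pts2 of shape (N, 3)."""
    l2 = (F @ pts1.T).T            # epipolar lines in the second image
    l1 = (F.T @ pts2.T).T          # epipolar lines in the first image
    d_r = np.abs(np.sum(pts2 * l2, axis=1)) / np.hypot(l2[:, 0], l2[:, 1])
    d_l = np.abs(np.sum(pts1 * l1, axis=1)) / np.hypot(l1[:, 0], l1[:, 1])
    return np.hypot(d_r, d_l)      # Eq. (24)

def movement_threshold(inlier_d):
    """Adaptive lower threshold TH_l of Eq. (25) from the inlier distances."""
    q1, q3 = np.percentile(inlier_d, [25, 75])
    d = inlier_d[(inlier_d >= q1) & (inlier_d <= q3)]   # trim outer quartiles
    mu = 0.5 * (d.mean() + np.median(d))                # average of mean and median
    return mu + 4.0 * d.std()

def moving_mask(all_d, th_l):
    """Pixels whose distance lies between TH_l and TH_u = 3 * TH_l."""
    return (all_d > th_l) & (all_d < 3.0 * th_l)
```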
5 Results and Discussion
The whole algorithm for detecting a moving object in two temporally consecutive images comprises multiple independent sub-algorithms. Thus, one final and two intermediate results are available. In section 5.1 the results of the modified Fitzgibbon algorithm are presented and compared with those including the optimization algorithm. In the subsequent section 5.2 the results of the whole algorithm for movement detection are presented.
5.1 Evaluation with Ground Truth Datasets
To study the performance of Fitzgibbon’s algorithm in contrast to the same algorithm but with additional optimization of the radial distortion parameter λ, the ground truth dataset “Teddy” (from Middlebury database [17]) with a size of 584x388 pixels was distorted. Moreover, for testing the performance in
a more realistic condition, additional motion blur was added. Afterwards, the optical flow was calculated by the method described in section 2. For cases, where motion blur was added, an optimal aggregation and Census block size of 15 pixels each was selected, and in the case without motion blur different block sizes were applied from image pair to image pair for the optical flow calculation. On the basis of this optical flow, the fundamental matrix F and the radial distortion parameter λ were calculated by Fitzgibbon’s algorithm. Subsequently, λ was optimized and both, the old (unoptimized) and the new λ were stored for a later comparison. These calculations were performed for different sample sizes of point correspondences, starting with 30 and going up in increments of 20 to 350 point correspondences. The number of point correspondences used for the estimation of F and λ during each RANSAC iteration was 9. From the resulting λ–values (unoptimized and optimized), those below the lower and above the upper quartile were rejected. From the remaining values, the median, minimal and maximal λ–values were calculated and the final results are presented in Figure 2. The median is indicated by a trend line. Figure 2(b) shows, that it was possible to reduce the error and to achieve more realistic λ–values even when the optical flow calculation produces more noise due to a small block size. Nearly the same can be observed in Figure 2(d), where a motion blur of up to 10 pixels is present.
Fig. 2. Accuracy of λ. The ground truth is shown as red dashed line. (a) λ resulting from Fitzgibbon’s algorithm over different block sizes. (b) Optimized λ over different block sizes. (c) λ resulting from Fitzgibbon’s algorithm over varying motion blur. (d) Optimized λ over varying motion blur.
Fig. 3. Detection of 4 moving persons. (a) First image. (b) Second image. (c) Optical flow resulting from first and second image. The color indicates the amount of displacement. (d) Detected moving persons.
5.2 Real-World Scenes
For testing the performance of the whole movement-detection algorithm, a camera was mounted on a UAV. From the recorded video stream, two consecutive images with a resolution of 1280x720 pixels were extracted (see Figures 3(a) and 3(b)), in which 4 people are imaged moving on the ground. From these two images the optical flow was calculated (see Figure 3(c)) as described in section 2. Then, the fundamental matrix and the optimized λ were calculated using a sample set of 300 point correspondences. The number of point correspondences used for the estimation of F and λ during each RANSAC iteration was 9. On this basis, the moving persons were correctly detected in the image, as can be seen in Figure 3(d).
6 Conclusion
We presented a novel method for movement detection based on a dense optical flow calculation, for which we estimated the fundamental matrix and the radial distortion by an adaptation of a well-known algorithm and successfully corrected the value of the radial distortion toward the real value. We showed that the optimization algorithm is able to correct an estimate of the radial distortion parameter λ even if it is not very close to the real solution. Moreover, we showed that the detection of moving objects can be performed in the real world by
providing an adaptive threshold and a good approximation of the fundamental matrix and radial distortion.
References 1. Gonzalez-Aguilera, D., Gomez-Lahoz, J., Rodriguez-Gonzalvez, P.: An Automatic Approach for Radial Lens Distortion Correction From a Single Image. IEEE Sensors Journal 11 (2011) 2. Lucchese, L., Mitra, S.K.: Correction of Geometric Lens Distortion Through Image Warping. In: Proc. of the 3rd Int. Symposium on Image and Signal Processing and Analysis (2003) 3. Fitzgibbon, A.W.: Simultaneous linear estimation of multiple view geometry and lens distortion. In: Computer Vision and Pattern Recognition (2001) 4. Barreto, J.P., Daniilidis, K.: Fundamental Matrix for Cameras with Radial Distortion. In: Proc. of the Tenth IEEE Int. Conf. on Computer Vision (2005) 5. Steele, R.M., Jaynes, C.: Overconstrained Linear Estimation of Radial Distortion and Multi-view Geometry. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 253–264. Springer, Heidelberg (2006) 6. Hartley, R., Zisserman, A.: Multiple View Geometry in computer vision, 2nd edn. Cambridge University Press, Cambridge (2000) 7. Froeba, B., Ernst, A.: Face detection with the modified census transform. In: Proc. of the Sixth IEEE Conf. on Automatic Face and Gesture Recognition (2004) 8. Ambrosch, K., Kubinger, W.: Accurate hardware-based stereo vision. Computer Vision and Image Understanding 114 (2010) 9. Puxbaum, P., Ambrosch, K.: Gradient-based modified census transform for optical flow. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Chung, R., Hammoud, R., Hussain, M., Kar-Han, T., Crawfis, R., Thalmann, D., Kao, D., Avila, L. (eds.) ISVC 2010. LNCS, vol. 6453, pp. 437–448. Springer, Heidelberg (2010) 10. Zabih, R., Woodfill, J.I.: Non-parametric Local Transforms for Computing Visual Correspondence. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 800, pp. 151–158. Springer, Heidelberg (1994) 11. Shimizu, M., Okutomi, M.: Precise Sub-pixel Estimation on Area-Based Matching. In: Proc. of the Eight IEEE Int. Conf. on Computer Vision (2003) 12. Willson, R.G., Shafer, S.A.: What is the Center of the Image? In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (1993) 13. Hartley, R.: In Defense of the Eight-Point Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 14. Hartley, R., Sturm, P.: Triangulation. Computer Vision and Image Understanding 68, 146–157 (1997) 15. Kanatani, K., Sugaya, Y., Niitsuma, H.: Triangulation from Two Views Revisited: Hartley-Sturm vs. Optimal Correction. In: Proc. of the 19th British Machine Vision (2008) 16. Sugaya, Y., Kanatani, K.: Highest Accuracy Fundamental Matrix Computation. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part II. LNCS, vol. 4844, pp. 311–321. Springer, Heidelberg (2007) 17. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M.J., Szeliski, R.: A Database and Evaluation Methodology for Optical Flow. Int. J. of Comp. Vis. 92 (2011)
Free Boundary Conditions Active Contours with Applications for Vision

Michal Shemesh and Ohad Ben-Shahar

Ben-Gurion University of the Negev, Beer-Sheva, Israel
{shemeshm,ben-shahar}@cs.bgu.ac.il
Abstract. Active contours have been used extensively in vision for more than two decades, primarily for applications such as image segmentation and object detection. The vast majority of active contour models make use of closed curves, and the few that employ open curves rely on either fixed boundary conditions or no boundary conditions at all. In this paper we discuss a new class of open active contours with free boundary conditions, in which the end points of the open active curve are restricted to lie on two parametric boundary curves. We discuss how this class of curves may assist and facilitate various vision applications, and we demonstrate its utility in applications such as boundary detection, feature tracking, and seam carving.
1 Introduction
Active contours (a.k.a. snakes) are a popular family of models in which each possible contour C(s) in the image plane is associated with an energy and, once initialized (manually or automatically), the contour deforms continuously in order to converge to an optimal energetic state. Suggested first by Kass et al. [1], the original formulation of active contours uses an explicit parametric representation of the curve and therefore became known as the parametric active contour model. The energy of active contours is usually formulated as a functional containing internal and external energy terms

$$E_{curve}(C(s)) = \int \left[ E_{int}(C(s)) + E_{ext}(C(s)) \right] ds. \qquad (1)$$

During the evolution of the contour, the internal energy term controls its shape while the external energy term attracts the curve to certain image features. One can define each of them according to the desired applications, which typically have been either linear feature detection (e.g. [2,3,4]), image segmentation (e.g. [5,6]), object detection (e.g. [7,8]), or motion tracking (e.g. [9,10,11]). Owing to the type of applications that they have usually been adopted for, most existing active contour models are formulated on closed curves, while far fewer open active models are put to use. Interestingly, however, the first instance of open active contours (OAC) was already presented in the seminal work by Kass et al. [1], and since then it has been adapted occasionally for applications that involve
linear features, for example in geophysical [3], medical [12], or biological [4,13] contexts. Apart from distinct energy functionals that suit their respective applications, the different uses of OAC have been characterized by their different boundary conditions. Notably, these boundary conditions have focused exclusively on two cases: either fixed boundary conditions or no boundary conditions at all. In this work we present and formulate a third class of OACs, whose boundary conditions are known in the calculus of variations as “free” [14, pp. 68-74]. We discuss why this type of object is potentially a most useful class of active contours for vision, and we demonstrate it in the context of several different applications.
Fig. 1. Four typical examples for OAC (black) and boundary curves (red) drawn on an image I. An OAC which is represented as a function of one variable (a-b) and as a parametric curve (c-d).
2 Related Work
As mentioned above, OAC have been used much less frequently in vision, and when they are employed it is either with fixed boundary conditions or no boundary conditions at all. Fixed boundary conditions, the case where both end points of the active contour are fixed in space, were used for linear feature detection in road maps and medical images [12,15,16,17]. In such cases, the end points are assumed to be anchors that are known with full certainty in advance and hence need not shift in space during evolution. This situation reflects the most fundamental minimization problem of calculus of variation (i.e., the problem of finding the shortest string between two given end points [18, pp. 33-34]) and in practice it involves a curve evolution that resembles the classical closed snake for all points except at the end points (which preserve their position throughout the iterations). Unlike this classical problem, visual OAC with fixed boundary conditions also involve the influence of external forces. Once the initial curve is set with its two end points fixed in place, additional “inflation” [12] or “deflation” [17] forces can reduce its sensitivity to local minima during the evolution. Alternatively, local minima have been handled by global search via dynamic programming. Such approach was first implemented for the active contours parametric model [19] and later on used to find the path of minimal cost between two given points using either the surface of minimal action [15] or a function describing at each point the minimal cost of a curve connecting that point to a pre-determined “source point” in the image [16].
In some applications, where the desired end points of the snake are not known a priori, fixed boundary conditions are clearly improper. In such cases, the energy of the OAC is endowed with “stretching forces” that affect its end points in the direction tangent to the curve [4]. Related also to earlier ideas on incremental “snake growing” [2,20], this approach assumes no special constraints on the end points, and hence evolves all snake points similarly. In cases where the movement of the end points is not only in the tangential direction, internal energies such as inertia and differential energy are integrated [10] and affect the movement of the end points as well. While open curves are more naturally represented parametrically, OAC have also been considered implicitly [21] in the spirit of modern approaches to active contours that employ level set formulations (e.g., [22,23,24]). Since OAC are naturally suited for linear features like edges and filaments, they are less consistent with the fact that level sets of functions are generically closed curves. Hence, an implicit representation for OACs was suggested via the medial axis-derived centerline of a level set function induced by the curve [21]. To better fit the closed nature of level sets, an alternative approach defined the OAC via functions whose level sets partially surround the image margins but cross its interior along a linear feature of interest [25]. This solution, however, is appropriate only when the open curve connects two different margins of the image, which is usually not the case. Here too the end points obey no boundary conditions in order to stretch along the sought-after linear features. Since OAC are naturally suited for thin and elongated features, applications of both types of OAC models have focused on cases where one needs to detect such features more robustly than local edge detection can provide. Included among such applications are the extraction of roads (e.g., [10,16,21]), the detection of coastlines in satellite images (e.g., [3]), model-based contour detection (e.g., [26,27]), extraction of thin linear features in medical or biological contexts (e.g., [4,12,13,16]), and boundary detection (e.g., [2,17]). Unlike the previous types of OAC, in this paper we propose a new class of OAC with free boundary conditions. In this model each of the two end points of the OAC is free to move, but only in a particular and constrained fashion, namely along a given parametric boundary curve, as shown in Fig. 1 (red). This configuration is often natural in vision applications, where it is known a priori that the end points cannot depart from a particular image-plane structure of co-dimension 1. Furthermore, this formulation can greatly simplify the initialization process of active contours by providing some freedom for the initial location of the end points (unlike in the “fixed boundary conditions” case) but not too much freedom (as in the “no boundary conditions” case) that could divert the final convergence state of the snake from the desired location. In the rest of this paper we first discuss the mathematical principles that drive this class of OAC and then describe and demonstrate its practicality for applications ranging from boundary detection and recognition, through tracking, to seam carving.
3 Mathematical Foundations
Active contours are deformable curves in images which evolve over time to an optimal energetic state. The energy functional from Eq. 1 can be expressed more explicitly by

$$E_{curve} = \int_0^1 f(C, C', \ldots) + g\!\left(I(C), \tfrac{\partial I}{\partial x}(C), \tfrac{\partial I}{\partial y}(C), \ldots\right) ds \qquad (2)$$

where $C(s): [0,1] \to \mathbb{R}^2$ describes a parametrized differentiable planar curve in an image $I: [0,M]\times[0,N] \to \mathbb{R}$, $f(\cdot)$ is a function of the curve, and $g(\cdot)$ is a function of the image beneath the curve (including their derivatives up to some desired order). During the evolution of the active contour toward a minimal energetic state, the $f(\cdot)$ function controls its shape while the $g(\cdot)$ function pulls the curve to certain image features. In the classical snakes by Kass et al. [1], the active contour energy functional was defined as follows:

$$E_{curve} = \int_0^1 \alpha|C'(s)|^2 + \beta|C''(s)|^2 + g(s)\, ds, \qquad g(s) = -\gamma\,|\nabla(G_\sigma * I(C(s)))|, \qquad (3)$$

where α and β control the elasticity and rigidity of the contour, respectively, γ weighs the degree to which high gradient regions attract it, and $G_\sigma$ is a smoothing Gaussian kernel with standard deviation σ. In contrast to the fixed or no boundary conditions used for open active contours so far, in the OAC framework discussed here the end points of the open curve are restricted to lie on two parametric boundary curves. This type of free boundary conditions requires special care in the control of the end point dynamics, as dictated by the calculus of variations [14,28] and derived in the next subsections.

Before we turn to derive the dynamics of OAC and the constraints induced by the free boundary conditions, we note that an OAC in the image plane can be represented in various ways. Most generally, it can be represented as a parametric curve C(s) = (x(s), y(s)), and the analysis for this case is presented in the next subsection. However, if the OAC is stretched “left to right” in the image plane and is not allowed to “backtrack” at any point, then it may be represented more simply as a function of the form y = y(x) (i.e., C(s) = (s, y(s)); see Fig. 1a,b). In such cases, the variational analysis becomes univariate and provides additional insights, but since it is a special case of the general analysis, it is derived in the Supplementary Materials only.
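For concreteness, the external term g of Eq. (3) can be evaluated on the pixel grid as in the sketch below. This is a straightforward NumPy/SciPy rendering; the nearest-neighbour sampling of g along the contour is our simplification.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def external_energy(image, gamma=1.0, sigma=2.0):
    """g = -gamma * |grad(G_sigma * I)| evaluated on the whole pixel grid."""
    smoothed = gaussian_filter(image.astype(float), sigma)
    gy, gx = np.gradient(smoothed)
    return -gamma * np.hypot(gx, gy)

def sample_along_contour(g, contour):
    """Evaluate g at (x, y) contour points by nearest-neighbour lookup."""
    xs = np.clip(np.round(contour[:, 0]).astype(int), 0, g.shape[1] - 1)
    ys = np.clip(np.round(contour[:, 1]).astype(int), 0, g.shape[0] - 1)
    return g[ys, xs]
```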
3.1 The Variational Problem for C(s) = (x(s), y(s))
Assume one represents the OAC in the generalized parametric form C(s) = (x(s), y(s)), where s is the curve parameter. For simplicity, suppose again that the energy functional over the curve is limited to first order derivatives. Under these assumptions the energy functional from Eq. 3 can be written generally as

$$J[C(s)] = \int_0^1 \Phi(x, y, x', y')\, ds, \qquad (4)$$
where Φ incorporates both the internal and external terms. Suppose now that the two end points of the active contour, i.e., $P_0 = C(0)$ and $P_1 = C(1)$, are constrained to lie on two smooth boundary curves $B_0(q)$ and $B_1(q)$, respectively, where $B_i(q): [0,1] \to \mathbb{R}^2$ ($i = 0, 1$). Let $q_0$ and $q_1$ be the parameter values where C(s) intersects the boundary curves $B_0(q)$ and $B_1(q)$, i.e.,

$$B_0(q) = (X_0(q), Y_0(q)), \qquad P_0 = C(0) = B_0(q_0)$$
$$B_1(q) = (X_1(q), Y_1(q)), \qquad P_1 = C(1) = B_1(q_1), \qquad (5)$$

as illustrated in Fig. 1c. This free boundary variational problem leads to the following coupled pair of Euler–Lagrange equations [14,28]:

$$\Phi_x - \frac{d}{ds}\Phi_{x'} = 0, \qquad \Phi_y - \frac{d}{ds}\Phi_{y'} = 0. \qquad (6)$$

The additional end point constraints now become (see [14, pp. 222-228])

$$(\Phi_{x'}, \Phi_{y'})\big|_{s=0} \cdot (X_0', Y_0')\big|_{q_0} = (\Phi_{x'}, \Phi_{y'})\big|_{s=0} \cdot B_0'\big|_{q_0} = 0$$
$$(\Phi_{x'}, \Phi_{y'})\big|_{s=1} \cdot (X_1', Y_1')\big|_{q_1} = (\Phi_{x'}, \Phi_{y'})\big|_{s=1} \cdot B_1'\big|_{q_1} = 0 \qquad (7)$$

(where $\Phi_{x'}$, $\Phi_{y'}$ stand for $\Phi_{x'}(x, y, x', y')$ and $\Phi_{y'}(x, y, x', y')$, respectively). These constraints imply a particular geometrical configuration at the two end points: the vector of partial derivatives of the function Φ with respect to x' and y' should be perpendicular to the curves $B_0$, $B_1$ at the end points $P_0$, $P_1$, respectively. This constraint is typical of variational problems with free boundary conditions and is known as the transversality condition (see [14, pp. 72-73]).
3.2 Implementation for a Typical Visual OAC
Let the OAC be represented as a general parametric curve C(s) and let us consider an energy function Φ from the family described in Eq. 3. For simplicity of presentation, in the following we will also assume β = 0 (see the Supplementary Materials for the more general case); hence, applying Eq. 6 to the functional

$$E_{curve} = \int_0^1 \alpha|C'(s)|^2 + g(s)\, ds, \qquad g(s) = -\gamma\,|\nabla(G_\sigma * I(C(s)))|, \qquad (8)$$

yields the following two Euler–Lagrange equations:

$$\Phi_x - \frac{d}{ds}\Phi_{x'} = 0 \;\Rightarrow\; -2\alpha x'' + \frac{\partial}{\partial x} g = 0$$
$$\Phi_y - \frac{d}{ds}\Phi_{y'} = 0 \;\Rightarrow\; -2\alpha y'' + \frac{\partial}{\partial y} g = 0. \qquad (9)$$

One more thing we need to take care of is the specific form that the transversality constraint takes for our selected energy functional. Applying Eq. 7 to our specific Φ yields the following additional end point constraints:

$$2\alpha\,(x', y')\big|_{s=0} \cdot (X_0', Y_0')\big|_{q_0} = 0, \qquad 2\alpha\,(x', y')\big|_{s=1} \cdot (X_1', Y_1')\big|_{q_1} = 0. \qquad (10)$$
Note that in this particular case the transversality constraint implies that the OAC must remain orthogonal to the boundary curves on both ends.
While the necessary conditions expressed in Eq. 9 should be satisfied at the extrema of the functional, a standard way to obtain these extrema is to use these equations in a steepest descent fashion [2,3,1]. Hence, if C(s, t) represents the curve C(s) at time t, its evolution over time is given by

$$C_t(s, t) = \frac{\partial}{\partial t} C(s, t) = -2\alpha C''(s, t) + \nabla g(s, t). \qquad (11)$$

Obviously, should any particular application require or desire a different composition of internal and external energies (cf. Eqs. 1 through 8), the Euler–Lagrange equations and the dynamics of the OAC would need to adjust accordingly. In practice, the OAC is represented as a time-evolving series of control points $\{x_i^t, y_i^t\}$ for $i = 1 \ldots n$. In order to enforce boundary conditions in each time step of the OAC's evolution in a way that respects Eq. 10, one must adjust the end points (points 1 and n) such that (1) they remain on the boundary curves and (2) the boundary curves are perpendicular to the OAC. More explicitly, the new coordinates for point i = 1 ($P_0$), which should be placed on boundary curve $B_0$ (see Fig. 1 and Eqs. 5), are $P_0 = (X_0(q_0), Y_0(q_0))$, where $q_0$ is the root of the polynomial D(q) (representing the derivative of the distance between point i = 2 of the OAC and the curve $B_0$)

$$D(q) = (X_0(q) - x_2)\,X_0'(q) + (Y_0(q) - y_2)\,Y_0'(q), \qquad (12)$$

s.t. $0 \le q_0 \le 1$. The computation of the new coordinates for point i = n is analogous. A detailed presentation of additional mathematical derivations and numerical considerations is given in the Supplementary Material.
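A minimal discrete version of the evolution and the end point treatment might look as follows. It is a sketch under several assumptions of ours: the curve is a polyline, the descent step moves the curve with the smoothing term and against the external-energy gradient, and the transversality condition of Eq. (10) is approximated by snapping each end point to the boundary-curve sample closest to its inner neighbour, which is where D(q) of Eq. (12) vanishes for a densely sampled boundary curve.

```python
import numpy as np

def evolve_oac(C, grad_g, b0, b1, alpha=0.5, dt=0.1, n_iters=500):
    """Evolve an open active contour C (an (n, 2) array of control points)
    under internal smoothing and an external force, with free boundary
    conditions enforced on the two end points.
    grad_g(p) -> gradient of the external energy g at point p;
    b0, b1    -> densely sampled boundary curves, shape (m, 2)."""
    C = np.asarray(C, dtype=float).copy()
    for _ in range(n_iters):
        # discrete second derivative C'' at interior points
        C2 = np.zeros_like(C)
        C2[1:-1] = C[:-2] - 2.0 * C[1:-1] + C[2:]
        ext = np.array([grad_g(p) for p in C])
        C += dt * (2.0 * alpha * C2 - ext)   # descent: smooth and follow -grad g
        # free boundary conditions: each end point is moved to the boundary
        # sample closest to its inner neighbour, i.e. the foot of the
        # perpendicular from point 2 (resp. n-1) onto the boundary curve
        C[0] = b0[np.argmin(np.linalg.norm(b0 - C[1], axis=1))]
        C[-1] = b1[np.argmin(np.linalg.norm(b1 - C[-2], axis=1))]
    return C
```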
3.3 Properties of the Dynamics
At first sight it may seem that the difference between OACs with and without free boundary conditions is negligible, as the only operational difference between them is the application of transversality. It remains to discuss how critical this constraint is in practice, and how it affects the dynamics of the OAC. In this context two comments are in order. Intuitively, when internal forces are missing, one would expect the end points to remain on the boundary curves while obeying external forces only. Indeed, when α = 0 we get $(\Phi_{x'}, \Phi_{y'}) = (0, 0)$, transversality as expressed in Eq. 7 becomes meaningless, and the OAC evolves under the influence of external forces only. On the other hand, when only internal forces control the curve, the transversality constraint becomes an operational necessity. Since the application of transversality is operationally important, it remains to discuss how this reconciles with the fact that an object boundary or other linear feature will not always be perpendicular to the boundary curve. Indeed, just as most regularizations prevent classical active contours from accurately locking on sharp features like corners, so does transversality in OAC with free boundary conditions, which might entail a slight deviation from the genuine image-based feature where it meets the boundary curve. In practice, the advantages of OAC with free boundary conditions well exceed this limitation, since the deviation
extends no more than one pixel from the boundary curves and is hence virtually unnoticeable.
4 Applications for Vision
OAC with free boundary conditions offer a new platform for a number of applications. In this section we discuss several such applications and the benefits introduced by addressing them with our new type of OACs. We emphasize that at this point it is not our goal to claim that using OAC with free boundary conditions is necessarily better than traditional techniques that have been studied extensively in the context of these applications, nor do we attempt to exhibit better-than-state-of-the-art performance in any of them. Rather, in discussing these applications we hope to present the opportunities that could lie ahead in using this class of OAC. Clearly, to beat the state of the art one would need to optimize the use of these objects for any particular application; in particular, further research would be required in each case in order to select the energy functional and the free boundary conditions accordingly.
4.1 Boundary Detection
Boundary detection is one of the most natural and popular applications of active contours. As we now discuss, using OACs with free boundary conditions may be a better choice for certain instances of this application. Consider for example the detection of coastlines in satellite images [3,29,25]. In most cases the end points of the curve describing the coastline reside on the image boundaries, and the importance of their correct localization in these regions is no less than for any other point of the coastline. Hence, using fixed boundary conditions is appropriate only if one is able to determine the location of these end points accurately in advance. On the other hand, setting no boundary conditions may result in incomplete detection of the coastline if the end points depart from the image margins in the course of the evolution. Since in our case it is known that the end points must lie on the image margins, one may set these margins as free boundary conditions and allow the curve to localize the coastline end points on the image margins as part of the evolution process. As Fig. 2 shows, this provides results that neither of the classical OAC variants can provide. The same figure shows the application of the same tool in an agricultural setting, though here the boundary condition curves are defined inside the image. The initial contour evolves to lock down on the correct object contour and performs completion of the occluded areas. The external energy term was defined as $g = -\gamma\,|\nabla(G_\sigma * I(C))|$ with σ = 2. Note that for the detection of linear features such as those discussed above, the setting of the boundary conditions requires less care or scrutiny than would be necessary with fixed boundary conditions. This represents a generic advantage of OAC with free boundary conditions: it allows the user to specify boundary conditions easily and quickly and be sure that the resulting curve will
Fig. 2. A parametric OAC successfully locks down on linear features. The initial (a, shown via discrete control points) and final (b) convergence state of an OAC on a coastline satellite image (α = 0.5, γ = 2). Boundary curves (red) coincide with the image margins. Closeup of the ROI in panel b, and results using free (c), fixed (d), and no (e) boundary constraints. Three corresponding videos are provided in the supplementary material. Panel f and the closeup in panel g show a similar application in an agricultural setting, where the very rough initial contour (green) evolves to lock down on the correct object contour (blue). With a proper definition of the external energy, this process can also complete the occluding contour of the fruit over the occluders.
not escape these regions during the evolution of the curve. In practice, all that is required is a curve that intersects the linear feature, rather than a point that sits exactly on it. As shown in Fig. 2, this provides better results than traditional OAC, up to the effect of transversality as discussed in Sec. 3.3.
4.2 Feature Tracking
Active contours have been used successfully for tracking ever since their introduction to the vision community, and similar to their use for detection and segmentation, most tracking applications employ closed curves [30]. If the feature to be tracked is linear in structure, however, the natural active contour to use would be open. Still, since end points of linear features also tend to exhibit motion in video sequences (just imagine the margins of a road viewed from the driver's seat), the use of fixed boundary conditions becomes improper. In many cases, however, the end points are restricted to move along a known curve in the image plane, either because the linear feature slides along the image margins or along an occluder, or because of some known physical constraint in the world. In such cases, using OAC with free boundary conditions is the natural choice for ensuring accurate tracking results. One example is shown in Fig. 3 (upper row). Here the goal is to track the margins of the road, whose position changes from one frame to another. Since during the video sequence the road margins slide along the image margins, it is possible to define these boundaries (or parts of them) as the free boundary conditions of an OAC that tracks the road from frame to frame. The convergence state at each frame serves as the initial OAC for the next frame (in this case the first frame was initialized manually), which is then driven to its optimal state to lock again on the road margins. In this example the external energy term was defined by the negative values of the image gradient, i.e., $g = -\gamma\,|\nabla(G_\sigma * I(C))|$
with σ = 2 in order to extend the attraction region of the curve to the road strips. Note how the free boundary conditions ensure that the road is tracked successfully along its full extent. The additional tracking example shown in Fig. 3 (lower row) demonstrates a case where the boundary curves are determined by a physical constraint in the world. Here we seek to follow the rising spark in Jacob’s ladder1 , which is formed between two static wires but is free to move along them. This configuration fits naturally the free boundary conditions setup, where the boundary curves are defined along the wires. The external energy was defined by the negative value of the gray level image intensity, i.e., g = 1 − |(Gσ ∗ I(C))| (σ = 2) in order to attract the curve to the bright spark. Note that tracking configurations like this are prevalent in natural scenarios, and include traveling waves between walls, behavior of soap films, constrained viscous fluid dynamics, and more.
Fig. 3. Upper row: Road extraction and tracking using two OACs (α = 2, γ = 1). Superimposed on each frame (frames 1, 30, and 60) are the initial and final OAC configurations (in green and blue, respectively). The green OACs in frames 30 and 60 are the final OACs from the previous panels (frames 1 and 30, respectively), and the free boundary conditions are set to the red curves. A video is provided in the supplementary material. Lower row: The rising spark in Jacob's ladder (α = 0.1, γ = 1; right, showing time exposures of the phenomenon). The images show seven consecutive frames, each showing the initial and final configurations (in green and blue, respectively). Boundary curves (red) were aligned with the wires in all frames.
4.3 Seam Carving
Seam carving [31,32] is a popular approach for content-aware image resizing. In this method the image size and aspect ratio are changed by repeatedly removing or inserting low energy paths of pixels called seams. While seam carving employs discrete methods for finding optimal seams, a natural alternative is using

¹ Jacob's ladder is a device for producing a continuous train of sparks which rise upwards. The spark gap is formed by two wires, approximately vertical but gradually diverging away from each other towards the top in a narrow "V" shape.
the continuous framework of OAC with free boundary conditions. In this case, instead of a monotonic and connected path of low energy pixels crossing the image from left to right (or top to bottom), representing seams as OAC with free boundary conditions provides the opportunity to consider them in a continuous domain, where image boundaries define the free boundary conditions between which the curves are stretched (as shown in Figs. 1b and 4 (red) and defined in Sec. 4.1). The use of OAC for image resizing by seam carving can be performed with the following steps in a repeated fashion. First, an initial contour connecting opposite margins of the image is selected, either arbitrarily or using a more informative selection process. Then the OAC evolves until a minimal energetic configuration is found, while the proper pair of image boundaries serve as free boundary condition curves. Finally, the parametric OAC is converted into a set of pixels for removal or insertion. The flexibility of the OAC framework allows each of these operations to be defined in one of many ways according to the desired application of interest. Although using OAC with free boundary conditions for seam carving deserves a research program in its own right, it is clear that representing seams as such OAC could carry many advantages over the traditional discrete representation used thus far. First, they facilitate a much larger space of optimal seams, since OAC are not restricted to maintain monotonicity, as opposed to the traditional seam definition. For the same reason, such OAC are not restricted to a $[-45°, 45°]$ sector, as dictated by monotonicity in a regular pixel grid. Furthermore, these objects allow for easy incorporation of geometrical constraints in addition to image-based criteria, and they readily facilitate super-resolution seam carving by their very continuous nature. Fig. 4 shows two examples of seam carving using OAC with free boundary conditions. In this example, each initial contour was set as a traditional optimal seam and then evolved to a minimal energetic state. Then, all $l_i$ pixels that intersect the OAC in row i of the image were replaced with $l_i - 1$ pixels by averaging the color of each two neighboring pixels into one. In this case, we used α = 0.05, γ = 1, and σ = 1, and the external energy was defined as −g from Eq. 8 in order to attract the contour towards homogeneous areas.
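As a sketch of the last step, converting the converged OAC into pixels for removal, the simplified routine below removes exactly one pixel per row at the column where the contour crosses it and blends the two pixels that meet at the seam. The linear interpolation and the one-pixel-per-row restriction are our simplifications of the per-row averaging described above.

```python
import numpy as np

def contour_columns(contour, height):
    """For each image row, interpolate the column where the (top-to-bottom)
    contour crosses it.  contour is an (n, 2) array of (x, y) points with
    strictly increasing y."""
    return np.interp(np.arange(height), contour[:, 1], contour[:, 0])

def remove_seam(image, contour):
    """Carve one pixel per row along the converged OAC (simplified)."""
    h, w = image.shape[:2]
    cols = np.rint(contour_columns(contour, h)).astype(int).clip(1, w - 2)
    out = np.empty((h, w - 1) + image.shape[2:], dtype=image.dtype)
    for y, c in enumerate(cols):
        row = np.delete(image[y], c, axis=0)
        # blend the two pixels that now meet at the seam
        row[c - 1] = ((image[y, c - 1].astype(float) +
                       image[y, c].astype(float)) / 2).astype(image.dtype)
        out[y] = row
    return out
```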
5 Summary
We presented and discussed a new class of open active contours with free boundary conditions, in which the end points of the open active curve are restricted to lie on two parametric boundary curves in the image plane. Being continuous and detached from the discrete and regular structure of the pixel array, these objects facilitate the extension of several applications in vision, they simplify the initialization procedure, and they provide a natural framework for sub pixel computation in the context of many applications.
Fig. 4. (a) An initial OAC (green) and the final convergence configuration (blue). (b) A result of seam carving using OAC applied for 50 iterations. (c) A closeup of the region of interest from panel a shows the dramatic change in the seam that occurred during the evolution from the traditional seam to the minimal energy OAC. (d) The same two close-up seams, shown on the energy map of the image, explaining why the OAC elected to evolve to this final result. (e-g) An original image and the result of carving out 30 and 55 vertical seams using OAC.
Acknowledgments. This work was funded in part by the European Commission in the 7th Framework Programme (CROPS GA no 246252). We also thank the generous support of the Frankel fund and the Paul Ivanier center for Robotics Research at Ben-Gurion University.
References 1. Kass, M., Witkin, A., Trezopoulos, D.: Snakes: Active contour models. Int. J. Comput. Vision 1, 321–331 (1988) 2. Berger, M.O., Mohr, R.: Towards autonomy in active contour models. In: ICPR, vol. 1, pp. 847–851 (1990) 3. Della Rocca, M.R., Fiani, M., Fortunato, A., Pistillo, P.: Active contour model to detect linear features in satellite images. Int. Arch. Photo. Remote Sens. Spat. Inf. Sci. 34 (2004) 4. Li, H., Shen, T., Smith, M.B., Fujiwara, I., Vavylonis, D., Huang, X.: Automated actin filament segmentation, tracking and tip elongation measurements based on open active contour models. In: IEEE Int. Symp. on Biomed. Imaging, vol. 15 (2009) 5. Williams, D.J., Shah, M.: A fast algorithm for active contours and curvature estimation. CVGIP 55 (1992) 6. Basu, S., Mukherjee, D.P., Acton, S.T.: Active contours and their utilization at image segmentation. In: Proceedings of Slovakian-Hungarian Joint Symposium on Applied Machine Intelligence and Informatics Poprad., pp. 313–317 (2007) 7. Velasco, F.A., Marroquin, J.L.: Robust parametric active contours: the sandwich snakes. Machine Vision and Applications, 238–242 (2001) 8. Xu, C., Prince, J.: Snakes, shapes, and gradient vector flow. IEEE Trans. Image Processing 7, 359–369 (1998)
9. Leymarie, F., Levine, M.D.: Tracking deformable objects in the plane using active contour model. IEEE Trans. Pattern Anal. Mach. Intell. 15 (1993) 10. Sawano, H., Okada, M.: Road extraction by snake with inertia and differential features. In: ICPR, vol. 4, pp. 380–383 (2004) 11. Srikrishnan, V., Chaudhuri, S.: Stabilization of parametric active contours using a tangential redistribution term. IEEE Trans. Image Processing 18, 1859–1872 (2009) 12. Cohen, L.D.: On active contour models and balloons. CVGIP 53, 218–221 (1991) 13. Saban, M.A., Altinok, A., Peck, A.J., Kenney, C.S., Feinstein, S.C., Wilson, L., Rose, K., Manjunath, B.S.: Automated tracking and modeling of microtubule dynamics. In: IEEE Int. Symp. on Biomed. Imaging, pp. 1032–1035 (2006) 14. Sagan, H.: Introduction to the Calculus of Variations. McGraw-Hill, Inc., New York (1969) 15. Cohen, L.D., Kimmel, R.: Global minimum for active contour models: A minimal path approach. Int. J. Comput. Vision 24, 57–78 (1999) 16. Melonakos, J., Pichon, E., Angenent, S., Tannenbaum, A.: Finsler active contours. IEEE Trans. Pattern Anal. Mach. Intell. 30, 412–423 (2008) 17. Gunn, S.R., Nixon, M.S.: Global and local active contours for head boundary extraction. Int. J. Comput. Vision 30, 43–54 (1998) 18. van Brunt, B.: The Calculus of Variations. Springer-Verlag New York Inc., New York (2004) 19. Amini, A., Tehrani, S., Weymouth, T.: Using dynamic programming for minimizing the energy of active contours in the presence of hard constraints. In: ICCV, pp. 95–99 (1988) 20. Velasco, F.A., Marroquin, J.L.: Growing snakes: active contours for complex topologies. Pattern Recognition 36, 475–482 (2003) 21. Basu, S., Mukherjee, D.P., Acton, S.T.: Implicit evolution of open ended curves. In: ICIP, pp. I:261–I:264 (2007) 22. Caselles, V., Catte, F., Coll, T., Dibos, F.: A geometric model for active contours in image processing. Numerische Mathematik 66, 1–31 (1993) 23. Malladi, R., Sethian, J.A., Vemuri, B.C.: Shape modeling with front propagation: a level set approach. IEEE Trans. Pattern Anal. Mach. Intell. 17, 158–175 (1995) 24. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. Int. J. Comput. Vision 22, 61–79 (1995) 25. Cerimele, M.M., Cinque, L., Cossu, R., Galiffa, R.: Coastline detection from SAR images by level set model. In: CAIP, pp. 364–373 (2009) 26. Fua, P., Leclerc, Y.G.: Model driven edge detection. Machine Vision and Applications 3, 45–56 (1990) 27. Kimmel, R., Bruckstein, A.: Regularized laplacian zero crossing as optimal edge integrators. Int. J. Comput. Vision 53, 225–243 (2003) 28. Gelfand, I., Fomin, S.: Calculus of Variations. Prentice-Hall, Inc., Englewood Cliffs (1963) 29. Lesage, F., Ganon, L.: Experimenting level-set based snakes for contour segmentation in radar imagery. In: Proc. of the SPIE, vol. 4041, pp. 154–162 (2000) 30. Blake, A., Isard, M.: Active Contours: The Application of Techniques from Graphics, Vision, Control Theory and Statistics to Visual Tracking of Shapes in Motion, 1st edn. Springer-Verlag New York, Inc., New York (1998) 31. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. ACM Trans. Graph. 26, 10 (2007) 32. Rubinstein, M., Shamir, A., Avidan, S.: Improved seam carving for video retargeting. ACM Trans. Graph. 27, 1–9 (2008)
Evolving Content-Driven Superpixels for Accurate Image Representation

Richard J. Lowe and Mark S. Nixon

School of Electronics and Computer Science, University of Southampton, UK
Abstract. A novel approach to superpixel generation is presented that aims to reconcile image information with superpixel coverage. It is described as content-driven as the number of superpixels in any given area is dictated by the underlying image properties. By using a combination of well-established computer vision techniques, superpixels are grown and subsequently divided on detecting simple image variation. It is designed to have no direct control over the number of superpixels as this can lead to errors. The algorithm is subject to performance metrics on the Berkeley Segmentation Dataset including: explained variation; mode label analysis, as well as a measure of oversegmentation. The results show that this new algorithm can reduce the superpixel oversegmentation and retain comparable performance in all other metrics. The algorithm is shown to be stable with respect to initialisation, with little variation across performance metrics on a set of random initialisations.
1 Introduction
Superpixels are a form of image oversegmentation that aim to represent the image in a more useful way. They are a way of agglomerating pixels into regions such that they retain the properties of pixels. This paper presents a new superpixel algorithm that offers high stability in addition to an accurate and efficient representation of images. The first use of superpixels by Ren and Malik [1] uses superpixels as a pre-processing stage to achieve segmentation of an image by matching superpixel boundaries with human labelled data. This formulation uses N-cuts graph segmentation [2] by recursively dividing the image into a pre-determined number of superpixels. In addition, Felzenszwalb and Huttenlocher [3] created an algorithm based on pairwise region comparisons that approaches linear speed. This does not require specification of superpixel initialisation, but it includes a parameter for tuning the attention to scale. More recently, Turbopixels [4], an approach based on level-sets [5], improves the speed of superpixel generation considerably. A fixed number of seeds is used that are dilated to obtain the superpixels. The seed placement is optimised to best extract the homogeneous regions, and overlap is prevented by way of a skeleton frame. Recent works include imposing a lattice structure on the creation of superpixel boundaries [6,7], thereby retaining pixel-like structure. This restriction on superpixel generation does not hinder the ability of the algorithm to extract well-defined superpixels. In addition, this was extended in [8]
to vary superpixel size in accordance with the scene information. This however is still limited by the requirement of the desired number of superpixels. It has been observed that many existing superpixel algorithms lack repeatability due to the choice of segmentation method [9]. Fundamentally, superpixels seek to represent an image in a reduced manner without loss of information. The pixel property that changes is colour and therefore it is of paramount importance that superpixel colour accurately reflects the region it covers. Superpixel algorithms are usually designed such that there is absolute control over the number of generated superpixels and, historically, algorithms are criticised for lacking such control. However, when selecting superpixel initialisations, changing the quantity of superpixels even slightly can lead to dramatic changes in the result [9].
Fig. 1. A comparison of CD superpixels with existing work using the same number of superpixels. (a) Superpixels generated using N-cuts. (b) Superpixels generated using CD superpixels.
This paper introduces a novel algorithm that adapts superpixel coverage to local structure. Superpixels are generated by growing a set of chosen image pixels into regions; dividing them if they become inhomogeneous with respect to colour. This is achieved without initialisation parameters. Figure 1 shows the difference in approach between N-cuts and our approach. While existing algorithms can control the number of superpixels, this control provides change on a linear scale. The result is that all regions will grow or shrink accordingly. This will inevitably lead to some regions being over or under represented. Our new algorithm is not designed to scale regions linearly, rather it is designed to produce varying sizes of superpixels dependent on the local image complexity. This method retains the detail in the original image whilst simultaneously reducing colour oversegmentation. As a direct consequence, high-stability can be demonstrated across a set of random initialisations. The term content-driven superpixels is used as the algorithm directly responds to image variation and the results reflect the local properties of an image without having to respond to initialisation parameters. For convenience, our new algorithm will henceforth be referred to as ‘CD superpixels’.
2 Growing Superpixels
Superpixels are evolved using a single pixel to initialise each superpixel. There are then two phases to superpixel evolution: growth and division. Growth incrementally increases the size of a superpixel, by adding new pixels to the edges of the superpixel. Image information is used within division, as ACWE (Active Contours Without Edges) considers whether to split a superpixel at each iteration. Splitting cannot always occur due to the distribution of pixels within the superpixel. Segmentation, and therefore the creation of new superpixels, is successful only if there is colour inhomogeneity within the superpixel. Mathematically, a superpixel is defined as the set of pixels over which it has grown. Growth adds new pixels to this set; division creates new superpixels by making new sets that correspond to the newly segmented pixels.
2.1 Growth
Superpixel growth is achieved using a distance transform [10] of every superpixel. This transforms each superpixel S such that the set of pixels at locations i, j within the superpixel displays the distance D to the background (in this case, the region in which superpixels have yet to form). Superpixel edges therefore have a distance of one from the background, as shown in Equation 1. A binary image I is used to calculate the distance transform, where True denotes that a superpixel covers this point in the image and False otherwise. The background is therefore the set of all False points. The same image is used to individually grow each superpixel:

$$D(i, j) = \min_{k,l:\, I(k,l)=\mathrm{False}} \sqrt{(i - k)^2 + (j - l)^2} \qquad (1)$$

$$S^{t+1} = S^{t} \cup \{(k, l): D(i, j) = 1\} \qquad (2)$$

Equation 2 shows the iteration, t, of the superpixel to include the background locations k, l adjacent to the superpixel edge pixels i, j. By only considering the pixels that have a distance of one from the background, this ensures cells do not overlap. Any pixel inside the superpixel or adjacent to another superpixel is not connected to the background and hence a superpixel cannot grow at these locations. As the distance transform is stable for any shape, the superpixels can grow from any initial size and shape. The stopping criterion is a direct consequence of the algorithm, terminating once the superpixels cannot grow any further. This occurs when they are completely bordered by other superpixels or image boundaries.
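A sketch of one growth pass over a label map is shown below. It uses SciPy's Euclidean distance transform to find the edge pixels of Eq. (1) and then claims the 4-adjacent background pixels in the spirit of Eq. (2); the label-map representation and the tie-breaking between competing superpixels are our simplifications.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def grow_step(labels):
    """One growth iteration: every background pixel (label 0) that is
    4-adjacent to a superpixel edge pixel is absorbed by that superpixel
    (if several superpixels compete, the first one visited wins)."""
    dist = distance_transform_edt(labels > 0)    # distance to the background
    grown = labels.copy()
    h, w = labels.shape
    for y, x in zip(*np.nonzero(dist == 1)):     # edge pixels, D(i, j) == 1
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and grown[ny, nx] == 0:
                grown[ny, nx] = labels[y, x]
    return grown
```

Iterating grow_step until the label map stops changing mirrors the stopping criterion described above.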
2.2 Division
Superpixel division could theoretically use any region-based segmentation technique. In this case, the works of Chan and Vese [11,12] on ACWE segmentation are used. This segmentation algorithm cannot usually be used in low-level segmentation as it only separates an object from the background. However in this case, as it is required to split the superpixel into two new regions, this technique
is ideal. A further benefit is the addition of localised smoothing introduced by the approach. This helps to restrict superpixel division, a necessary requirement due to the greedy nature of the algorithm. ACWE is an approximation of the Mumford–Shah functional [13], treating the functional as a sum of piecewise constant functions. This allows segmentation of an image by minimising the functional F into two optimally separated regions: one inside the contour and one outside. The problem can then be represented in an energy minimisation framework. In this application, the problem is reduced further by considering only a subset of the image: the area within the superpixel. Considering these smaller regions makes the problem tractable as an iterative algorithm. Chan–Vese segmentation of superpixels works by considering two regions u, v that form the positive and negative parts of a signed distance function, $\Omega_D$. A force F iteratively updates the distance function (Equation 3) such that each pixel is 'moved' toward the region it best matches by computing the distance to the mean of the regions u, v (Equation 4). The new superpixels, $C_u$, $C_v$, are taken to be the positive and negative parts of the newly formed distance function $\Omega_D'$, as in Equation 7:

$$\Omega_D' = \Omega_D \cdot F \qquad (3)$$

The vector form of this method is given in Equation 4, where $\Omega(x, y)$ is defined as the set of pixels within the superpixel. This form of the equation deals with RGB colour images, where the segmentation criterion of a region u, v is given as the average of the means ($u_i$, $v_i$) of each of the N colour channels $I_i$, as in Equations 5 and 6. Division occurs if there is a significant difference in any of the colour channels.

$$F = \int_\Omega \frac{1}{N}\sum_{i=1}^{N} |I_i(x, y) - u_i|^2\, dx\, dy - \int_\Omega \frac{1}{N}\sum_{i=1}^{N} |I_i(x, y) - v_i|^2\, dx\, dy \qquad (4)$$

$$u_i = \frac{\int_\Omega I_i(x, y)\, dx\, dy}{\int_\Omega \Omega(x, y)\, dx\, dy}, \quad \forall\, \Omega_D(x, y) > 0 \qquad (5)$$

$$v_i = \frac{\int_\Omega I_i(x, y)\, dx\, dy}{\int_\Omega \Omega(x, y)\, dx\, dy}, \quad \forall\, \Omega_D(x, y) \le 0 \qquad (6)$$

$$C_u = \{(x, y): \Omega_D(x, y) > 0\}, \qquad C_v = \{(x, y): \Omega_D(x, y) \le 0\} \qquad (7)$$
To retain the property of spatial connectivity, there is one final problem to overcome. It is quite common in variational segmentation for the result to be spatially separated. In fact, it is often a desired result of the segmentation. However, this leads to problems when attempting to grow a superpixel that is not completely connected. Therefore, any parts of the segmentation that are not connected are allocated to separate superpixels using connected component labelling [14].
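A compact sketch of the division step: a plain iterative region reassignment in the spirit of the vector-valued criterion above (the localised smoothing term is omitted), followed by scipy's connected component labelling to enforce spatial connectivity. The function, its left/right initialisation and the iteration count are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.ndimage import label

def divide_superpixel(image, mask, iterations=50):
    """Split one superpixel into spatially connected sub-regions.
    image: H x W x N float array (N colour channels); mask: boolean superpixel support."""
    ys, xs = np.nonzero(mask)
    pixels = image[ys, xs, :].astype(float)                 # M x N colour samples
    # Signed "distance" values over the superpixel; any non-degenerate split works
    # as an initialisation, here a simple left/right split along x.
    omega = np.where(xs < xs.mean(), 1.0, -1.0)

    for _ in range(iterations):
        pos, neg = omega > 0, omega <= 0
        u = pixels[pos].mean(axis=0) if pos.any() else pixels.mean(axis=0)
        v = pixels[neg].mean(axis=0) if neg.any() else pixels.mean(axis=0)
        # Pointwise force: positive where a pixel matches region u better than v.
        omega = ((np.abs(pixels - v) ** 2).mean(axis=1)
                 - (np.abs(pixels - u) ** 2).mean(axis=1))

    # Map both sides back to the image grid and enforce spatial connectivity.
    regions = []
    for side in (omega > 0, omega <= 0):
        side_mask = np.zeros_like(mask)
        side_mask[ys[side], xs[side]] = True
        labelled, n = label(side_mask)                       # connected component labelling [14]
        regions.extend(labelled == k for k in range(1, n + 1))
    return regions
```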
2.3 Control
The algorithm is designed not to require controlling parameters as the idea is to create the best possible reconstruction of the image. However, one can still influence
the result by smoothing the image using a simple Gaussian filter with standard deviation σ. The larger image variations are still detected, though finer details are lost, as expected.
3 Method of Analysis
The results generated in this paper are compared against those of [3] and [1] as these are well-established techniques. As our new algorithm does not directly control the superpixels, all comparisons are achieved by using the output from CD superpixels to specify the equivalent parameters in the other algorithms. To assess the quality of our algorithm, results are generated on each image using varying levels of Gaussian smoothing. This allows a comparison to be drawn as the number of superpixels changes. The results come from twenty images of the test set from the Berkeley Segmentation Dataset (BSDS) [15]. Twenty images were deemed sufficient to show the attributes of the technique at this early stage. BSDS includes human-segmented annotations of the original images, typically five for each image. Given that the labels are much larger than a typical superpixel, each label will contain multiple superpixels. There are three things to consider in evaluation: the accuracy, efficiency, and stability of the algorithm. The compression ratio c used here is defined in Equation 8 as the ratio of P pixels in the original image to S superpixels in the result. A low compression ratio indicates many superpixels are generated.

c = \frac{P}{S}    (8)

3.1 Accuracy
Superpixel approaches aim to reduce the number of pixels without losing any image information. As this paper focuses on assessing superpixel quality by homogeneity criteria, evaluation is focused on the ability to reconstruct an image using the generated superpixels. Mode label analysis is introduced in order to identify object undersegmentation error: the average proportion of each superpixel that matches the modal annotated human label using BSDS. However, the human labelling used for mode analysis is subjective. Labels are drawn on the assumption that they are equally important to the user. While this is somewhat accounted for by the averaging, some important image information is ignored. For this reason, superpixel quality is not exclusively assessed by metrics on BSDS. To overcome this problem, the 'explained variation' [7] is also calculated, providing a measure of superpixel accuracy independent of the human-labelled edges in BSDS. It measures how well the mean of each superpixel matches the pixels within it by calculating the variation about the global mean. This is given in Equation 9, where x_i represents the pixel value, μ_i is the mean of the superpixel containing the pixel x_i and μ is the global mean of the image.
R^2 = \frac{\sum_i (\mu_i - \mu)^2}{\sum_i (x_i - \mu)^2}    (9)
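A small sketch of this measure, assuming the segmentation is available as an integer label image; the function and variable names are assumptions for illustration.

```python
import numpy as np

def explained_variation(image, labels):
    """Explained variation R^2 (Equation 9) of a superpixel segmentation.
    image: H x W (or H x W x C) array; labels: H x W integer superpixel ids."""
    x = image.reshape(labels.size, -1).astype(float)        # per-pixel values
    mu = x.mean(axis=0)                                      # global mean
    flat = labels.ravel()
    counts = np.maximum(np.bincount(flat), 1)
    means = np.stack([np.bincount(flat, weights=x[:, c]) / counts
                      for c in range(x.shape[1])], axis=1)
    mu_i = means[flat]                                       # mean of the superpixel containing each pixel
    return ((mu_i - mu) ** 2).sum() / ((x - mu) ** 2).sum()
```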
3.2 Efficiency
Firstly, it is necessary to define under- and oversegmentation in this context. Undersegmentation occurs if a region does not contain an adequate number of superpixels to represent the colour information of that region, regardless of whether that region is an object of interest or not. Oversegmentation can be similarly defined as a region that contains too many superpixels to represent the colour information of the region. The result of this perspective is that an image object can be considered oversegmented with regard to the object (as is the aim of a superpixel algorithm) yet, with regard to colour, the object could be represented in the optimum number of superpixels. It is this colour oversegmentation which is implicitly minimised by CD superpixels, as the superpixels are generated as required by the image. Explained variation, described in Section 3.1, provides the measure of colour undersegmentation. If the average colour of a superpixel does not accurately represent the pixels within it, that region is undersegmented. In addition to colour undersegmentation, oversegmentation must also be considered. Oversegmentation can be considered an inherent consequence of superpixel algorithms. Quantifying oversegmentation helps to illustrate the ability of superpixels to represent an image in as few superpixels as possible without allowing undersegmentation. Colour oversegmentation can be measured by the average Euclidean distance between the mean colour values of connected superpixels. If this value is low, then the average distance between superpixels is small, and therefore superpixels are less distinct in colour, implying oversegmentation. This measure is given in Equation 10, where (r, g, b) ∈ [0, 1] represent the colours of connected superpixels i, j, C represents the total number of superpixel connections and c represents a single connection.

d = \frac{\sum_{c} \sqrt{(r_i - r_j)^2 + (g_i - g_j)^2 + (b_i - b_j)^2}}{C}    (10)
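A sketch of this measure under the same label-image assumption as before; the 4-connectivity used to define superpixel connections and the function names are assumptions for illustration.

```python
import numpy as np

def superpixel_adjacency(labels):
    """Set of (i, j) pairs of 4-adjacent superpixels in an integer label image."""
    pairs = set()
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        diff = a != b
        pairs |= {tuple(sorted(p)) for p in zip(a[diff], b[diff])}
    return pairs

def colour_oversegmentation(mean_colours, adjacency):
    """Average colour distance d between connected superpixels (Equation 10).
    mean_colours: dict id -> (r, g, b) with components in [0, 1]."""
    dists = [np.linalg.norm(np.subtract(mean_colours[i], mean_colours[j]))
             for i, j in adjacency]
    return float(np.mean(dists)) if dists else 0.0
```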
3.3 Stability
As shown in Figure 2, even though the output can be controlled using initialisation, the reconstruction is largely unaffected despite the difference in resulting superpixels. In addition, all results are similarly affected by smoothing the image. For this reason, the initialisation will not be changed while testing the effects of smoothing. However, the invariance to initialisation is tested. This invariance is tested on a single image using random Gaussian perturbations of the grid pattern at increasing levels of standard deviation.
Fig. 2. Illustrating the difference between two different initialisations arranged in an evenly spaced grid. (a) Initialisation A: 9 superpixels; result: 8871 superpixels. (b) Reconstruction using initialisation A. (c) Initialisation B: 36 superpixels; result: 7248 superpixels. (d) Reconstruction using initialisation B. Despite the difference in the initialisation the reconstruction is hardly affected.
4 Results

4.1 Accuracy
Figure 3 illustrates two metrics: explained variation and mode label. As Figures 3(a) and 3(b) show, the explained variation and modal label accuracy of CD superpixels are similar to those of the best result, but only for c < 500. As c increases, the Gaussian smoothing in CD superpixels reduces the accuracy of the algorithm. Smoothing will begin to blur image boundaries and inherently reduce the quality of the result.
Fig. 3. Showing how mode label and explained variation vary as a function of compression
4.2 Efficiency
Figure 4 shows the relationship between colour oversegmentation and undersegmentation. Here, the best result will be obtained by maximising the value on both axes. Both the N-cuts and Felzenszwalb approaches show, to varying degrees, that as the explained variation of the result increases (how well the image is reconstructed by superpixels), the colour oversegmentation increases. CD superpixels remains almost constant, irrespective of the quality of reconstruction. This means that a high
Fig. 4. Showing the difference in colour between superpixel neighbours as a function of explained variation
quality result using CD superpixels does not imply oversegmentation as occurs in other algorithms. This is not unexpected as other algorithms design superpixels to be of similar size, meaning that large regions of one colour will contain the same number of superpixels as a much more complex region.

4.3 Stability
Figure 5 shows the results of perturbing the initialisation by a random Gaussian variable of increasing variance. Explained variation and modal label vary very
Fig. 5. Showing how mode label and explained variation vary as the initialisation is perturbed by a Gaussian random variable of standard deviation σ.
little, having standard deviations of 0.3 and 0.01 respectively. Colour difference is not assessed as the previous experiment has shown it is almost constant. The results in Figure 5 show little instability in the algorithm. As CD superpixels are parameterised on image properties, the number of superpixels is forced to adhere to the image and consequently, there is little possibility of the initial superpixel arrangement affecting the result. This helps to resolve one major problem of superpixel algorithms: that they are unstable due to initialisation parameters. Figure 6 shows additional results from these tests.
Fig. 6. Results on selected images from BSDS
5 Conclusions
This paper has demonstrated that by parameterising superpixels not by number but by complexity, superpixel stability can be improved. This overcomes one of the notable criticisms of superpixels. This paper has also introduced the concept of explicitly measuring oversegmentation to better evaluate superpixels. This provides an insight into superpixel quality that has previously been overlooked. This metric of superpixel efficiency has shown that while existing algorithms perform well on undersegmentation, they sacrifice efficiency by oversegmenting large regions of the image. In contrast, CD superpixels has been shown to provide a constant measure of oversegmentation throughout. This can be attributed to the use of ACWE as a splitting algorithm. As ACWE uses colour differences to divide, and almost all neighbouring superpixels have occurred through division, the constant difference in colour must be attributed to how ACWE separates a region. The benefit of this is improved stability when generating superpixels. However, despite the fact that this paper goes some way towards providing a more rigorous analysis of superpixels, there is no method that will directly test superpixel density as a function of image complexity. Evaluating the complexity of the image would allow the superpixel density to be evaluated, i.e. how many superpixels should be in any given area. This would allow for better evaluation of superpixel algorithms in the future. The results of CD superpixels can suffer at high compression due to the Gaussian smoothing used to control the algorithm. It may be possible to reduce the error by using a more robust smoothing algorithm such as anisotropic diffusion [16], which provides smoothing while retaining image edges.
The uses of superpixels remain to be truly explored. The ability to reduce image complexity whilst retaining high-level features is highly desirable in many areas of computer vision. A potential area to explore would be the invariance properties of superpixels. As the superpixels generated here are not fixed in number or size, this can be justifiably explored. For example, does halving the image size correspond to half the number of superpixels, or can this algorithm produce the same number of regions in equivalent locations? This property would make CD superpixels highly useful in object recognition.
References

1. Ren, X., Malik, J.: Learning a classification model for segmentation. In: IEEE Proc. Computer Vision, pp. 10–17 (2003)
2. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. PAMI 22, 888–905 (2000)
3. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. IJCV 59, 167–181 (2004)
4. Levinshtein, A., Stere, A., Kutulakos, K.N., Fleet, D.J., Dickinson, S.J., Siddiqi, K.: TurboPixels: Fast Superpixels Using Geometric Flows. IEEE Trans. PAMI 31, 2290–2297 (2009)
5. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic active contours. IJCV 22, 61–79 (1997)
6. Moore, A.P., Prince, S.J.D., Warrell, J.: "Lattice Cut" - Constructing superpixels using layer constraints. In: IEEE CVPR, pp. 2117–2124 (2010)
7. Moore, A.P., Prince, S.J.D., Warrell, J., Mohammed, U., Jones, G.: Superpixel lattices. In: IEEE CVPR, pp. 998–1005 (2008)
8. Moore, A.P., Prince, S.J.D., Warrell, J., Mohammed, U., Jones, G.: Scene shape priors for superpixel segmentation. In: ICCV, pp. 771–778 (2009)
9. Tuytelaars, T., Mikolajczyk, K.: Local Invariant Feature Detectors: A Survey. Foundations and Trends in Computer Graphics and Vision 3, 177 (2007)
10. Borgefors, G.: Distance transformations in digital images. Computer Vision, Graphics, and Image Processing 34, 344–371 (1986)
11. Chan, T., Sandberg, B., Vese, L.: Active Contours without Edges for Vector-Valued Images. Visual Communication and Image Representation 11, 130 (2000)
12. Chan, T.F., Vese, L.A.: Active Contours Without Edges. IEEE Trans. Image Processing 10 (2001)
13. Mumford, D., Shah, J.: Optimal approximations by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math. 42, 577–685 (1989)
14. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice Hall, Englewood Cliffs (2001)
15. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics. In: Proc. 8th Int'l Conf. Computer Vision, pp. 416–423 (2001)
16. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. PAMI 12, 629–639 (1990)
A Parametric Active Polygon for Leaf Segmentation and Shape Estimation

Guillaume Cerutti1,2, Laure Tougne1,2, Antoine Vacavant3, and Didier Coquin4

1 Université de Lyon, CNRS
2 Université Lyon 2, LIRIS, UMR5205, F-69676, France
3 Université d'Auvergne, ISIT, F-63000, Clermont-Ferrand
4 LISTIC, Domaine Universitaire, F-74944, Annecy le Vieux
Abstract. In this paper we present a system for tree leaf segmentation in natural images that combines a first, unrefined segmentation step, with an estimation of descriptors depicting the general shape of a simple leaf. It is based on a light polygonal model, built to represent most of the leaf shapes, that will be deformed to fit the leaf in the image. Avoiding some classic obstacles of active contour models, this approach gives promising results, even on complex natural photographs, and constitutes a solid basis for a leaf recognition process.
1 Introduction
Over the past years of progress and urbanization, the world of plants has lost the part it formerly had in our everyday life, and the names and uses of the many trees, flowers and herbs that surround us now constitute a knowledge accessible only to botanists. But nowadays, with the rising awareness that plant resources and diversity ought to be treasured, the will to regain some touch with nature never felt so present. And what better tool to achieve this than the ubiquitous mobile technology, which has the opportunity of placing a flora book in everyone's pocket? Botanists traditionally rely on the aspect and composition of fruits, flowers and leaves to identify species. But in the context of a widespread, non-specialist-oriented application, the predominant use of leaves, which are possible to find almost all year long, simple to photograph, and easier to analyze, is the most sensible and widely used approach in image processing. Considering the shape of a leaf is then the obvious choice to try to recognize the species. Starting with tree leaves, which are the easiest to spot, our goal is to build a system to retrieve relevant geometric criteria to classify a leaf in a photograph taken by a smartphone camera. In this paper we present a method to achieve the first part of this process, estimating the general shape of a leaf that will guide the segmentation process. In Section 2 we present the related publications. Section 3 expounds the model used to represent the general shape of a leaf, and Section 4 the so-called parametric
This work has been supported by the French National Agency for Research with the reference ANR-10-CORD-005 (REVES project).
active polygon segmentation algorithm we use. In Section 5, we show how this system could be used for leaf shape classification. Section 6 reports the results of our experiments.
2 Related Work
Plant recognition has recently been a subject of interest for various works. Few of them, though, consider acquiring images of leaves or flowers in a complex, natural environment, thus avoiding most of the hard segmentation task.

2.1 Plant Recognition and Segmentation
Nilsback and Zisserman [1] addressed the problem of segmenting flowers in natural scenes, by using a geometric model, and classifying them over a large number of classes. Saitoh and Kaneko [2] also focus on flowers, but in images with hard constraints on an out-of-focus background. Such approaches are convenient for flowers, but lose much of their efficiency with leaves. Many works on plant leaf recognition tend to avoid the problem by using a plain sheet of paper to make the segmentation easy as pie. Their recognition systems are then based on either statistical or geometric features: Centroid-Contour Distance (CCD) curve [3], moments [4,3], histogram of gradients, or SIFT points [1]. Some more advanced statistical descriptors, such as the Inner Distance Shape Context [5], or the Curvature Scale Space representation [6] that allows taking self-intersections into account, have also been applied to the context of leaf classification, while developed for a general purpose. As a matter of fact, isolating green leaves in an overall no less green environment seems like a much tougher issue, and only some authors have designed algorithms to overcome the difficulties posed by a natural background. Teng, Kuo and Chen [7] used 3D point reconstruction from several different images to perform a 2D/3D joint segmentation using 3D distances and color similarity. Wang [4] performed an automatic marker-based watershed segmentation, after a first thresholding-erosion process. All these approaches are complex methods that seem hardly reachable for a mobile application. In the case of weed leaves, highly constrained deformable templates have been used [8] to segment one single species, providing good results even with occlusions and overlaps.

2.2 Active Contour Models
The concept of active contours, or snakes, was introduced by Kass, Witkin and Terzopoulos [9] as a way to solve problems of edge detection. They are splines that adjust to the contours in the image by minimizing an energy functional. This energy is classically composed of two terms, an internal energy term considering the regularity and smoothness of the desired contour, and an external or image energy accounting for its agreement with the actual features in the image, based on the intensity gradient.
To detect objects that are not well defined by gradient, Chan and Vese [10] for instance based the evolution of their active contour on the color consistency of the regions, using a level set formulation. Another region-based approach, relying this time on texture information, was proposed by Unal, Yezzi and Krim [11], who also introduced a polygonal representation of the contour. But to include some knowledge about complex objects, a deformable template [12] can be used, with the advantage of lightening the representation and storage of the contour through the use of parameters. Felzenszwalb [13] represents shapes by deformable triangulated polygons to detect precisely described objects, including maple leaves, and Cremers [14] includes shape priors into level set active contours to segment a known object, but both approaches lack the flexibility needed to include knowledge about the shape of any possible leaf.
Fig. 1. Segmentation issues with classic region-based active contour model
In our case, the similarity between the background and the object of interest, and the difficulty of avoiding adjacent and overlapping leaves, constitute a prohibitive obstacle to the use of unconstrained active contours (Figure 1). The idea of using a template to represent the leaves is complicated by the fact that there is much more variety in shapes than for eyes or mouths. The only solution to overcome the aforementioned problems is, however, to take advantage of the prior knowledge we may have on leaf shapes to design a very flexible, time-efficient model to represent leaves.
3 A Polygonal Model for Simple Tree Leaves
Despite the large variety in leaf shapes, even when considering only trees, it is necessary to come up with some kind of template that can rather accurately be fitted to basically any kind of leaf. As a first step, we consider only trees with simple, non-palmate leaves, which represent about 80% of French broad-leaved tree species [15] and roughly the same proportion among all European species.

3.1 Describing Botanic Leaf Shapes
Botanists have a specialized vocabulary to describe leaf shapes, among which around 15 terms are used to characterize the global shape of a leaf. Examples of such shapes are displayed in Figure 2. The interesting point is that these terms are also the ones used in flora books to describe the lobes of palmate leaves and
Fig. 2. Examples of leaf shapes used by botanists: Lanceolate, Ovate, Oblong, Obovate, Cordate and Spatulate. Images taken from [15].
the leaflets of compound leaves, making them the base element to describe all the leaf shapes. Knowing this general shape is a key component of the process of identifying a leaf, though not sufficient on its own. However, given the high variability within a single species, floras often reference a species with two or three leaf shapes, making this notion overall blurry. This impression is reinforced by the difficulty of seeing where exactly one denomination ends and the other begins. It is nevertheless possible to derive a basic model to sketch all the shapes used in botany. The idea is then to use a light polygonal model, suitable for a time- and space-efficient mobile implementation. It is only assumed to have a vertical axis of symmetry joining the base and the tip, and from this, by playing on a set of parameters, one should be able to describe the whole set of shape types. This kind of representation has two main advantages. First, the parameters that define the construction of the model can later be directly used as descriptors for the leaf, and using such a representation in segmenting leaves is a way to create a cooperation between the two classically unconnected processes of segmentation and feature extraction. Additionally, working on numeric values allows us to slightly reduce the uncertainty produced by the hard classification of natural objects, which intrinsically bear a part of variability, into not-well-defined botanical terms.

3.2 The Parametric Polygonal Leaf Template
The chosen model relies on two points, base B and tip T, that define the main axis of the leaf. From this axis, we construct the 10 points defining the polygon, using 4 numeric integer parameters:

– αB, the opening angle at the base
– αT, the opening angle at the tip
– w, the relative maximal width
– p, the relative position where this width is reached
The model is then built as follows. In the next steps, we will call a the direction vector of the line segment [BT], n its normal vector, and h its length, so that \vec{BT} = h \cdot a. The construction of the model is illustrated in Figure 3.

1. The center point C is defined as \vec{BC} = p \cdot \vec{BT}, and perpendicularly to the main axis is drawn the segment of maximal width [C_l C_r], centered on C. In other terms, we have \vec{CC_l} = \frac{w}{2} h \cdot n and \vec{CC_r} = -\frac{w}{2} h \cdot n.
Fig. 3. Building the leaf model in 4 steps
2. At the points C_l and C_r we trace two tangents, whose lengths depend on the height and the parameter p, defining 4 new points C_{lB}, C_{lT}, C_{rB} and C_{rT}. For instance, \vec{C_l C_{lT}} = (1 - 2|p - 0.5|) \frac{h}{5} \cdot a and \vec{C_l C_{lB}} = -(1 - 2|p - 0.5|) \frac{h}{5} \cdot a, and the same is done for the right tangent.

3. At the tip and at the base, we build two isosceles triangles, with top angles of αT and αB respectively and with equal sides of fixed length relative to the height. This creates 6 additional points T_l, T_r and T_T for the tip and B_l, B_r and B_B for the base. The points at the tip are defined by \vec{T T_l} = \frac{h}{5} \sin\frac{\alpha_T}{2} \cdot n, \vec{T T_r} = -\frac{h}{5} \sin\frac{\alpha_T}{2} \cdot n and \vec{T T_T} = \frac{h}{5} \cos\frac{\alpha_T}{2} \cdot a, and the points at the base are defined identically.

4. Finally, the polygon is obtained by linking the 10 points B_B, B_l, C_{lB}, C_{lT}, T_l, T_T, T_r, C_{rT}, C_{rB} and B_r.

This model proves to cover, with a very restrained set of parameters, the major part of the shapes used by botanists to describe leaves. It is then possible to adjust the parameters to recreate models that visibly correspond to most of the archetypal leaf shapes, as depicted in Figure 4. Other parameters could have been added to account for deformations that occur in real images, but this would have compromised the lightness of the model.
Fig. 4. Main leaf shapes and their corresponding hand-tuned models
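A sketch of this construction in code is given below; 2D coordinates and angles in radians are assumed, and the mirrored definition of the base points Bl, Br, BB (analogous to the tip) is an assumption, since the text only states that they are defined identically.

```python
import numpy as np

def leaf_polygon(B, T, alpha_B, alpha_T, w, p):
    """Build the 10-point polygonal leaf template from base B, tip T and the
    four shape parameters (angles in radians, w and p relative to the height)."""
    B, T = np.asarray(B, float), np.asarray(T, float)
    BT = T - B
    h = np.linalg.norm(BT)
    a = BT / h                                  # main axis direction
    n = np.array([-a[1], a[0]])                 # its normal

    C = B + p * BT                              # centre of the maximal-width segment
    Cl, Cr = C + 0.5 * w * h * n, C - 0.5 * w * h * n

    t = (1.0 - 2.0 * abs(p - 0.5)) * h / 5.0    # tangent half-length at Cl and Cr
    ClT, ClB = Cl + t * a, Cl - t * a
    CrT, CrB = Cr + t * a, Cr - t * a

    s = h / 5.0                                 # side length of the tip/base triangles
    Tl = T + s * np.sin(alpha_T / 2) * n
    Tr = T - s * np.sin(alpha_T / 2) * n
    TT = T + s * np.cos(alpha_T / 2) * a
    Bl = B + s * np.sin(alpha_B / 2) * n        # base points mirrored from the tip (assumption)
    Br = B - s * np.sin(alpha_B / 2) * n
    BB = B - s * np.cos(alpha_B / 2) * a

    # Linking order of the 10 points (step 4).
    return np.array([BB, Bl, ClB, ClT, Tl, TT, Tr, CrT, CrB, Br])
```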
4 Parametric Active Polygon
As mentioned earlier, using a deformable template approach based on this model has the advantage of encapsulating some prior knowledge on the shape of the object in the very definition of the model. It is also a great benefit in terms of computation time to have a parametric representation, as there will be only a few elementary deformations, making the exploration of the possible variations easier.
4.1 Elementary Deformations
The values that can be changed to deform the model are very few, since the considered polygon is entirely defined by the 2 points defining the main axis and the 4 integer parameters. Consequently, a given model will have only 16 neighbours in the model space, corresponding to a variation in one direction or the other of a parameter (8 deformations) and the move of a point to one of its neighbours in the image in the sense of 4-connectivity (8 deformations). This results in a very limited set of elementary variations to be considered when deforming the model, compared with a classic active contour that can be moved pointwise locally.

4.2 Representing the Color
The polygonal model being an approximation of a shape, we cannot expect it to fit edges in the image. That is why we use a region-based approach, using color information inside the region rather than gradient information on its contour. But formulating an a priori model for color that would be accurate whatever the leaf, season and lighting is likely an impossible task. It is therefore an absolute necessity to learn the particular model for the leaf we want to segment from the image. And this can obviously only be achieved with some constraints on the position of the leaf, that will allow us to know where to evaluate a color model. In the following, we assume that the leaf is roughly centered and vertically oriented, so that an initialization of our template in the middle of the image contains almost only leaf pixels. Typically, a leaf in an image taken on purpose by a user will be large enough to ensure the correctness of the initialization. We considered then that the color of a leaf can be modelled by a 2-component GMM estimated in the initial region, accounting for shaded and lighted or shiny areas, and defined by the parameters (μ_1, σ_1^2, α_1) and (μ_2, σ_2^2, α_2), respectively the mean, variance and weight of each Gaussian distribution. Then the distance of a pixel x to the color model, in a 3-dimensional colorspace, is defined by a Mahalanobis 1-norm distance, considering the variance matrices are diagonal, written as follows:

d(x, \mu_1, \sigma_1, \mu_2, \sigma_2) = \min_{g=1,2} \sum_{i=1}^{3} \frac{|x_i - \mu_{g,i}|}{\sigma_{g,i}}

Based on this formulation, we can compute for every pixel in the image its distance to the color model, resulting in a distance map, where leaf pixels should appear in black and background pixels in gray-white. Based on the aspect of this map for different leaves and different color spaces, we chose to work in the Lab colorspace, for which leaves were standing out best in distance maps.
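A sketch of the colour model and distance map, using scikit-learn's diagonal-covariance GMM and scikit-image's Lab conversion as stand-ins for whatever implementation the authors used; the mixture weights α1, α2 are fitted but, as in the formula above, not used in the distance.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from skimage.color import rgb2lab

def leaf_distance_map(rgb_image, init_mask):
    """Distance of every pixel to a 2-component colour model of the leaf,
    learned from the pixels under the initial template (boolean init_mask)."""
    lab = rgb2lab(rgb_image)                             # work in the Lab colourspace
    samples = lab[init_mask].reshape(-1, 3)

    gmm = GaussianMixture(n_components=2, covariance_type='diag').fit(samples)
    mu, sigma = gmm.means_, np.sqrt(gmm.covariances_)    # (2, 3) means and std-devs

    pixels = lab.reshape(-1, 3)
    # Mahalanobis 1-norm distance to each component, keep the closest one.
    d = np.abs(pixels[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    return d.sum(axis=2).min(axis=1).reshape(rgb_image.shape[:2])
```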
4.3 Energy Functional
The internal energy term that classically appears in the energy formulation can be considered as implicit here, as it is included in the construction rules of our model, giving it already some kind of rigidity. The remaining term is the external energy, representing the image forces, and it will be based on the distance map
discussed above, as we ultimately want the final polygon to contain only pixels with as small a distance to the leaf color model as possible. The energy the model strives to minimize throughout its evolution is then expressed as:

E(M) = \sum_{x \in M} \left( d(x, \mu_1, \sigma_1, \mu_2, \sigma_2) - d_{max} \right)    (1)
The value d_max actually represents a balloon force that will push the model to grow as much as possible. It can also be seen as a threshold: the distance to the color model for which a pixel becomes costly to add to the polygon. The expected outcome is to produce the biggest region with as few color-distant pixels as possible. At each step, the algorithm will then select, among the 16 possible elementary variations, the one that leads to the most important decrease of the energy, until no deformation can bring it any lower. As such, this method has the major drawback of getting stuck in local energy minima that most certainly do not correspond to the searched leaf. To get around this problem, we use a heuristic close to simulated annealing.
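A sketch of the greedy step under stated assumptions: the model is a plain dictionary of the two points and four integer parameters, unit parameter steps are assumed, rasterise() and is_leaf_shape() are placeholders for the polygon filling and the parameter-range check of Section 4.4, and the simulated-annealing-like escape heuristic mentioned above is omitted.

```python
def evolve_template(model, distance_map, d_max, is_leaf_shape, rasterise):
    """Greedy minimisation of E(M) = sum_{x in M} (d(x) - d_max) over the 16
    elementary deformations. model: dict with keys 'B', 'T' (points) and
    'alpha_B', 'alpha_T', 'w', 'p' (integer parameters)."""
    def energy(m):
        return float((distance_map[rasterise(m)] - d_max).sum())

    def neighbours(m):
        for key in ('alpha_B', 'alpha_T', 'w', 'p'):        # 8 parameter deformations
            for sign in (+1, -1):
                n = dict(m); n[key] = m[key] + sign
                yield n
        for key in ('B', 'T'):                              # 8 point moves (4-connectivity)
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                n = dict(m); n[key] = (m[key][0] + dx, m[key][1] + dy)
                yield n

    current, e_current = model, energy(model)
    while True:
        candidates = [(energy(n), n) for n in neighbours(current) if is_leaf_shape(n)]
        if not candidates:
            return current
        e_best, best = min(candidates, key=lambda t: t[0])
        if e_best >= e_current:                             # no deformation lowers the energy
            return current
        current, e_current = best, e_best
```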
4.4 Evolution Constraints
Even if the polygonal template models most of the leaf shapes quite accurately, it has the unwanted counterpart of being slightly too flexible, and thus able to model shapes that are not likely to ever be reached by leaves. This has negative consequences when the model is left to evolve freely, as it will undergo any possible deformation to minimize its energy, often resulting in unnatural, yet surprising, optimal shapes. It is then necessary to constrain the evolution of the template to make sure it keeps an acceptable leaf shape. To achieve this, the model is forced to have its parameters within an authorized range. This range has been learned by manually adjusting the model to leaves from a database1. A template can then be classified as leaf or non-leaf by estimating its distance to the empiric leaf model in the parameter space. Only the variations that give a leaf template are then examined in the evolution process, ensuring the final polygon will show a satisfactory shape. Figure 5 shows sample results of the whole process.
5 Shape Classification
To measure the performance of our approach more accurately, we use the parameters returned by the algorithm as descriptors to estimate the global leaf shape. It is important to keep in mind though that the final goal is to recognize a plant species, and that the numeric parameter values convey more information than a rigid classification into rather unclear terms. These constitute actually
1 LEAF - Tree Leaf Database, Inst. of Information Theory and Automation ASCR, Prague, Czech Republic, http://zoi.utia.cas.cz/tree_leaves
Fig. 5. Initialization, distance map and resulting polygon on two natural images
global descriptors that will later be combined with local geometric features and other heterogeneous data such as date or GPS coordinates to recognize species. This shape classification step will consequently not even be a middle step in our process, but it appeared as the best way to evaluate the behaviour of our algorithm.

5.1 Learning Leaf Shapes
The first step is to learn the different classes of leaf shapes and the parameter values they correspond to. As discussed earlier, the problem is that botanic classification is a very subjective and uncertain task, the boundaries between classes being vaguely defined, and a tree species impossible to associate with a single class. It is then judicious to let an almost unsupervised learning algorithm determine the best classification, based on the set of parameters manually extracted from the leaf database1. This is simply performed by a k-Means clustering algorithm that classifies points in the parameter space, and returns k classes, each one represented by its centroid and its variance for each parameter. The value of k that produced the best, visibly identifiable, models was k = 10. The clustering was, however, guided by initializing the centroids on hand-designed leaf shape archetypes, but was then left free to evolve. This way, we obtain ten canonical models, displayed in Figure 6, that can later be labeled by the name of the botanic class(es) they match best.
Fig. 6. The 10 models representing the classes of leaf shapes used for classification
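A brief sketch of this learning step; scikit-learn's k-means, seeded with the hand-designed archetypes and then left free to move, stands in for the authors' implementation, and the array layout is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_shape_classes(manual_params, archetypes, k=10):
    """manual_params: N x 4 array of (alpha_B, alpha_T, w, p) vectors adjusted by
    hand on the leaf database; archetypes: k x 4 hand-designed initial centroids."""
    # k-Means seeded on the archetypes, then left free to evolve.
    kmeans = KMeans(n_clusters=k, init=archetypes, n_init=1).fit(manual_params)
    centroids = kmeans.cluster_centers_
    # Per-class, per-parameter variance around each centroid.
    variances = np.array([manual_params[kmeans.labels_ == c].var(axis=0)
                          for c in range(k)])
    return centroids, variances
```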
5.2 k-Means Classification
The model resulting from the parametric active polygon algorithm can then be compared to each of the class centroids, by computing a 1-norm distance in the parameter space. As illustrated in Figure 7 the model is labeled with the closest class k^*, or with the two closest ones if \min_{k \neq k^*} \|M - M_k\|_1 < \beta \cdot \min_k \|M - M_k\|_1 (with \beta > 1).
Fig. 7. Obtained models and closest leaf shape centroids
It is actually frequent that the shape of a leaf cannot be assigned to one class only, even by an experienced botanist. Often returning two canonical shapes as a classification result is therefore not, strictly speaking, an approximation, but rather complies with a natural reality.
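A sketch of the classification rule; the value of β is a placeholder, as the paper does not state the one used.

```python
import numpy as np

def classify_shape(model_params, centroids, beta=1.3):
    """Label a fitted template with the closest class centroid, or with the two
    closest ones when the runner-up is within a factor beta of the best match."""
    d = np.abs(centroids - model_params).sum(axis=1)   # 1-norm distances in parameter space
    order = np.argsort(d)
    best, second = order[0], order[1]
    if d[second] < beta * d[best]:
        return [best, second]
    return [best]
```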
6 Experiments and Results
6.1 Experimental Procedure
Tests were performed on two databases: a white-background leaf database2 and a set of natural scene leaf images, centered and rotated to fit the desired conditions. Unfortunately, we have not yet collected images from smartphone cameras, taken by users from a prototypal interface, which would correspond best to the images the algorithm was designed to work on. Each image is then set to roughly the same size and the model launched on it, to estimate the shape of the leaf. A ground-truth class labeling was first performed by a group of about 15 specialists and non-specialists provided with the images and the 10 class models. Each image was then associated with one or two ordered classes depending on the votes of the labelers, and the classification in turn responds with one or two classes.

6.2 Performance Measure
Measuring the accuracy of such answers, consisting most of the time in an ordered pair of classes to be compared with another ordered pair, is not a common problem, and there does not seem to be a standard procedure to do so. We tried however to build simply a confusion matrix estimating how fitting the answer is. If the first actual class is present among the two predicted ones, 2 points are placed on the corresponding diagonal cell. If the second actual class is present, 1 point is placed on the corresponding diagonal cell, and if the first class was not recognized, 1 point is placed on the cell from the line of the first class, in the column of the other recognized class. And finally, if none of the classes are present, 1 point is placed on each of the cells from the lines corresponding to the actual classes, in the columns corresponding to the predicted classes. The points are then normalized to have their total equal to 1.
2 Database collected by P.-Y. Landouer on his website http://www.lesarbres.fr
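One possible reading of these scoring rules in code; how the case where both actual classes are predicted is counted, and which column receives the off-diagonal point, are ambiguous in the description and interpreted here.

```python
import numpy as np

def accumulate_confusion(confusion, actual, predicted):
    """Add one image's points to a k x k confusion matrix. `actual` and
    `predicted` are ordered lists of one or two class indices."""
    first = actual[0]
    second = actual[1] if len(actual) > 1 else None
    if first in predicted:
        confusion[first, first] += 2                  # first actual class recognized
    if second is not None and second in predicted:
        confusion[second, second] += 1                # second actual class recognized
        if first not in predicted:
            confusion[first, second] += 1             # credit the recognized class on the first's row
    if first not in predicted and (second is None or second not in predicted):
        for a in actual:                              # nothing recognized at all
            for p in predicted:
                confusion[a, p] += 1

# After accumulating all images: confusion /= confusion.sum()   # normalise the total to 1
```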
Table 1. Confusion matrix for 99 white background images

     M0  M1  M2  M3  M4  M5  M6  M7  M8  M9
M0  100   0   0   0   0   0   0   0   0   0
M1    6  93   0   0   0   0   0   0   0   0
M2    0   0  92   5   0   0   0   0   0   2
M3    0   0   1  94   0   0   0   2   0   2
M4    0   0   0   0 100   0   0   0   0   0
M5    0   0   3   0   0  96   0   0   0   0
M6    0   0   0   0   0   0  93   0   3   3
M7    0   0   0   0   0   0   0 100   0   0
M8    0   0   3   0   0   0   3   0  92   0
M9    0   0   0   0   0   0   0   4   0  95
Table 2. Confusion matrix for 113 natural scene images

     M0  M1  M2  M3  M4  M5  M6  M7  M8  M9
M0   87   5   2   0   5   0   0   0   0   0
M1    0  81   0   5   8   2   0   2   0   2
M2    0   0  79   8   0   4   0   2   0   6
M3    1   0   6  82   2   2   2   0   3   3
M4    0   0  13   8  79   0   0   0   0   0
M5    0   0   1   5   0  88   0   5   0   0
M6    1   3   8   8   0   0  76   0   2   2
M7   10  10   0   0   0   0   0  80   0   0
M8    3   4  12   3   0   0   3   0  75   3
M9    2   2   0   5   0   0  13   5   0  73
In the case of white background images (Table 1) the model shows a very good classification rate, and recognizes one of the classes in almost every case. Some classes already seem hard to discriminate, like M2 (Ovate) and M3 (Elliptic), as it is a problem even for humans to say when a leaf is ovate or elliptic. Another tendency is to label as Cordate (M8) some Triangular (M6) leaves. This may come from the very definition of the polygonal model, which has difficulties fitting exactly round shapes, and sometimes a heart-shaped base allows it to get closer to the actual border of a triangular leaf. This issue results in a wrong classification but remains close to the truth in terms of segmentation. For natural images (Table 2) the overall performance is obviously lower, but remains acceptable. The main problem is the variability of the colour inside the leaf, a factor that does not exist on white background images thanks to uniform controlled lighting. This often leads the model to segment part of the leaf accurately but not to enter the areas that are too far from the color model, resulting in a bad classification though the segmentation result would still be exploitable. The same tendencies appear, notably the difficulty to discriminate between Elliptic and Ovate, Ovate and Triangular, or Triangular and Orbicular (M9).

6.3 Comparison with Standard Active Contours
The main advantage our model proves to have over a regular snake algorithm based on the same energy term is that it is not sensitive to overflowing. The hard constraints contained in the definition and the evolution of the model impose that it remains in an acceptable shape and forbid any attempt at exploring unwanted regions. This is a guarantee that the final shape will not have grown out of
Fig. 8. Comparison between our polygonal model and standard active contour [16]
the leaf in some places, and is a substantial condition for the robustness of the rest of the process. It will also be more robust with respect to holes or spots inside the leaf, which will not change the shape as much as they would with a contour that could go around them. Such irregularities are smoothed by the fact that we consider a whole region of fixed shape.
7 Conclusions
We have presented a method to perform a first segmentation of a leaf in a natural scene, based on the optimization of a polygonal leaf model whose parameters also constitute a first set of descriptors with an eye to species identification. The model is flexible and covers most of the simple tree leaf shapes, and is a good basis for the segmentation of lobed and compound leaves. It is however rigid enough to avoid classic problems of active contour approaches, which allow for instance overflows to adjacent similar regions, and it makes use of a color model that is robust to uncontrolled lighting conditions. Improvements are to be considered. In some cases, relying only on color information does not seem to be the best choice; the model could benefit from the use of an additional texture model of the leaf. As for the color model, 2 Gaussians may not be enough to represent the diversity of the colors in a single leaf. A model with an adaptive number of Gaussians is a possible enhancement. The next step is then to refine the contour of the obtained polygon to fit the actual contour of the leaf in the image. The idea would be to use an active contour this time, constraining it to remain within a certain distance of the polygon. With a more accurate segmentation, we will be able to extract the local geometric features that characterize leaves, and to begin classifying leaf images into species. The model should also be extended to take into account lobed and compound tree leaves, and even non-tree leaves, and be tested over images from smartphone cameras. Nevertheless, it constitutes as such a good approach to estimate the shape of a leaf in a natural environment, without performing a thorough segmentation, and to associate the corresponding terms used by botanists to describe leaves.
References

1. Nilsback, M.E., Zisserman, A.: Delving into the whorl of flower segmentation. In: British Machine Vision Conference, vol. 1, pp. 570–579 (2007)
2. Saitoh, T., Kaneko, T.: Automatic recognition of blooming flowers. In: International Conference on Pattern Recognition, vol. 1, pp. 27–30 (2004)
3. Wang, Z., Chi, Z., Feng, D., Wang, Q.: Leaf image retrieval with shape features. In: Laurini, R. (ed.) VISUAL 2000. LNCS, vol. 1929, pp. 477–487. Springer, Heidelberg (2000)
4. Wang, X.F., Huang, D.S., Du, J.X., Huan, X., Heutte, L.: Classification of plant leaf images with complicated background. Applied Mathematics and Computation 205, 916–926 (2008)
5. Belhumeur, P., Chen, D., Feiner, S., Jacobs, D., Kress, W., Ling, H., Lopez, I., Ramamoorthi, R., Sheorey, S., White, S., Zhang, L.: Searching the world's herbaria: A system for visual identification of plant species. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 116–129. Springer, Heidelberg (2008)
6. Mokhtarian, F., Abbasi, S.: Matching shapes with self-intersections: Application to leaf classification. IEEE Transactions on Image Processing 13, 653–661 (2004)
7. Teng, C.H., Kuo, Y.T., Chen, Y.S.: Leaf segmentation, its 3d position estimation and leaf classification from a few images with very close viewpoints. In: Kamel, M., Campilho, A. (eds.) ICIAR 2009. LNCS, vol. 5627, pp. 937–946. Springer, Heidelberg (2009)
8. Manh, A.G., Rabatel, G., Assemat, L., Aldon, M.J.: Weed leaf image segmentation by deformable templates. Journal of Agricultural Engineering Research 80, 139–146 (2001)
9. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1, 321–331 (1988)
10. Chan, T., Vese, L.: Active contours without edges. IEEE Transactions on Image Processing 10, 266–277 (2001)
11. Unal, G., Yezzi, A., Krim, H.: Information-theoretic active polygons for unsupervised texture segmentation. International Journal of Computer Vision 62, 199–220 (2005)
12. Yuille, A., Hallinan, P., Cohen, D.: Feature extraction from faces using deformable templates. International Journal of Computer Vision 8, 99–111 (1992)
13. Felzenszwalb, P.: Representation and detection of deformable shapes. PAMI 27, 208–220 (2004)
14. Cremers, D., Tischhäuser, F., Weickert, J., Schnörr, C.: Diffusion snakes: introducing statistical shape knowledge into the Mumford-Shah functional. International Journal of Computer Vision 50, 295–313 (2002)
15. Coste, H.: Flore descriptive et illustrée de la France, de la Corse et des contrées limitrophes (1906)
16. Mille, J.: Narrow band region-based active contours and surfaces for 2d and 3d segmentation. Computer Vision and Image Understanding 113, 946–965 (2009)
Avoiding Mesh Folding in 3D Optimal Surface Segmentation

Christian Bauer1,2, Shanhui Sun1,2, and Reinhard Beichel1,2,3

1 Department of Electrical and Computer Engineering
2 The Iowa Institute for Biomedical Imaging
3 Department of Internal Medicine,
The University of Iowa, Iowa City, IA 52242, USA
[email protected]
Abstract. The segmentation of 3D medical images is a challenging problem that benefits from incorporation of prior shape information. Optimal Surface Segmentation (OSS) has been introduced as a powerful and flexible framework that allows segmenting the surface of an object based on a rough initial prior with robustness against local minima. When applied to general 3D meshes, conventional search profiles constructed for the OSS may overlap resulting in defective segmentation results due to mesh folding. To avoid this problem, we propose to use the Gradient Vector Flow field to guide the construction of non-overlapping search profiles. As shown in our evaluation on segmenting lung surfaces, this effectively solves the mesh folding problem and decreases the average absolute surface distance error from 0.82±0.29 mm (mean±standard deviation) to 0.79 ± 0.24 mm.
1 Introduction
Segmentation of 3D medical images is of importance for a large variety of applications. Incorporation of constraints regarding the approximate shape and position of the target structure proved beneficial in several applications because it improves the robustness of the segmentation process. In this case, the segmentation problem simplifies to adapting the initially roughly known object surface to the underlying image data. Within this context, the graph-theoretic approach for Optimal Surface Segmentation (OSS) presented by Li et al. [1] has attracted attention and has been applied successfully in various application domains [2,3,4,5,6]. Compared to other methods which adapt an initial surface to the underlying image data (e.g., active contours), the OSS approach does not perform a local optimization which may get stuck in local minima. Instead it allows obtaining a globally optimal solution, with respect to given constraints, using a graph-theoretic formulation that can be solved in low order polynomial computation time using a max-flow/min-cut algorithm [7]. The OSS approach introduced by Li et al. [1] and further developments based on this work allow not only incorporation of prior knowledge about the topology of the object to be segmented, but also incorporation of surface smoothness constraints [1,8],
Fig. 1. Optimal surface segmentation result showing mesh folding. (a) 3D mesh. (b) Enlarged subregion from (a) showing the folding area. (c) 2D segmentation result.
modeling of multiple interacting surfaces related to an object [2], and modeling of interactions between multiple objects [1,8,2], thus making it well suited for 3D segmentation applications. Using the OSS framework, the 3D surface segmentation task is transformed into a 2.5D segmentation problem consisting of the 2D manifold surface (mesh) in the 3D space and “distance” information from initial vertex positions along a search profile. The search profile describes possible positions of surface points with related costs for the surface to pass through this point. The OSS identifies the final segmentation result by placing the initial surface points at the “optimal” position along the search profile; considering the relation to neighboring surface point positions and their costs. As introduced in the work of Li et al. [1], these search profiles were straight lines. Such straight search profiles are sufficient in some application domains and were used by several authors [3,6,1,4]. In case of 3D meshes, the direction of the search profile associated with a vertex is typically obtained as the average surface normal direction of the adjacent mesh triangles [3,6]. However, such an approach may lead to problems in some application domains as the search profiles may overlap, which consequently may lead to mesh folding as shown in Fig. 1. Such mesh folding represents a severe problem for practical applications as there is no physically meaningful interpretation of the object anymore; it cannot be determined what is inside and what is outside of the object. This leads to severe problems ranging from misleading visualizations (Fig. 1) to problems in postprocessing of the meshes (e.g. voxelization). Thus, avoiding mesh folding is of importance in the context of OSS. In the literature this problem has only been addressed in a few publications by one research group [2,9,10,11]. In their earlier work they utilized a formulation based on the distance transformation [9], while in their later works they constructed the search profiles based on a force field produced by the initial surface points, which improved the segmentation quality compared to their previous
work [10]. However, the computation of the force vector at a certain point in space requires considering all vertex points of the initial surface. As the authors mention, this leads to prohibitively large computation times with increasing numbers of vertex points [11]. Another drawback of such potential force fields is that the force at a point in space depends on all surface points – even in case of another separating surface in between – which may lead to an undesired behavior regarding the search profile positions. This problem of potential force fields has been discussed in the literature in the context of skeletonization [12]. In this work, we present a solution to avoid mesh folding in OSS applications that addresses the drawbacks of the above mentioned approaches. In particular, it is also well applicable to high resolution meshes. The remainder of this work is structured as follows. First, we review basic methods used in our approach. Second, we describe our approach in detail. Third, we present an evaluation on lung surface segmentation utilizing robustly matched Active Shape Models (ASM) [6] as priors and compare the conventional approach for search profile construction with our proposed approach. Fourth, we analyze the results and discuss qualitative and quantitative improvements achieved with our approach.
2 Related Prior Work
In the following we will briefly introduce Optimal Surface Segmentation and the Gradient Vector Flow. Both represent basic concepts our method relies on, and the introduced notation will be utilized in the later sections of this work.

2.1 Optimal Surface Segmentation
Given a rough prior segmentation, Optimal Surface Segmentation (OSS) [1] allows the surface of an object to be identified more accurately. Following the ideas of Li et al. [1], the surface is considered as a terrain-like structure. The 3D surface is considered a planar structure where each surface vertex v is assigned a discrete “distance” value z ∈ SP(v) = {−D, .., D} from the initial surface z = 0. Each of the possible surface positions vz along the search profile SP(v) is assigned a cost c(vz), related to local image properties, for the desired surface to pass through the given point. Based on these positions along the search profile and the assigned costs, a node-weighted graph is constructed where each z-position of each vertex corresponds to a node (Fig. 2). The nodes are connected by infinite cost edges between the nodes along a search profile and between the nodes of the search profiles of neighboring vertices (Fig. 2). These edges between nodes of neighboring search profiles guarantee a smooth surface segmentation result, where the z-value between neighboring vertices cannot change by more than Δd. A globally optimal solution to the segmentation problem represented by this graph is obtained using a max-flow/min-cut algorithm [7]. In the original work of Li et al. [1], the initial surfaces and search profiles were quite regular structures (planar, cylindrical). Later the method has also been applied to general surface meshes in 3D. In this general case, the search
Fig. 2. Illustration of the graph structure used by the OSS approach for two neighboring vertices v1 and v2
profiles are typically constructed following the normal direction n(v0) of each vertex v0 on the initial surface, estimated as the average of the surface normals of the adjacent triangles of the mesh [3,6]. Assuming a distance of d mm between them, the 3D coordinates of the search profile points vz are determined as vz = v0 + d · z · n(v0).
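A schematic sketch of this graph construction is given below for the hard smoothness constraint Δd, following the standard transformation of the node-weighted closed-set problem into an s-t minimum cut [1,7]; networkx serves as a stand-in max-flow/min-cut solver, and the data layout (one cost list per vertex column, a mesh edge list) is an assumption for illustration rather than the authors' implementation.

```python
import networkx as nx

def optimal_surface(cost, mesh_edges, delta_d):
    """cost[v][z]: cost c(v_z) of placing vertex v at search-profile position z;
    mesh_edges: (v1, v2) pairs of neighboring vertices; returns one z per vertex."""
    V, Z = len(cost), len(cost[0])
    G = nx.DiGraph()
    shift = sum(sum(abs(c) for c in column) for column in cost) + 1.0  # keeps the closed set non-empty

    for v in range(V):
        for z in range(Z):
            # Node weight: cost difference along the column; bottom node shifted negative.
            w = cost[v][z] - (cost[v][z - 1] if z > 0 else shift)
            if w < 0:
                G.add_edge('s', (v, z), capacity=-w)
            else:
                G.add_edge((v, z), 't', capacity=w)
            if z > 0:
                G.add_edge((v, z), (v, z - 1))               # intra-column arc, infinite capacity
    for v1, v2 in mesh_edges:                                # smoothness: |z1 - z2| <= delta_d
        for a, b in ((v1, v2), (v2, v1)):
            for z in range(Z):
                G.add_edge((a, z), (b, max(0, z - delta_d)))  # infinite capacity (no 'capacity' attribute)

    _, (source_side, _) = nx.minimum_cut(G, 's', 't')
    # The surface passes through the topmost node of each column on the source side.
    return [max([z for z in range(Z) if (v, z) in source_side] or [0]) for v in range(V)]
```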
2.2 Gradient Vector Flow
The Gradient Vector Flow (GVF) has been introduced by Xu and Prince [13] as an external force field to guide the movement of an active contour towards image features. Given an initial vector field F(x) derived from a gradient magnitude image, the GVF performs a smoothing of the vector field, preserving large magnitude vectors and producing a smoothly varying vector field in between. Therefore, Xu and Prince [13] propose a variational formulation to obtain the GVF vector field V(x):

E(V) = \int_\Omega \mu |\nabla V(x)|^2 + |F(x)|^2 |V(x) - F(x)|^2 \, dx    (1)
where Ω is the image domain and μ a regularization parameter.
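A minimal sketch of the iterative (gradient descent) computation of this field for a 3D vector image, assuming unit voxel spacing; the explicit time step dt is an added stability parameter not mentioned in the text, and scipy's Laplacian stands in for the finite-difference scheme of [13].

```python
import numpy as np
from scipy.ndimage import laplace

def gradient_vector_flow(F, mu=0.5, iterations=500, dt=0.25):
    """Iterative GVF of an initial vector field F with shape (3, Z, Y, X),
    minimising Equation 1 by gradient descent (Xu and Prince [13])."""
    V = F.astype(float).copy()
    mag2 = np.sum(F ** 2, axis=0)                  # |F(x)|^2, weights the data term
    for _ in range(iterations):
        for c in range(V.shape[0]):                # update each vector component
            V[c] += dt * (mu * laplace(V[c]) - mag2 * (V[c] - F[c]))
    return V

# Initial field from a binary prior segmentation Ib (one inside, zero outside):
# F = -np.stack(np.gradient(Ib.astype(float)))    # F(x) = -grad(Ib), as used in Section 3
```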
3 Method
As outlined above, using straight search profiles along the surface normal direction is commonly utilized in OSS applications, but may lead to problems such as mesh folding when the search profiles overlap (Fig. 3(b)). Our proposed solution to avoid this problem is using the GVF vector field derived from the initial shape as a force field to guide the construction of non-straight non-overlapping search profiles. An example illustrating this basic idea and a comparison to the standard approach is shown in Fig. 3. To obtain an appropriate GVF vector field for search profile construction, an initial vector field is required with high gradients at the surface of the initial
Fig. 3. Illustration of the basic problem and our proposed solution. (a) Initial surface with surface normal directions of the surface points. (b) Straight search profiles built along the surface normal direction overlap and can cause mesh folding. (c) GVF vector field resulting from the initial vector field shown in (a). (d) Search profiles built following the direction vectors in the GVF vector field. Note that the search profiles may get close, but do not overlap.
shape pointing outwards. Assuming that initially a binary prior segmentation Ib (one inside and zero outside) is given, based on which the OSS is performed, the initial vector field F(x) is obtained as F(x) = −∇Ib. In case the initial prior segmentation is available as a mesh, a voxelization of the dataset has to be performed, or the surface normal direction of the surface vertex points can be mapped to the location of the closest voxel in the image. Based on this initial vector field F(x), the GVF vector field V(x) is obtained using (1). The search profiles are constructed starting from the surface points v0, by following the vector field in the direction of the GVF vectors (outside of the initial object) and against the direction of the GVF vectors (inside of the initial object):

v_z = \begin{cases} v_{z-1} + d \cdot V_n(v_{z-1}) & \text{if } z > 0 \\ v_0 & \text{if } z = 0 \\ v_{z+1} - d \cdot V_n(v_{z+1}) & \text{if } z < 0 \end{cases}    (2)

where V_n(x) = V(x)/|V(x)| is the normalized GVF vector at location x. Linear interpolation is used to obtain the vector direction at non-voxel centers. Please note that the step length d between search profile points has to be smaller than half the distance between two voxels to avoid sampling artifacts that could otherwise lead to minor mesh folding. As can be seen from the GVF vector field in Fig. 3(c), the GVF vectors vary smoothly. Following them starting from a surface point of the initial mesh in either direction allows building non-overlapping search profiles (Fig. 3(d)). In Fig. 4 a comparison of the two approaches on a lung surface mesh is shown. In areas where the initial surface is almost planar, both approaches for search profile construction show very similar behavior, with straight search profiles running in parallel. However, in highly curved areas the standard method results in overlapping search profiles, whereas the search profiles constructed using the GVF vector field bend and avoid overlap and thus mesh folding.
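A sketch of this profile construction, assuming the GVF field is stored as a (3, Z, Y, X) array and points are given in voxel units; scipy's map_coordinates provides the linear interpolation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def build_search_profile(v0, V, D=50, d=0.25):
    """Trace a search profile through the GVF field V starting at surface point
    v0 = (z, y, x), D steps in each direction with step length d (Equation 2)."""
    def field_direction(p):
        vec = np.array([map_coordinates(V[c], np.asarray(p).reshape(3, 1), order=1)[0]
                        for c in range(3)])                 # trilinear interpolation
        return vec / (np.linalg.norm(vec) + 1e-12)          # normalised GVF vector V_n

    outward, inward = [np.asarray(v0, float)], [np.asarray(v0, float)]
    for _ in range(D):
        outward.append(outward[-1] + d * field_direction(outward[-1]))   # z > 0: with the field
        inward.append(inward[-1] - d * field_direction(inward[-1]))      # z < 0: against it
    return inward[:0:-1] + outward          # profile points ordered from z = -D to z = +D
```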
Fig. 4. Examples of search profiles built based on the surface normal direction (top row) and the GVF vector field (bottom row)
4 Evaluation and Results
In this section we compare the standard method for search profile construction with our proposed GVF based approach and present quantitative and qualitative results for 16 lung segmentations in CT datasets (8 right lungs and 8 left lungs). The datasets were from cancer patients with large lung tumor regions in either the right or left lung. For each of the lungs, reference segmentations were generated semi-automatically by experts utilizing the commercial lung image analysis software "Pulmonary Workstation 2" (PW2) from Vida Diagnostics, Coralville, Iowa. The experts first applied the automated
segmentation of PW2 and in a second step inspected the segmentations slice-by-slice and corrected all segmentation errors manually, which may take up to several hours for a single dataset. As prior segmentations for the OSS, robustly matched ASMs [6] were available that roughly describe the lung surfaces but are not able to capture all smaller variations found in the datasets. For both search profile construction approaches, all parameters and input data were the same to allow for a fair comparison. The parameters for the OSS were roughly estimated prior to evaluation on test datasets and set as follows: D = 50 search profile points inside and outside with a distance of d = 0.25 mm between them and a smoothness constraint parameter of Δd = 5. The GVF was computed using the iterative update scheme presented in [13]; 500 iterations were calculated with a regularization parameter of μ = 0.5. As cost function for the OSS we used the normalized gradient magnitude c(x) = 1.0 − g(x)/g_max with g(x) = |∇(G_σ * I)(x)| and g_max the highest gradient magnitude in the image I. G_σ is a Gaussian filter kernel with variance σ = 2 mm. For quantitative evaluation the average absolute distance error d_a is used. Given a surface S under evaluation and the surface of a reference segmentation S', d_a is defined as
$$d_a = \frac{1}{|S|} \sum_{x \in S} |d(x, S')|,$$
with |S| denoting the number of points x of surface S and the distance of a point to the reference surface given by d(x, S') = min_{x' ∈ S'} ||x − x'||.
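For illustration, a hedged 2D sketch of the standard iterative GVF update of Xu and Prince [13] is given below; the paper applies the analogous 3D scheme to F(x) = −∇I_b with μ = 0.5 and 500 iterations, while the explicit time step dt used here is an assumption of this sketch.

```python
import numpy as np
from scipy.ndimage import laplace

def gvf(fx, fy, mu=0.5, iterations=500, dt=0.15):
    """Diffuse the initial vector field (fx, fy) into a gradient vector flow field (2D sketch)."""
    u, v = fx.astype(float), fy.astype(float)
    mag2 = fx ** 2 + fy ** 2                   # squared magnitude of the initial field
    for _ in range(iterations):                # explicit iterative update scheme of [13]
        u = u + dt * (mu * laplace(u) - (u - fx) * mag2)
        v = v + dt * (mu * laplace(v) - (v - fy) * mag2)
    return u, v

# Hypothetical usage: derive the initial field from a binary prior segmentation Ib.
# gy, gx = np.gradient(Ib.astype(float)); Vx, Vy = gvf(-gx, -gy)
```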
Fig. 5. Segmentation results in 2D slices of the CT datasets showing the (a) initialization surface (prior) for OSS, and derived OSS segmentation results using search profiles based on surface normal direction (b) and the GVF based approach (c).
We computed this error measure for each of the 16 segmentations with the search profiles of the OSS constructed using the surface normal direction and using the GVF based approach. Using the standard approach for building search profiles based on the surface normal directions, an average absolute distance error of d_a = 0.82 ± 0.29 mm (mean ± standard deviation) is achieved, while using the search profiles built based on the GVF vector field the average value decreased to d_a = 0.79 ± 0.24 mm. For comparison, the error of the initial ASM model surface is d_a = 2.26 ± 0.55 mm. Qualitative results comparing the approaches in 2D slices of the CT datasets and in 3D are presented in Figs. 5 and 6, respectively.
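As a minimal illustration, the error measure d_a can be approximated on mesh vertices as follows; treating the surfaces as point sets is an assumption of this sketch.

```python
import numpy as np
from scipy.spatial import cKDTree

def average_absolute_distance(surface_pts, reference_pts):
    """d_a = (1/|S|) * sum over x in S of min over x' in S' of ||x - x'||."""
    tree = cKDTree(reference_pts)            # nearest-neighbour structure on the reference surface S'
    distances, _ = tree.query(surface_pts)   # |d(x, S')| for every vertex x of S
    return float(np.mean(distances))
```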
Fig. 6. Segmentation results showing parts of the 3D meshes of the (a) initialization surface (prior) for OSS, and derived OSS segmentation results using search profiles based on surface normal direction (b) and the GVF based approach (c).
5 Discussion
As can be seen from the examples presented in Figs. 5(b) and 6(b), the standard approach for building search profiles for OSS based on the surface normal direction may lead to defective segmentation results in curved areas of the surfaces. Interpreting these results becomes challenging, and they may cause problems in further processing steps (e.g., voxelization). On the other hand, using our approach for building the search profiles based on a vector field obtained from the GVF solves this issue and avoids mesh folding problems (Figs. 5(c) and 6(c)). As shown by our quantitative evaluation, using the GVF based approach also leads to a measurable quantitative improvement in terms of surface distance
error. Arguably, the quantitative difference is small; however, the differences between the segmentation results are only local in the folding areas, while for large planar areas of the prior mesh the resulting search profiles and segmentation results are identical (Figs. 3, 4, 5, and 6). In the case of mesh priors with high curvature surfaces, noisy surfaces, or larger search areas (longer search profiles), the problems of using straight search profiles built along the surface normal direction would become more severe, as overlapping search profiles are more likely.
6 Conclusion
In this work, we presented a solution to avoid mesh folding in OSS applications. The OSS approach presented by Li et al. [1] represents a powerful and flexible framework that allows segmenting a surface based on a rough prior with robustness against segmentation errors due to local minima of the energy function. However, the standard approach for constructing the required search profiles on arbitrarily shaped 3D surfaces based on the surface normal direction is prone to mesh folding problems, which limits its applicability. To solve this issue, we proposed to use a vector field derived from the GVF to guide the construction of non-overlapping search profiles. As shown in our evaluation, this effectively solves the mesh folding problem and allows for a quantitative improvement of the segmentation quality. Thus, the presented approach represents a valuable advancement of OSS and makes it applicable to an even wider range of application domains. Acknowledgements. The authors thank Dr. Eric A. Hoffman and Dr. Joseph M. Reinhardt at the University of Iowa for providing lung data sets. This work was supported in part by NIH/NIBIB grant 5R01EB004640-05 and in part by NIH/NCI grant U01CA140206-010A.
References 1. Li, K., Wu, X., Chen, D.Z., Sonka, M.: Optimal surface segmentation in volumetric Images-A Graph-Theoretic approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 119–134 (2006) 2. Yin, Y., Zhang, X., Williams, R., Wu, X., Anderson, D.D., Sonka, M.: LOGISMOS– layered optimal graph image segmentation of multiple objects and surfaces: cartilage segmentation in the knee joint. IEEE Transactions on Medical Imaging 29, 2023–2037 (2010) 3. Lee, K., Johnson, R.K., Yin, Y., Wahle, A., Olszewski, M.E., Scholz, T.D., Sonka, M.: Three-dimensional thrombus segmentation in abdominal aortic aneurysms using graph search based on a triangular mesh. Computers in Biology and Medicine 40, 271–278 (2010) 4. Garvin, M.K., Abramoff, M.D., Kardon, R., Russell, S.R., Wu, X., Sonka, M.: Intraretinal layer segmentation of macular optical coherence tomography images using optimal 3-D graph search. IEEE Transactions on Medical Imaging 27, 1495– 1505 (2008) PMID: 18815101
5. Li, K., Jolly, M.: Simultaneous detection of multiple elastic surfaces with application to tumor segmentation in CT images. In: Proceedings of SPIE, San Diego, CA, USA, pp. 69143S–69143S–11 (2008) 6. Sun, S., McLennan, G., Hoffman, E.A., Beichel, R.: Model-based segmentation of pathological lungs in volumetric ct data. In: Proc. of Third International Workshop on Pulmonary Image Analysis, pp. 31–40 (2010) 7. Boykov, Y., Kolmogorov, V.: An experimental comparison of Min-Cut/Max-Flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 26, 1124–1137 (2004) ACM ID: 1018355 8. Song, Q., Wu, X., Liu, Y., Smith, M., Buatti, J., Sonka, M.: Optimal graph search segmentation using Arc-Weighted graph for simultaneous surface detection of bladder and prostate. In: Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (eds.) MICCAI 2009, Part II. LNCS, vol. 5762, pp. 827–835. Springer, Heidelberg (2009) ACM ID: 1691283 9. Yin, Y., Zhang, X., Sonka, M.: Optimal multi-object multi-surface graph search segmentation: Full-joint cartilage delineation in 3d. In: Medical Image Understanding and Analysis, pp. 104–108 (2008) 10. Yin, Y., Song, Q., Sonka, M.: Electric field theory motivated graph construction for optimal medical image segmentation. In: Torsello, A., Escolano, F., Brun, L. (eds.) GbRPR 2009. LNCS, vol. 5534, pp. 334–342. Springer, Heidelberg (2009) 11. Yin, Y.: Multi-surface, multi-object optimal image segmentation: application in 3D knee joint imaged by MR. PhD thesis, The University of Iowa (2010) 12. Hassouna, M.S., Farag, A.A.: Variational curve skeletons using gradient vector flow. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 2257–2274 (2009) 13. Xu, C., Prince, J.L.: Snakes, shapes, and gradient vector flow. IEEE Transactions on Image Processing 7, 359–369 (1998)
High Level Video Temporal Segmentation Ruxandra Tapu and Titus Zaharia Institut Télécom / Télécom SudParis, ARTEMIS Department, UMR CNRS 8145 MAP5, Evry, France {ruxandra.tapu,titus.zaharia}@it-sudparis.eu
Abstract. In this paper we propose a novel and complete video structuring/segmentation framework, which includes shot boundary detection, keyframe selection and high level clustering of shots into scenes. In a first stage, an enhanced shot boundary detection algorithm is proposed. The approach extends the state-of-the-art graph partition model and exploits a scale space filtering of the similarity signal which makes it possible to significantly increase the detection efficiency, with gains of 7.4% and 9.8% in terms of precision and recall rates, respectively. Moreover, in order to reduce the computational complexity, a two-pass analysis is performed. For each detected shot we propose a leap keyframe extraction method that generates static summaries. Finally, the detected keyframes feed a novel shot clustering algorithm which integrates a set of temporal constraints. Video scenes are obtained with average precision and recall rates of 85%.
1 Introduction The increasing amount of visual content (fixed images, video streams, 2D/3D graphical elements ...) available today on the Internet is challenging the scientific community to develop new methods for reliably accessing multimedia data. Existing approaches currently deployed in industrial applications are based mostly on textual indexation, which quickly shows its limitations, related to the intrinsically poly-semantic nature of multimedia content and to the various linguistic difficulties that need to be overcome. Moreover, when considering the specific issue of video indexing, the description exploited by current commercial search engines (e.g., YouTube, Dailymotion, Google Video...) is monolithic and global, treating each document as a whole. Such an approach takes into account neither the informational and semantic richness specific to video documents, nor their intrinsic spatio-temporal structure. As a direct consequence, the resulting granularity level of the description is not sufficiently fine to allow a robust and precise access to user-specified elements of interest (e.g., objects, scenes, events…). Within this framework, video segmentation and structuring represents a key and mandatory stage that needs to be performed prior to any effective description/classification of video documents. Our contributions concern an optimized shot boundary detector, a novel keyframe selection algorithm and a shot grouping mechanism into scenes based on temporal constraints. The rest of this paper is organized as follows. After a brief recall of some
basic theoretical aspects regarding the graph partition model exploited, Section 2 introduces the proposed shot detection algorithm. In Section 3, we describe the keyframe selection procedure. Section 4 introduces a novel scene extraction algorithm based on hierarchical clustering and temporal constraints. In Section 5, we present and analyze the experimental results obtained. Finally, Section 6 concludes the paper and opens some perspectives of future work.
2 Shot Boundary Detection
Shot-boundary detection algorithms have been intensively studied in the last decade, as the extremely rich literature dedicated to the subject testifies. This high interest in the field can be explained by the fact that the shot is generally considered as the "atom" which provides the basis for the majority of video abstraction and high level semantic description applications. The challenge is to elaborate robust shot detection methods, which can achieve high precision and recall rates whatever the movie quality and genre, the creation date and the techniques involved in the production process, while minimizing the amount of human intervention.
2.1 Related Work
The simplest way to determine a camera break is based on the absolute difference of pixel color intensities between two adjacent frames. Obviously, such methods are extremely sensitive even to reduced local object and camera motion, since they capture all details in a frame [1], [2]. A natural alternative to pixel-based approaches are methods based on color histogram comparisons, due to the invariance properties of histograms to the spatial distribution [3], [4]. Methods based on edges/contours have inferior performance compared to algorithms using color histograms, although the computational burden is greater [5]. However, such edge features can be useful for removing false alarms caused by abrupt illumination changes [3]. Algorithms using motion features [6] developed in the compressed domain [7] create an alternative to pixel and histogram-based methods due to their robustness to camera and object displacement. Regarding detection efficiency, these methods return inferior detection rates compared to histogram-based approaches due to motion vector incoherence.
2.2 Graph Partition Model
In this section, we propose an improved shot boundary detection algorithm, based on the graph partition (GP) model introduced in [3] and considered as state of the art for both abrupt (i.e., cuts) and gradual transition (i.e., fades, wipes…) detection. Let us first recall the considered model proposed in [8]. Within this framework, the video sequence is represented as an undirected weighted graph. We denote by G an ordered pair of sets (V, E) where V represents the set of vertices, V = {v_1, v_2, …, v_n}, and E ⊂ V × V denotes a set of pair-wise relationships, called edges. An edge e_{i,j} = {v_i, v_j} is said to join the vertices v_i and v_j, which are considered neighbors (or adjacent). In the shot boundary context, each frame of the video is represented as a vertex in the graph structure, connected with the others by edges. To each edge e_{ij}, a weight w_{ij} is associated, which expresses the similarity
between nodes v_i and v_j. In our case, we considered the chi-square distance between color histograms in the HSV color space, defined as follows:
$$w_{i,j} = \sum_k \frac{(H_k^i - H_k^j)^2}{H_k^i + H_k^j} \times e^{|i-j|}, \qquad (1)$$
where H^i denotes the HSV color histogram associated to frame i. The exponential term in equation (1) is used in order to take into account the temporal distance between frames: if two frames are located at an important temporal distance, it is highly improbable that they belong to the same shot. The video is segmented using a sliding window that selects a constant number of N frames, centered on the current frame n. A sub-graph G_n is thus computed for each frame n, together with the similarity matrix S_n which stores all the distances (weights) between the N nodes (frames) of the graph. Let V_n = {v_n^1, v_n^2, …, v_n^N} denote the vertices of graph G_n at frame n. For each integer k ∈ {1, …, N − 1}, a partition of the graph G_n into two sets (A_n^k = {v_n^1, …, v_n^k}, B_n^k = {v_n^{k+1}, …, v_n^N}) is defined. To each partition, the following objective function is associated:
$$Mcut(A_n^k, B_n^k) = \frac{cut(A_n^k, B_n^k)}{assoc(A_n^k)} + \frac{cut(A_n^k, B_n^k)}{assoc(B_n^k)}, \qquad (2)$$
where cut and assoc respectively denote the measures of cut (i.e., dissimilarity between the two elements of the partition) and association (i.e., homogeneity of each element of the partition), defined as described in equation (3):
$$cut(A_n^k, B_n^k) = \sum_{i \in A_n^k,\, j \in B_n^k} w_{i,j}, \quad assoc(A_n^k) = \sum_{i,j \in A_n^k} w_{i,j}, \quad assoc(B_n^k) = \sum_{i,j \in B_n^k} w_{i,j}. \qquad (3)$$
The objective is to determine the optimal value of k which maximizes the objective function in equation (2). In this way, the cut between the elements of the partition is maximized while the corresponding associations are simultaneously minimized. Finally, the maximal value of the Mcut measure is associated to frame n. A local dissimilarity vector v is thus constructed. For each frame n, v(n) is defined as:
$$v(n) = \max_{k \in \{1, \ldots, N-1\}} \left\{ Mcut(A_n^k, B_n^k) \right\}. \qquad (4)$$
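The following Python sketch illustrates Eqs. (1)–(4): it builds the weight matrix of the sub-graph G_n from the per-frame HSV histograms inside the sliding window and returns the maximal Mcut value as the local dissimilarity v(n). The variable names and the literal use of the temporal factor e^{|i−j|} are ours, not necessarily the authors' implementation.

```python
import numpy as np

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two normalized HSV histograms."""
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def local_dissimilarity(histograms):
    """histograms: (N, bins) array with the HSV histograms of the N frames in the window."""
    N = len(histograms)
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            W[i, j] = chi_square(histograms[i], histograms[j]) * np.exp(abs(i - j))  # Eq. (1)
    best = 0.0
    for k in range(1, N):                    # partition A = frames [0, k), B = frames [k, N)
        cut = W[:k, k:].sum()                # Eq. (3): cut(A, B)
        assoc_a, assoc_b = W[:k, :k].sum(), W[k:, k:].sum()
        if assoc_a > 0 and assoc_b > 0:
            best = max(best, cut / assoc_a + cut / assoc_b)   # Eq. (2): Mcut(A, B)
    return best                              # Eq. (4): v(n)
```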
A straightforward manner to detect shot transitions is to determine peaks in the dissimilarity vector v that are greater than a pre-defined threshold value T_shot. However, in practice, determining an optimal value for the threshold parameter T_shot is a difficult challenge because of large object motion or abrupt and local changes of the lighting conditions, leading to both false alarms and missed detections (cf. Section 5).
2.3 Scale Space Filtering
For these reasons, we propose to perform the analysis within the scale space of derivatives of the dissimilarity vector v. More precisely, let v'(n) denote the first order derivative of vector v(n), defined as the following finite difference:
$$v'(n) = v(n) - v(n-1). \qquad (5)$$
We construct the set of cumulative sums $\{v'_k(n)\}_{k=1}^{N}$ over the difference signal v'(n), up to order N, by setting:
$$v'_k(n) = \sum_{p=0}^{k} v'(n-p), \qquad (6)$$
The signals v'_k(n) can be interpreted as low-pass filtered versions of the derivative signal v'(n), with increasingly larger kernels, and constitute our scale space analysis. By summing the terms of the above equation, the cumulative sum v'_k(n) can be simply expressed as:
$$v'_k(n) = v(n) - v(n-k). \qquad (7)$$
Fig. 1 illustrates the set of derivative signals obtained. We can observe that smoother and smoother signals are produced, which can be helpful to eliminate variations related to camera/large object motions. The peaks which are persistent through several scales correspond to large variations and can be exploited to detect transitions. In order to detect such peaks, a non-linear filtering is first applied to the multi-scale representation. More precisely, the following filtered signal is constructed:
$$d(n) = \max_k \left\{ v'_k(n) \cdot h(k) \right\} = \max_k \left\{ \left( v(n) - v(n-k) \right) \cdot h(k) \right\}, \qquad (8)$$
where the weights h(k) are defined as:
$$h(k) = \begin{cases} e^{-k}, & k \in \left[0, \frac{N-1}{2}\right] \\ e^{-(N-1-k)}, & k \in \left[\frac{N+1}{2}, N\right] \end{cases} \qquad (9)$$
Fig. 1. The set of scale space derivatives obtained
The shot detection process is applied on the d(n) signal thus obtained. The weighting mechanism adopted privileges derivative signals located at the extremities of the scale space analysis (Fig. 2). In this way, only peaks that are persistent through all the considered scales are retained.
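A minimal sketch of the filtering of Eqs. (7)–(9) is given below; the exact range of scales k and the handling of the first N samples are assumptions of this illustration.

```python
import numpy as np

def scale_space_filter(v, N):
    """v: 1D array of dissimilarity values; returns the filtered signal d(n) of Eq. (8)."""
    def h(k):                                # weights of Eq. (9), privileging the extremities
        return np.exp(-k) if k <= (N - 1) / 2 else np.exp(-(N - 1 - k))
    d = np.zeros(len(v))
    for n in range(N, len(v)):
        d[n] = max((v[n] - v[n - k]) * h(k) for k in range(1, N + 1))
    return d
```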
Fig. 2. False alarms due to large object motion are avoided when using the scale-space filtering approach
2.4 Two Pass Approach
We focused next on reducing the computational complexity of the proposed algorithm. In the following part we introduce a two-step analysis process. In a first stage, the algorithm detects video segments that can be reliably considered as belonging to the same shot. Here, a simple (and fast) chi-square comparison of the HSV color histograms associated to each pair of consecutive frames is performed, instead of applying the graph partition model. At the same time, abrupt transitions presenting large discontinuity values are detected. In the second stage, we consider the scale space filtering method described above, but applied uniquely to the remaining uncertain video segments. This second step makes it possible to distinguish between gradual transitions and fluctuations related to large camera/object motion.
3 Keyframe Extraction The objective is to determine, for each detected shot, a set of keyframes that might represent in a pertinent manner the associated content. The keyframe selection process is highly useful for video indexing applications. On one hand, keyframes can be exploited for video summarization purposes [9]. On the other hand, they may be further used for high level shot grouping into scenes [10]. In our case, we have adopted a keyframe representation system that extracts a variable number of images from each detected shot, depending on the visual content variation. The first keyframe extracted is selected by definition, N (i.e., the window size used for the shot boundary detection) frames away after a detected transition, in order to ensure that the selected image does not belong to a gradual effect. Next, we
introduced a leap-extraction method that considers for analysis only the images located at integer multiples of the window size, and not the entire video flow as in [11], [12]. The current frames are compared with the existing set of already extracted keyframes. If the visual dissimilarity (chi-square distance of HSV color histograms) between the analyzed frames is significant (above a pre-established threshold), the current image is added to the keyframe set. By computing the graph partition within a sliding window, the method ensures that all the relevant information is taken into account. Let us also note that the number of keyframes detected per shot is not fixed a priori, but automatically adapted to the content of each shot.
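An illustrative sketch of the leap keyframe selection described above; the function names and the way histograms are passed in are assumptions, not the authors' code.

```python
def extract_keyframes(histograms, shot_start, shot_end, N, threshold, chi_square):
    """Leap keyframe selection: only frames at integer multiples of the window size N are tested."""
    keyframes = [shot_start + N]                       # first keyframe: N frames after the transition
    for n in range(shot_start + 2 * N, shot_end, N):   # leap through the shot, window by window
        # add the frame when it is dissimilar enough from every already selected keyframe
        if all(chi_square(histograms[n], histograms[k]) > threshold for k in keyframes):
            keyframes.append(n)
    return keyframes
```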
4 Scene Segmentation
The extracted keyframes are exploited to form scenes, defined as a collection of shots that present the same theme and share similar coherence in space, time and action [13].
4.1 Related Work
In recent years, various methods to partition video into scenes have been introduced. In [12], the authors transform the detection problem into a graph partition task. In [14], the detection is based on the concept of logical story units and an inter-shot dissimilarity measure. Different approaches [15] combine audio features and low-level visual descriptors. In [16], color and motion information are integrated in the decision process. A mosaic approach is introduced in [17] that uses information specific to camera settings or physical locations to determine boundaries. More recent techniques [18], [19] apply concepts such as temporal constraints and visual similarity.
4.2 The Proposed Scene Segmentation Algorithm
The proposed temporal constraint hierarchical clustering consists of iteratively merging shots falling into a temporal analysis window and satisfying certain clustering criteria. The width of the temporal analysis window, denoted by dist, is set to a value proportional to the average number of frames per shot:
$$dist = \alpha \cdot \frac{\text{Total number of frames}}{\text{Total number of shots}}, \qquad (10)$$
with α a user-defined parameter. We consider further that a scene is completely described by its constituent shots:
$$S_l = \left\{ s_{l,p} \right\}_{p=1}^{N_l} = \left\{ \left\{ f_{l,p,i} \right\}_{i=1}^{n_{l,p}} \right\}_{p=1}^{N_l}, \qquad (11)$$
where Sl denotes the lth video scene, Nl the number of shots included in scene Sl, sl,p the pth shot in scene Sl, and fl,p,i the ith keyframe of shot sl,p.
Our scene change detection algorithm based on shot clustering consists of the following steps:
Step 1: Initialization – The first shot of a film is automatically assigned to the first scene S_1. The scene counter l is set to 1.
Step 2: Shot to scene comparison – Consider as current shot s_cur the first shot which is not yet assigned to any of the already detected scenes. Detect the sub-set Ω of scenes anterior to s_cur and located at a temporal distance smaller than the parameter dist. Compute the visual similarity between the current shot and each scene S_k in the sub-set Ω, as described in the following equation:
$$\forall S_k \in \Omega, \quad SceneShotSim(s_{cur}, S_k) = \frac{n_{matched}}{n_{k,p} \cdot N_k \cdot n_{cur}}, \qquad (12)$$
where n_cur is the number of keyframes of the considered shot and n_matched represents the number of matched keyframes of the scene S_k. A keyframe from scene S_k is considered to be matched with a keyframe from shot s_cur if the visual similarity measure between the two keyframes is above a threshold T_group. Let us note that a keyframe from the scene S_k can be matched with multiple frames from the current shot. Finally, the current shot s_cur is identified to be similar to the scene S_k if:
$$SceneShotSim(S_k, s_{cur}) \geq 0.5. \qquad (13)$$
In this case, the current shot s_cur will be clustered into the scene S_k. At the same time, all the shots between the current shot and the scene S_k will also be assigned to scene S_k and marked as neutralized. Let us note that the scenes to which such neutralized shots initially belonged disappear (in the sense that they are merged into the scene S_k). The list of detected scenes is then updated. The neutralization process makes it possible to identify the most representative shots for a current scene (Fig. 3), which are the remaining non-neutralized shots. In this way, the influence of outlier shots, which might correspond to some punctual digressions from the main action in the considered scene, is minimized. If the condition described in equation (13) is not satisfied, go to Step 3.
Fig. 3. Neutralizing shots based on visual similarity
Step 3: Shot by shot comparison – If the current shot s_cur is highly similar (i.e., with a similarity at least twice the grouping threshold T_group) to a shot of any scene in the sub-set Ω determined at Step 2, then s_cur is merged into the corresponding scene together with all the intermediate shots. If s_cur is found highly similar to
multiple other shots, then the scene which is farthest from the considered shot is retained. Both the current shot and all its highly similar matches are unmarked and will contribute to the following clustering process as normal, non-neutralized shots (Fig. 4). This step ensures that shots highly similar to other shots in a previous scene are grouped into this scene, and aims at reducing the number of false alarms.
Step 4: Creation of a new scene – If the current shot s_cur does not satisfy any of the similarity criteria in Steps 2 and 3, a new scene, including s_cur, is created.
Step 5: Refinement – At the end, scenes including only one shot are attached to the adjacent scenes depending on the maximum similarity value. In the case of the first scene, this is grouped with the following one by default. Concerning the keyframe visual similarity involved in the above described process, we have considered two different approaches, based on (1) the chi-square distance between HSV color histograms, and (2) the number of matched interest points determined based on SIFT descriptors with a Kd-tree matching technique [20].
Fig. 4. Unmarking shots based on high similarity values
The grouping threshold T_group is adaptively established, depending on the visual content variation of the input video stream, as the average chi-square distance / number of interest points between the current keyframe and all anterior keyframes located at a temporal distance smaller than dist.
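A hedged sketch of the grouping decision of Step 2 (Eqs. (12)–(13)) follows; the normalization used here (fraction of matched scene keyframes) and the direction of the similarity comparison are assumptions reconstructed from the text and may differ from the authors' exact implementation.

```python
def scene_shot_similar(shot_keyframes, scene_keyframes, similarity, t_group):
    """Return True when at least half of the scene keyframes match some keyframe of the shot."""
    matched = sum(
        1 for kf_scene in scene_keyframes
        # a scene keyframe matches when its similarity to some shot keyframe exceeds T_group
        if any(similarity(kf_scene, kf_shot) >= t_group for kf_shot in shot_keyframes)
    )
    return matched / max(len(scene_keyframes), 1) >= 0.5   # decision rule of Eq. (13)
```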
5 Experimental Results In order to evaluate the shot boundary detection algorithm, we have considered a subset of videos from the "TRECVID 2001 and 2002 campaigns", which are available on the Internet (www.archive.org and www.open-video.org). The videos are mostly documentaries that vary in style and date of production, while including various types of both camera and object motion. Table 1 presents the precision, recall and F1 rates obtained for the reference graph partition shot boundary detection method proposed by Yuan et al. [3], while Table 2 summarizes the detection performances of the proposed scale-space filtering approach. The results presented clearly demonstrate the superiority of our approach, for both types of abrupt and gradual transitions. The global gains in recall and precision rates are of 9.8% and 7.4%, respectively. Moreover, when considering the case of gradual transitions, the improvements are even more significant. In this case, the recall and precision rates are respectively of 94.1% and 88.3% (with respect to R = 81.5% and P = 79% for the reference method [3]).
Table 1. Yuan et al. algorithm's performance
(Abrupt and Gradual transition columns: Total / D / MD / FA / R(%) / P(%) / F1(%); All transitions: R(%) / P(%) / F1(%))
Video title | Nr. frames | Total Trans. | Abrupt transitions | Gradual transitions | All transitions
NAD55  | 26104  | 185  | 107 / 103 / 4 / 22 / 96.2 / 82.4 / 88  | 78 / 68 / 10 / 24 / 87.2 / 73.9 / 80   | 92.4 / 78.8 / 85
NAD57  | 10006  | 73   | 45 / 39 / 6 / 6 / 86.6 / 86.6 / 86     | 28 / 22 / 6 / 5 / 78.6 / 81.5 / 80     | 84.7 / 84.7 / 84
NAD58  | 13678  | 85   | 40 / 38 / 2 / 7 / 95 / 84.4 / 89       | 45 / 35 / 10 / 18 / 77.8 / 66 / 71     | 85.8 / 74.5 / 79
UGS01  | 32072  | 180  | 86 / 78 / 8 / 14 / 90.7 / 84.8 / 87    | 94 / 79 / 15 / 26 / 84 / 75.2 / 79     | 87.2 / 79.6 / 83
UGS04  | 38674  | 242  | 161 / 142 / 19 / 17 / 88.2 / 89.3 / 88 | 81 / 65 / 16 / 10 / 80.2 / 86.7 / 83   | 85.5 / 88.6 / 87
UGS09  | 23918  | 213  | 44 / 43 / 1 / 8 / 97.2 / 84.3 / 90     | 169 / 137 / 32 / 40 / 81.1 / 77.4 / 79 | 84.5 / 78.9 / 81
23585a | 14797  | 153  | 80 / 60 / 20 / 5 / 75 / 92.3 / 82      | 73 / 58 / 15 / 2 / 79.5 / 96.6 / 87    | 77.1 / 94.4 / 84
10558a | 19981  | 141  | 79 / 68 / 11 / 10 / 86.1 / 87.1 / 86   | 62 / 48 / 14 / 4 / 77.4 / 92.3 / 84    | 82.2 / 89.2 / 85
06011  | 23918  | 153  | 81 / 74 / 7 / 8 / 91.3 / 90.2 / 90     | 72 / 60 / 12 / 23 / 83.3 / 72.2 / 77   | 87.5 / 81.2 / 84
TOTAL  | 203148 | 1425 | 723 / 645 / 78 / 97 / 89.2 / 86.9 / 88 | 702 / 572 / 130 / 152 / 81.5 / 79 / 80 | 85.4 / 83 / 84
Table 2. Scale-space filtering algorithm performance
(Abrupt and Gradual transition columns: Total / D / MD / FA / R(%) / P(%) / F1(%); All transitions: R(%) / P(%) / F1(%))
Video title | Nr. frames | Total Trans. | Abrupt transitions | Gradual transitions | All transitions
NAD55  | 26104  | 185  | 107 / 107 / 0 / 9 / 100 / 92.2 / 96    | 78 / 73 / 5 / 18 / 93.5 / 80.2 / 86    | 97.2 / 86.9 / 91
NAD57  | 10006  | 73   | 45 / 43 / 2 / 2 / 95.5 / 95.5 / 95     | 28 / 24 / 4 / 3 / 85.7 / 88.8 / 87     | 91.7 / 93.1 / 92
NAD58  | 13678  | 85   | 40 / 38 / 2 / 4 / 95 / 90.4 / 92       | 45 / 43 / 2 / 10 / 95.5 / 81.1 / 87    | 95.2 / 85.2 / 89
UGS01  | 32072  | 180  | 86 / 84 / 2 / 11 / 97.6 / 88.4 / 92    | 94 / 93 / 1 / 15 / 94.1 / 91.9 / 92    | 95.3 / 91.8 / 93
UGS04  | 38674  | 242  | 161 / 153 / 8 / 8 / 95.1 / 95.0 / 95   | 81 / 74 / 7 / 6 / 91.3 / 92.5 / 91     | 93.8 / 94.1 / 93
UGS09  | 23918  | 213  | 44 / 44 / 0 / 4 / 100 / 91.6 / 95      | 169 / 159 / 10 / 14 / 98.9 / 86.1 / 92 | 98.3 / 87.1 / 92
23585a | 14797  | 153  | 80 / 75 / 5 / 4 / 93.7 / 94.9 / 94     | 73 / 67 / 6 / 1 / 91.7 / 98.5 / 94     | 92.8 / 96.5 / 94
10558a | 19981  | 141  | 79 / 76 / 3 / 9 / 96.2 / 89.4 / 92     | 62 / 60 / 2 / 10 / 96.7 / 85.7 / 90    | 96.4 / 87.7 / 91
06011  | 23918  | 153  | 81 / 76 / 5 / 6 / 93.8 / 92.6 / 93     | 72 / 68 / 4 / 10 / 94.4 / 87.1 / 90    | 94.1 / 90 / 92
TOTAL  | 203148 | 1425 | 723 / 696 / 27 / 57 / 96.2 / 92.4 / 94 | 702 / 661 / 41 / 87 / 94.1 / 88.3 / 91 | 95.2 / 90.4 / 92
This shows that the scale space filtering approach makes it possible to eliminate the errors caused by camera/object motion. The validation of our scene extraction method is done on a corpus of 6 sitcoms and 6 Hollywood movies. For the sitcoms, the scene boundary ground truth has been established by human observers (Table 3), while for the Hollywood videos we use for comparison the DVD chapters identified by the movie producers. In Tables 3 and 4 we present the detection results obtained when extracting interest points and the HSV color histogram as representative features. As can be noticed, the efficiency of our method is comparable in both cases. The Hollywood videos allow us to make a complete evaluation of our method against the state of the art algorithms [12], [18], [19], which yield recall and precision rates of 62% and 76%. For this corpus, our precision and recall rates are of 68% and 87% respectively, which clearly demonstrates the superiority of our approach in both parameters (Table 4).
Table 3. Our scene extraction algorithm performance evaluation
(Columns per feature: D / FA / MD / R(%) / P(%) / F1(%))
Video name | Ground truth scenes | SIFT descriptor | HSV color histogram
Seinfeld           | 24  | 19 / 1 / 5 / 95.00 / 79.16 / 86.36    | 20 / 0 / 4 / 100 / 83.33 / 90.88
Two and a half men | 21  | 18 / 0 / 3 / 100 / 81.81 / 90.00      | 17 / 2 / 4 / 89.47 / 80.95 / 85.01
Prison Break       | 39  | 31 / 3 / 8 / 91.17 / 79.48 / 84.93    | 33 / 0 / 6 / 100 / 84.61 / 91.66
Ally McBeal        | 32  | 28 / 11 / 4 / 71.79 / 87.50 / 78.87   | 24 / 4 / 8 / 84.00 / 75.00 / 79.24
Sex and the city   | 20  | 17 / 0 / 3 / 100 / 85.00 / 91.89      | 15 / 1 / 5 / 93.75 / 75.00 / 83.33
Friends            | 17  | 17 / 7 / 0 / 70.83 / 100 / 82.92      | 17 / 7 / 0 / 70.83 / 100 / 82.92
TOTAL              | 153 | 130 / 22 / 23 / 85.52 / 84.96 / 85.23 | 125 / 14 / 27 / 81.69 / 89.92 / 85.61
Table 4. DVD chapters' detection scores based on our proposed method
(Columns per feature: D / FA / MD / R(%) / P(%) / F1(%))
Video name | Ground truth scenes | SIFT descriptor | HSV color histogram
5th Element     | 37  | 36 / 39 / 1 / 97.29 / 48.00 / 64.28    | 35 / 25 / 2 / 94.59 / 58.33 / 72.16
Ace Ventura     | 31  | 28 / 12 / 3 / 90.32 / 70.00 / 78.87    | 25 / 5 / 6 / 80.64 / 83.87 / 82.23
Lethal Weapon 4 | 46  | 44 / 41 / 2 / 95.65 / 52.87 / 68.09    | 43 / 28 / 3 / 93.47 / 60.56 / 73.50
Terminator 2    | 58  | 51 / 7 / 7 / 87.93 / 87.93 / 87.93     | 45 / 4 / 13 / 77.58 / 91.83 / 84.11
The Mask        | 34  | 32 / 12 / 2 / 94.11 / 71.11 / 81.01    | 32 / 16 / 2 / 94.11 / 66.66 / 78.04
Home Alone 2    | 29  | 28 / 22 / 1 / 96.55 / 56.00 / 70.88    | 25 / 17 / 4 / 86.20 / 59.52 / 70.41
TOTAL           | 235 | 219 / 133 / 16 / 93.19 / 62.42 / 74.76 | 205 / 95 / 30 / 87.23 / 68.33 / 76.66
Moreover, we have studied the impact of the temporal constraint length (parameter α in equation (10)) on the proposed scene detection method. In Fig. 5 we present the precision, recall and F1 scores for various values of the α parameter. As can be noticed, a value between 5 and 10 returns similar results in terms of the overall efficiency.
Fig. 5. Precision, recall and F1 score variation for different α values: (a) scene extraction; (b) DVD chapter detection
6 Conclusion and Perspective In this paper, we have proposed a novel methodological framework for temporal video structuring and segmentation, which includes shot boundary detection, keyframe extraction and scene identification methods. Our contributions concern more specifically an improved, two-step shot boundary detection algorithm, based on scale space filtering of the similarity signal derivatives, a fast keyframe extraction mechanism, and finally a reliable scene detection method based on shot clustering and temporal constraints, validated using two types of features: HSV color histograms and interest points. The shot boundary detection method provides high precision and recall rates (with gains of up to 9.8% and 7.4%, respectively, in the case of all transitions), while reducing the associated computational time by 25%. Regarding the strategy of merging shots into scenes, we have adopted a grouping method based on temporal constraints that uses adaptive thresholds and neutralized shots. The experimental evaluation validates our approach, when using either interest points or HSV color histograms. The F1 measure, when applying our method to sitcom scene detection, is 85%, while for DVD chapter detection it is 76%.
Our perspectives of future work will concern the integration of our method within a more general framework of video indexing and retrieval applications, including object detection and recognition methodologies. On one hand, this can further refine the level of description required in video indexing applications. On the other hand, identifying similar objects in various scenes can be helpful for the scene identification process. Finally, we intend to integrate within our approach motion cues that can be useful for both reliable shot/scene/keyframe detection and event identification.
References 1. Zhang, H.J., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video. Multimedia Systems (1), 10–28 (1993) 2. Lienhart, R., Pfeiffer, S., Effelsberg, W.: Video Abstracting. Communications of the ACM, 1–12 (1997) 3. Yuan, J., Wang, H., Xiao, L., Zheng, W., Li, J., Lin, F., Zhang, B.: A formal study of shot boundary detection. IEEE Trans. Circuits Systems Video Tech. 17, 168–186 (2007) 4. Gargi, U., Kasturi, R., Strayer, S.: Performance characterization of video shot-change detection methods. IEEE Trans. Circuits and Systems for Video Technology CSVT-10(1), 1–13 (2000) 5. Zabih, R., Miller, J., Mai, K.: A feature-based algorithm for detecting and classifying scene breaks. In: Proc. ACM Multimedia, vol. 95, pp. 189–200 (1995) 6. Porter, S.V., Mirmehdi, M., Thomas, B.T.: Video cut detection using frequency domain correlation. In: 15th International Conference on Pattern Recognition, pp. 413–416 (2000) 7. Fernando, W.A.C., Canagarajah, C.N., Bull, D.R.: Scene change detection algorithms for content-based video indexing and retrieval. IEE Electronics and Communication Engineering Journal, 117–126 (2001) 8. Hendrickson, B., Kolda, T. G.: Graph partitioning models for parallel computing. Parallel Computing Journal (26), 1519–1534 (2000) 9. Guironnet, M., Pellerin, D., Guyader, N., Ladret, P.: Video summarization based on camera motion and a subjective evaluation method. In: EURASIP (2007) 10. Hanjalic, A., Xu, Q.: Affective video content representation and modeling. IEEE Trans. Multimedia 7(1), 143–154 (2005) 11. Zhang, H., Wu, J., Zhong, D., Smoliar, S.W.: An integrated system for content-based video retrieval and browsing. Pattern Recognition 30(4), 643–658 (1999) 12. Rasheed, Z., Sheikh, Y., Shah, M.: On the use of computable features for film classification. IEEE Trans. Circuits Syst. Video Technol. 15(1), 52–64 (2005) 13. Truong, B., Venkatesh, S.: Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP) 3(1), 3-es (2007) 14. Hanjalic, A., Lagendijk, R.L., Biemond, J.: Automated high-level movie segmentation for advanced video-retrieval systems. IEEE Circuits Syst Video Technol. 9, 580–588 (1999) 15. Ariki, Y., Kumano, M., Tsukada, K.: Highlight scene extraction in real time from baseball live video. In: Proceeding on ACM International Workshop on Multimedia Information Retrieval, pp. 209–214 (2003) 16. Ngo, C.W., Zhang, H.J.: Motion-based video representation for scene change detection. Int. J. Comput. Vis. 50(2), 127–142 (2002)
High Level Video Temporal Segmentation
235
17. Aner, A., Kender, J.R.: Video summaries through mosaic-based shot and scene clustering. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002, Part IV. LNCS, vol. 2353, pp. 388–402. Springer, Heidelberg (2002) 18. Chasanis, V., Kalogeratos, A., Likas, A.: Movie Segmentation into Scenes and Chapters Using Locally Weighted Bag of Visual Words. In: Proceeding of the ACM International Conference on Image and Video Retrieval (CIVR 2009) (2009) 19. Zhu, S., Liu, Y.: Video scene segmentation and semantic representation using a novel scheme. Multimedia Tools and Applications 42(2), 183–205 (2009) 20. Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 1–28 (2004)
Embedding Gestalt Laws on Conditional Random Field for Image Segmentation Olfa Besbes1, Nozha Boujemaa2, and Ziad Belhadj1
1 URISA - SUPCOM, Parc Technologique 2088 Ariana, Tunisia 2 INRIA Saclay Île-de-France, 91893 Orsay, France {olfa.besbes,ziad.belhadj}@supcom.rnu.tn, [email protected]
Abstract. We propose a higher order conditional random field built over a graph of superpixels for partitioning natural images into coherent segments. Our model operates at both superpixel and segment levels and includes potentials that capture similarity, proximity, curvilinear continuity and familiar configuration. For a given image, these potentials enforce consistency and regularity of labellings. The optimal one should maximally satisfy local, pairwise and global constraints imposed respectively by the learned association, interaction and higher order potentials. Experiments on a variety of natural images show that integration of higher order potentials qualitatively and quantitatively improves results and leads to more coherent and regular segments.
1 Introduction
In computer vision, segmentation refers to the process of partitioning a digital image into multiple coherent segments. Despite many thoughtful attempts, this long standing problem still remains a challenging task, mainly for the following two reasons:
– First, natural images are ambiguous because of the visual content heterogeneity and the intra-region appearance variations.
– Second, it is difficult to achieve a good balance between objective criteria that depend mostly on the intrinsic low-level statistics of regions and subjective criteria that attempt to imitate human perception.
In order to address these issues, most studies generally defined the segmentation problem as finding the labelling of an image that minimizes a specific energy function, formulated using either generative or discriminative approaches. For generative methods, the posterior probability over the labellings given the observations is expressed in terms of likelihood and prior functions. The likelihood measures the similarity of pixels (or superpixels) to a set of statistical models, from which coherent segments can be generated. The prior imposes regularity
on boundaries [1], conformity to predefined shapes [2] or smoothness on labels [3,4]. Various segmentation methods have been proposed to perform inference in such Bayes-based models. For instance, Level Set methods [5] use a coupled system of PDE equations in order to evolve curves from their initial positions to boundaries. The Data-Driven Markov Chain Monte Carlo paradigm [6] simulates the posterior probability by stochastic sampling and speeds up its convergence using discriminative proposal probabilities. Recently, graph theoretic approaches such as Graph Cuts [7] and Swendsen-Wang Cut [8] have been demonstrated to be more efficient, since they give a solution with lower energy more quickly. In these frameworks, Gestalt cues of proximity and good continuation were also incorporated by exploiting explicitly the graph representation. Such generative methods ensure a consistent solution from which the observed image is generated. However, they require solving a difficult inference problem as they exploit various models to deal with natural images. In contrast, discriminative methods are usually convenient to implement and computationally attractive since they focus directly on objective criteria. We cite spectral segmentation methods [9,10], which seek to partition a graph into salient groups by maximizing a similarity criterion: both the total dissimilarity between the different groups as well as the total similarity within the groups should be high. Recently, conditional random field models [11,12] have proved more discriminative and have led to more consistent labellings than similarity criteria. CRF models define directly the posterior probability of labels given the observations in terms of conventionally used unary and pairwise potentials. CRFs have been applied to figure/ground labelling [13], image segmentation [14,15,16], as well as contour completion [17]. Gestalt cues such as similarity, proximity, good continuation and curvilinear continuity were incorporated to perform these tasks. Instead of a simple knowledge about the shape of one object of interest [13], we integrate generic knowledge (so called familiar configuration) about natural scenes [18] using high order CRFs [19]. In fact, we augment the pairwise CRF model by higher order potentials in order to embed Gestalt laws of similarity, proximity as well as familiar configuration. This is a significant distinction of our work from the related approaches [14,16], which used a pairwise CRF model for grouping. Besides, we learn our model using a database of human segmented natural images so that we boost low-level segmentation by a useful additional source of knowledge. Finally, we use the Swendsen-Wang Cut algorithm [8] to perform inference in our high order CRF model so that we combine the representational advantages of CRF and Graph Cut approaches. The paper is organized as follows: In section 2, we introduce our high order CRF model which incorporates Gestalt laws of similarity, proximity and familiar configuration to deal with the natural image segmentation task. In section 3, we define for visual entities (superpixels, segments) a set of features including color, texture and contour cues. In section 4, we learn the high order CRF potentials. In section 5, we describe the inference using the Swendsen-Wang Cut algorithm. Finally, we present experimental results in section 6 and conclude in section 7.
2 A High Order Conditional Random Field for Image Segmentation
We treat image segmentation as a graph partitioning problem and propose an efficient discriminative model for segmenting the graph. A complete weighted undirected graph G = (V, E) of superpixels is constructed where an edge is formed between every pair of vertices. In grouping, we look to partition the set of vertices V into disjoint segments, $V = \cup_{k=1}^{N} V_k$ and $V_k \cap V_j = \emptyset,\ \forall k \neq j$, by optimizing the CRF posterior probability $p(L|Y,\theta) = \frac{1}{Z}\exp\left(-E(L|Y,\theta)\right)$ over labellings $L = \{l_i\}_{i \in V}$ given the observations $Y = \{y_i\}_{i \in V}$ and the model parameters θ. Z is a normalizing constant known as the partition function and $E(L|Y,\theta) = \sum_{c \in C} \psi_c(L_c, Y; \theta)$ is the Gibbs energy defined on the set of all cliques C [19]. The term $\psi_c(L_c, Y; \theta)$ is the potential function of the clique c ∈ C, where $L_c = \{l_i, i \in c\}$. In order to embed Gestalt laws of similarity, proximity and familiar configuration, we define this Gibbs energy on unary, pairwise as well as higher order cliques, so that it takes the following form:
$$E(L|Y,\theta) = \alpha \sum_{i \in V} \psi_i(l_i|Y,\theta_u) + \beta \sum_{i \in V} \sum_{j \in N_i} \psi_{ij}(l_i, l_j|Y,\theta_p) + \gamma \sum_{c \in S} \psi_c(L_c|Y,\theta_h), \qquad (1)$$
where $\theta = \{\alpha, \beta, \gamma, \theta_u, \theta_p, \theta_h\}$ is the set of model parameters. N is the adjacency neighbourhood defined on the set of superpixels. S refers to the set of all segments and $\psi_c$ is the high order potential function defined on them. The optimal or maximum a posteriori (MAP) labelling $\hat{L}$ of the conditional random field is defined as: $\hat{L} = \arg\max_L p(L|Y,\theta) = \arg\min_L E(L|Y,\theta)$. We will now describe in detail how these three potential functions are specified.
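As an illustration, a minimal sketch of evaluating the Gibbs energy of Eq. (1) over a superpixel graph is given below; psi_unary, psi_pairwise and psi_segment stand for the potentials of Sects. 2.1–2.3, and the default weights are the values reported later in the experiments (Sect. 6). This is a sketch, not the authors' implementation.

```python
def gibbs_energy(labels, neighbours, segments, psi_unary, psi_pairwise, psi_segment,
                 alpha=1.0, beta=0.8, gamma=0.3):
    """Weighted sum of unary, pairwise and higher order potentials over a superpixel graph."""
    unary = sum(psi_unary(i, labels[i]) for i in range(len(labels)))
    pairwise = sum(psi_pairwise(i, j, labels[i], labels[j])
                   for i in range(len(labels)) for j in neighbours[i])
    higher = sum(psi_segment(segment) for segment in segments)   # one term per current segment
    return alpha * unary + beta * pairwise + gamma * higher
```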
2.1 Unary Potential
The unary potential $\psi_i$ measures how likely a superpixel i ∈ V will independently take a label $l_i$ given the observations Y. We use a local discriminative model $p_1 = \frac{1}{1 + \rho^{-1}}$ based on the likelihood ratio $\rho = \frac{P_{same}}{P_{diff}}$ to define it. This model estimates the significance of the similarity between a superpixel i and its associated segment $V_k$. In contrast to [14,16], we reduce the number of local model parameters to only one weighting parameter α. Let $\chi^2(H_i, H_{V_k})$ be the distance between their descriptors in the feature space (see Sect. 3). This potential is defined as:
$$\psi_i(l_i|Y,\theta_u) = -\log p_1\left(\chi^2(H_i, H_{V_k})\right). \qquad (2)$$
In fact, we exploit the logistic function $f(x) = \frac{1}{1+\exp(-x)}$, where $x = \log \rho$, to predict the likelihood of a superpixel i being assigned to a segment $V_k$.
2.2 Pairwise Potential
The pairwise potential ψij measures how labels (li , lj ) of two adjacent superpixels (i ∈ V, j ∈ Ni ) should interact given the observations Y . Two superpixels are
adjacent if they share a segment of boundary on the image plane. We use a data-dependent discriminative model:
$$\psi_{ij}(l_i, l_j|Y,\theta_p) = -\log \begin{cases} p_1\left(\chi^2(H_i, H_j)\right)\left(1 - Pb_{ij}\right) & \text{if } l_i = l_j \\ p_0\left(\chi^2(H_i, H_j)\right) Pb_{ij} & \text{otherwise} \end{cases} \qquad (3)$$
where $p_0 = 1 - p_1$ and $Pb_{ij}$ is the probability of boundary of the contour between superpixels i and j. Thus, $\psi_{ij}$ encourages label consistency in adjacent superpixels based on their likelihood ratio $\rho\left(\chi^2(H_i, H_j)\right)$ and the probability of boundary $Pb_{ij}$.
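A hedged sketch of Eqs. (2)–(3): the local model p_1 is the logistic of the log likelihood ratio learned from P_same and P_diff (Sect. 4.1); passing the learned densities as callables is an assumption of this illustration.

```python
import math

def p1(chi2_dist, p_same, p_diff, eps=1e-12):
    """Logistic of the log likelihood ratio: p1 = 1 / (1 + rho^-1) with rho = Psame / Pdiff."""
    rho = p_same(chi2_dist) / max(p_diff(chi2_dist), eps)
    return 1.0 / (1.0 + 1.0 / max(rho, eps))

def psi_unary(chi2_to_segment, p_same, p_diff):
    """Eq. (2): negative log of p1 evaluated on the superpixel-to-segment distance."""
    return -math.log(max(p1(chi2_to_segment, p_same, p_diff), 1e-12))

def psi_pairwise(chi2_ij, pb_ij, same_label, p_same, p_diff):
    """Eq. (3): data-dependent interaction using p1, p0 = 1 - p1 and the boundary probability."""
    q = p1(chi2_ij, p_same, p_diff)
    value = q * (1.0 - pb_ij) if same_label else (1.0 - q) * pb_ij
    return -math.log(max(value, 1e-12))
```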
2.3 High Order Potential
The high order potential $\psi_c$ assesses the goodness of each segment $L_c$ of the current image partition $L = \{L_c\}_{c \in S}$. Its role is to promote labellings that are more similar to learned familiar configurations. These are described by the learned discriminative conditional distribution $p(L_c|Y,\theta_h)$ (see Sect. 4), so that:
$$\psi_c(L_c|Y,\theta_h) = -\log p(L_c|Y,\theta_h). \qquad (4)$$
3 Feature Set Description
For an input image, we compute a set of multi-cue features which has been demonstrated to be pertinent for region discrimination. It consists of color channels and 4 texture features, which are given by the second order matrix and the TV flow based local scale feature [20]. Regions are thus described by the color property, magnitude, orientation and local scale of their textures. In order to facilitate the segmentation process, a coupled edge preserving smoothing process is applied on this set of features. This so-called nonlinear diffusion overcomes the outliers in the data, closes structures and synchronizes all feature channels. Finally, each superpixel is described by a 16-bin normalized histogram for each feature. The grouping by similarity is based on comparing these concise and discriminative descriptors with the chi-square difference operator $\chi^2(H_i, H_j) = \frac{1}{2} \sum_p \frac{(H_i[p] - H_j[p])^2}{H_i[p] + H_j[p]}$. Furthermore, we compute the probability of boundary Pb
provided by a separate boundary classifier [21] using local brightness, texture and color gradient cues.
4 Learning the High Order CRF Potentials
In this section, we focus on the learning of the likelihood ratio $\rho = \frac{P_{same}}{P_{diff}}$ and the discriminative conditional distribution $p(L_c|Y,\theta_h)$ incorporated in our high order CRF energy Eq. 1.
Fig. 1. (a) Binary classification results of superpixels: (Ω_same, Ω_diff) are the sets of respectively 1 and 0 labelled pixels. (b) The learned discriminative probabilities P_same and P_diff and their ratio.
4.1 Learning the Likelihood Ratio
We take a supervised training approach on the BSDB ground truth [21] to learn both P_same and P_diff. Let Ω denote the set of all pairs of adjacent superpixels. We apply a two-class classification on Ω to discriminate same-segment pairs from different-segment pairs, so that we obtain Ω = Ω_same ∪ Ω_diff (Fig. 1a). In fact, a pair of adjacent superpixels is in Ω_same if all human subjects declare them to lie in the same segment; otherwise it is in Ω_diff. In the next step, we compute for all pairs in each subset the χ² distances between the histograms of their constituent superpixels. Then, we collect P_same and P_diff, the distributions of the χ² distances on respectively Ω_same and Ω_diff (Fig. 1b).
4.2 Learning the Segment Global Classifier
First, we extract all segments from the ground truth and from a set of random labellings of BSDB training images to collect respectively positive and negative examples. Then, each segment L_c is described by a set F_{L_c} of the following global features:
– Its occupancy rates in the image, given by its normalized area |L_c| and perimeter |∂L_c|.
– The entropy of curvilinear continuity of its boundary ∂L_c. This mid-level cue is measured by collecting on ∂L_c the continuity strength between its fragments. Each fragment belongs to a superpixel. The continuity strength between two adjacent fragments depends on their tangential angle at their common endpoint (Fig. 2). The larger this angle, the less smooth the segment boundary. Curvilinear continuity is as powerful a cue as color, texture and edge cues [17].
– Two similarity measures, according to color and texture cues, inside the segment L_c and along its boundary ∂L_c (Fig. 2), which are given respectively by:
$$S_{intra} = \exp\left( \frac{-\left( n_{L_c}^{-1} \sum_{q \in L_c} \chi^2(H_q, H_{L_c}) - \mu_{intra} \right)^2}{2\sigma_{intra}^2} \right), \qquad (5)$$
$$S_{inter} = \exp\left( \frac{-\left( n_{\partial L_c}^{-1} \sum_{q \in \partial L_c} \chi^2(H_q, H_{L_{c'}}) - \mu_{inter} \right)^2}{2\sigma_{inter}^2} \right), \qquad (6)$$
where $n_{L_c}$ and $n_{\partial L_c}$ are the numbers of superpixels respectively inside L_c and along ∂L_c. $L_{c'}$ is an adjacent segment to L_c with respect to a superpixel q ∈ ∂L_c. We assume that these χ² distances are random variables with means $(\mu_{intra}, \mu_{inter})$ and variances $(\sigma_{intra}^2, \sigma_{inter}^2)$ learned in advance. $S_{intra}$ evaluates the similarity of superpixels in a segment, while $S_{inter}$ estimates the dissimilarity of superpixels in different segments.
– The average $Pb_{\partial L_c}$ of the probabilities of boundary measured along the boundary ∂L_c.
Fig. 2. The intra-segment similarity compares the descriptor of a superpixel q to the segment L_c containing it. The inter-segment similarity compares the descriptor of a superpixel q on the boundary of L_c to the adjacent segment L_{c'}. Curvilinear continuity of L_c is measured by the tangent changes at superpixel endpoints along the boundary ∂L_c.
In the next step, we learn a two-class classifier using the Joint Boosting algorithm [22], which explicitly learns to share features across classes. This algorithm iteratively builds a strong classifier as a sum of weak classifiers, $C(F_{L_c}, l_c) = \sum_{m=1}^{M} c_m(F_{L_c}, l_c)$, $l_c \in \{1, 2\}$, and simultaneously performs a feature selection. Each weak classifier is a regression stump based on a thresholded feature response and thus takes the form:
$$c_m(F_{L_c}, l_c) = \begin{cases} a\,\mathbf{1}\left(F_{L_c}^k > \mu\right) + b & \text{if } l_c \in S \\ k_{l_c} & \text{otherwise} \end{cases} \qquad (7)$$
with parameters $\{a, b, \mu, k_{l_c \notin S}, S\}$. These parameters are estimated so as to minimize a weighted square error. By using a logistic function, the Joint Boosting classifier finally approximates the probability of a segment L_c to be a good segment as:
$$p(L_c|Y,\theta_h) = \frac{1}{1 + \exp\left(-2\,C(F_{L_c}, 2)\right)}. \qquad (8)$$
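An illustrative sketch of the shared regression stump of Eq. (7) and the logistic transform of Eq. (8); representing the fitted stump parameters as dictionaries is an assumption of this sketch.

```python
import math

def weak_classifier(f_k, label, a, b, mu, k_const, shared_set):
    """Eq. (7): shared regression stump on the thresholded feature response f_k."""
    return a * (1.0 if f_k > mu else 0.0) + b if label in shared_set else k_const

def segment_probability(stumps, features, good_label=2):
    """Eq. (8): logistic transform of the strong classifier C(F_Lc, lc = 2)."""
    C = sum(weak_classifier(features[s["feature"]], good_label, s["a"], s["b"],
                            s["mu"], s["k_const"], s["shared"]) for s in stumps)
    return 1.0 / (1.0 + math.exp(-2.0 * C))
```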
5 Inference by the Swendsen-Wang Cut Algorithm
Grouping superpixels into coherent segments is accomplished by a stochastic Markov Chain Monte Carlo (MCMC) mechanism. The posterior distribution p(L|Y,θ) over the labellings of the higher order conditional random field Eq. 1 is optimized by a cluster sampling method, the Swendsen-Wang Cut (SWC) algorithm [8]. The optimal image partition $\hat{L}$ is inferred by simulating p(L|Y,θ) via a reversible jump Markov chain. The SWC method follows the perspective of the Metropolis-Hastings mechanism and designs a reversible Markov chain by splitting and merging moves. Besides, it performs large sampling moves between different graph configurations, so that it provides fast simulation and optimization. Let L and L' be two states which differ in the labelling of a connected component R. The SWC method iterates three steps:
– A data-driven clustering step: A band probability $q_{ij}$ is defined for each link e = ⟨i, j⟩ ∈ E. It determines how likely it is that two vertices i and j should be grouped together. In fact, it should encourage the merge of similar adjacent superpixels. We evaluate the similarity based on the χ² distance between two histograms. The band probability is thus designed as:
$$q_{ij} = \begin{cases} \exp\left( -\frac{\chi^2(H_i, H_j)}{\sigma_b} \right) & \text{if } j \in N_i \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$
– A label flipping step: A transition probability q(l|R, L, Y) defines how likely it is for a connected component R to be merged with its adjacent connected component R_l. Based on the learned local discriminative model p_1, a label l is assigned to a connected component R with a transition probability:
$$q(l|R, L, Y) = \begin{cases} 10\, p_1\left(\chi^2(H_R, H_{R_l})\right) & \text{if } (R_l, R) \text{ are neighbors} \\ p_1\left(\chi^2(H_R, H_{R_l})\right) & \text{otherwise} \end{cases} \qquad (10)$$
– An acceptance step: An acceptance probability is defined for the proposed labelling as:
$$\alpha(L \to L') = \min\left\{ 1,\ \frac{\prod_{e \in C_{l'}} (1 - q_{ij})\; q(l|R, L', Y)\; p(L'|Y,\theta)}{\prod_{e \in C_l} (1 - q_{ij})\; q(l'|R, L, Y)\; p(L|Y,\theta)} \right\}, \qquad (11)$$
where $C_l$ (resp. $C_{l'}$) is the set of edges between the connected components R and $R_l \setminus R$ (resp. $R_{l'} \setminus R$). $\alpha(L \to L')$ can be computed directly given p(L|Y,θ) and the data-driven proposal probabilities.
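A minimal sketch of evaluating the acceptance probability of Eq. (11) in log-space from precomputed quantities; the pairing of the cut sets with numerator and denominator follows the reconstruction above and is an assumption of this illustration (bond probabilities are assumed strictly below 1).

```python
import math

def acceptance_probability(cut_q_l, cut_q_l_prime, q_reverse, q_forward,
                           energy_current, energy_proposed):
    """cut_q_*: bond probabilities q_ij on the cut sets C_l and C_l'; energies are E(L), E(L')."""
    log_ratio = (sum(math.log(1.0 - q) for q in cut_q_l_prime)      # product over C_l' (numerator)
                 - sum(math.log(1.0 - q) for q in cut_q_l)          # product over C_l (denominator)
                 + math.log(q_reverse) - math.log(q_forward)        # q(l|R,L',Y) / q(l'|R,L,Y)
                 + (energy_current - energy_proposed))              # p(L'|Y,theta) / p(L|Y,theta)
    return min(1.0, math.exp(log_ratio))
```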
6 Experimental Results
The parameters involved in our experiments are related to the phases of visual content description and learning. The number of iterations of nonlinear diffusion
Fig. 3. Sample results: (a) The original images. (b) The ground truth segmentations. (c, d) The segmentation results obtained respectively by pairwise and high order CRF models.
is 200. Each image is over-segmented into an average of 250 superpixels in a preprocessing stage. The global classifier of segments is learned accurately based on 480 weak classifiers. Finally, we fix the weighting parameters to {α = 1.0, β = 0.8, γ = 0.3}. The unary potential's weight α adjusts the contribution of the local classifier which assigns superpixels to segments. The pairwise potential's weight β avoids over-segmentation. However, the high order potential's weight γ affects the regularity of a labelling. Figure 3 shows some results of image segmentation for both pairwise and high order CRF models on natural images of the BSDB database. We quantitatively compare their performance by computing recall (R), precision (P) and F-measure (F), as illustrated in Table 1. These qualitative and quantitative results demonstrate that the integration of higher order potentials improves segmentation of natural images. Improvements are due to additional constraints imposed on the solution. Such constraints guide the segmentation process to labellings which are more similar to familiar configurations.
Table 1. Quantitative evaluation of segmentation results shown in Fig. 3
Image | Pairwise CRF: R / P / F | High Order CRF: R / P / F
42049  | 0.779627 / 0.736118 / 0.757248 | 0.833333 / 0.799136 / 0.815877
38092  | 0.556514 / 0.590924 / 0.573203 | 0.642670 / 0.666227 / 0.654236
296059 | 0.595551 / 0.787106 / 0.678059 | 0.648694 / 0.814833 / 0.722334
145086 | 0.633333 / 0.716981 / 0.672566 | 0.780185 / 0.700114 / 0.737984
241004 | 0.786585 / 0.800950 / 0.793702 | 0.815780 / 0.811907 / 0.813839
3096   | 0.469001 / 0.865466 / 0.608340 | 0.678938 / 0.840042 / 0.750947
41069  | 0.519126 / 0.693163 / 0.593652 | 0.551336 / 0.721030 / 0.624867
197017 | 0.497972 / 0.533889 / 0.515305 | 0.623745 / 0.523550 / 0.569272
87046  | 0.498286 / 0.420873 / 0.456319 | 0.490481 / 0.497892 / 0.494159
7 Conclusion
In this paper, we proposed a high order conditional random field model for natural image segmentation which embeds the Gestalt laws of similarity, proximity, curvilinear continuity and familiar configuration. Our experiments showed that augmenting the pairwise CRF with higher order potentials defined on segments improves the segmentation results. In the future, we would like to investigate the use of more informative features to make the high order CRF potentials more discriminative and so obtain even better performance.
References
1. Zhu, S.C., Yuille, A.: Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 18, 884–900 (1996)
2. Chen, Y., Tagare, H., Thiruvenkadam, S., Huang, F., Wilson, D., Gopinath, K., Briggs, R., Geiser, E.: Using prior shapes in geometric active contours in a variational framework. Int. J. Comput. Vision 50, 315–328 (2002)
3. Kato, Z., Pong, T.C., Lee, J.C.M.: Color image segmentation and parameter estimation in a markovian framework. Pattern Recogn. Lett. 22, 309–321 (2001)
4. Bertelli, L., Sumengen, B., Manjunath, B., Gibou, F.: A variational framework for multiregion pairwise-similarity-based image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 30, 1400–1414 (2008)
5. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape. Int. J. Comput. Vision 72, 195–215 (2007)
6. Tu, Z., Zhu, S.C.: Image segmentation by data-driven markov chain monte carlo. IEEE Trans. Pattern Anal. Mach. Intell. 24, 657–673 (2002)
7. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23, 1222–1239 (2001)
8. Barbu, A., Zhu, S.C.: Generalizing swendsen-wang to sampling arbitrary posterior probabilities. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1239–1253 (2005)
9. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22, 888–905 (2000)
10. Kim, T.H., Lee, K.M., Lee, S.U.: Learning full pairwise affinities for spectral segmentation. In: IEEE CVPR, pp. 2101–2108 (2010)
11. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML, vol. 18, pp. 282–289 (2001)
12. Kumar, S., Hebert, M.: Discriminative random fields. Int. J. Comput. Vision 68(2), 179–201 (2006)
13. Ren, X., Fowlkes, C.C., Malik, J.: Figure/ground assignment in natural images. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3952, pp. 614–627. Springer, Heidelberg (2006)
14. He, X., Zemel, R.S., Carreira-Perpinan, M.A.: Multiscale conditional random fields for image labeling (2004)
15. He, X., Zemel, R.S., Ray, D.: Learning and incorporating top-down cues in image segmentation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 338–351. Springer, Heidelberg (2006)
16. Shotton, J., Winn, J.M., Rother, C., Criminisi, A.: Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. Int. J. Comput. Vision 81, 2–23 (2009)
17. Ren, X., Fowlkes, C., Malik, J.: Learning probabilistic models for contour completion in natural images. Int. J. Comput. Vision 77, 47–63 (2008)
18. Ren, X., Malik, J.: Learning a classification model for segmentation. In: IEEE ICCV, vol. 2, pp. 10–18 (2003)
19. Kohli, P., Ladicky, L., Torr, P.H.: Robust higher order potentials for enforcing label consistency. Int. J. Comput. Vision 82, 302–324 (2009)
20. Brox, T., Weickert, J.: A TV flow based local scale estimate and its application to texture discrimination. J. of Visual Communication and Image Representation 17, 1053–1073 (2006)
21. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color and texture cues. IEEE Trans. Pattern Anal. Mach. Intell. 26, 530–549 (2004)
22. Torralba, A., Murphy, K.P., Freeman, W.T.: Sharing visual features for multiclass and multiview object detection. IEEE Trans. Pattern Anal. Mach. Intell. 29, 854–869 (2007)
Higher Order Markov Networks for Model Estimation Toufiq Parag and Ahmed Elgammal Dept of Computer Science, Rutgers University, Piscataway, NJ 08854 {tparag,elgammal}@cs.rutgers.edu
Abstract. The problem we address in this paper is to label datapoints when the information about them is provided primarily in terms of their subsets or groups. The knowledge we have for a group is a numerical weight or likelihood value for the group members to belong to the same class. These likelihood values are computed given a class-specific model, either explicit or implicit, of the pattern we wish to learn. By defining a Conditional Random Field (CRF) over the labels of the data, we formulate the problem as a Markov Network inference problem. We present experimental results for analytical model estimation and object localization where the proposed method produces improved performance over other methods.
1 Introduction
The objective of this paper is to label datapoints into two classes when the information about the datapoints is available in terms of small groups¹ of data. This information encodes the likelihood that all the members of the corresponding group fall into the same class. These likelihood measures are generated utilizing a (local or global) model, either analytical or implicit, of the entity or pattern we are trying to decide about. Given the subsets and their corresponding likelihood measures, we wish to find the labels of the data having maximal agreement with these measures. It should be noted here that we are interested in scenarios where higher order knowledge about data subsets of sizes k > 2 is the primary source of information for labeling; we may or may not have any information about individual data samples or pairs of samples. Furthermore, we may have data subsets of different sizes, i.e., k = 3, 4, . . . , each with a corresponding likelihood measure, from which to decide the individual labels. There are many examples of such problems in the machine learning and vision literature. In part-based object recognition, we may learn an implicit statistical model among groups of detected parts to decide which of the detected parts actually belong to the object we are trying to recognize.
¹ Group simply implies a collection of datapoints; the mathematical definition of a group is not used in this paper. Similarly, likelihood simply implies a numerical weight.
It is well known that larger groups of parts capture more geometric information than pairs of parts. This is a necessary step in recognition because object part-detectors often generate many false alarms. In website classification problems, a subset comprising more than two words could be more informative about the category than one comprising one or two words. Subset-wise information is also utilized in model estimation algorithms. The problem of model estimation is to determine the parameters of a model given a set of (noisy) data and an analytic form of the model. A standard algorithm for this sort of problem, namely RANSAC [1], randomly samples small subsets of data, computes the target model parameters and calculates an error value for each of the subsets. RANSAC and its variants [2] are used for fundamental matrix computation, affine motion estimation, etc., in the computer vision literature [3]. We develop an algorithm that can solve any model estimation problem that algorithms of the RANSAC family can handle. We propose a novel approach to solve this labeling problem given higher order information in a principled fashion. Each datapoint is associated with a binary label variable. These label variables are modeled by a probabilistic graphical model, namely a Conditional Random Field (CRF) [4], with appropriate neighborhood system and clique definitions. We propose a (family of) potential functions for each clique which increase with the number of mismatches among the labels of the clique members. The choice of our potential function plays the central role in deciding the labels of the data samples. We discuss the tractability of inference for the resulting CRF and suggest different techniques for different instances of the potential function. This CRF formulation is able to cope with the situation where likelihood values are available for subsets of different sizes. Higher order information has recently been incorporated in random field models primarily as a smoothness factor. Several papers suggested the use of higher order information in Markov Networks for various applications [5][6][7][8]. There is a fundamental difference between our work and those exploiting higher order information in probabilistic graphical models. Almost all the previous CRF studies utilize pointwise information as the 'driving force' for labeling datapoints; higher order information is exploited for enforcing smoothness or regularity/continuity among labels. As a result, these algorithms cannot handle situations where we only have higher order information about the dataset. Furthermore, as we describe with a motivating example later, there exist scenarios where modeling higher order information by a (non-discriminative) smoothness function is inadequate to produce sufficient information for labeling. We propose a new method for processing discriminative information from higher order interactions using probabilistic graphical models. Our discoveries shed new light on how to utilize higher order information in a manner different from past studies (i.e., not just as regularity preserving terms). The proposed technique enables us to address new types of problems, such as analytical model estimation, where only higher order information about the data is available. We devise the required clique function for this purpose and also discuss some of its properties to understand which inference method is
necessary. We also show how object localization can be formulated as a higher order labeling problem and demonstrate that the proposed method performs better than a higher order MRF model in this scenario. These findings reveal new insights for grouping (or clustering) data with higher order information. Modeling higher order information as non-discriminative continuity preserving factors has been shown to improve the results obtained by pairwise MRF/CRFs [6][8]. However, this type of modeling constrains the use of Markov Networks to certain domains of application. A toy labeling problem described in Section 2.2 shows how continuity preserving factors defined over groupwise terms fail to extract sufficient information for labeling even though the groupwise knowledge is clearly discriminative. In this paper, we investigate the situations where subset-wise information in the form of smoothness functions fails to produce a meaningful labeling and suggest a new type of clique function for these problems. The paper is organized as follows: Section 2, and the subsections within, depict the overall CRF framework for labeling and the clique function required for our task, and show a toy example illustrating its necessity. The next section (Section 3) discusses the properties of the clique function in connection to inference. We report the experimental setup and results in Section 4 before concluding in Section 5.
2 Markov Network Formulation
Suppose there are n datapoints V = {v_i | 1 ≤ i ≤ n} that we wish to separate into two categories A and B. We wish to label these samples using binary variables {x_i ∈ {0, 1} | 1 ≤ i ≤ n}, where x_i = 1 implies v_i ∈ A and x_i = 0 implies v_i ∈ B. Throughout the paper, we use the term 'group' to imply (interchangeably) a subset {v_i1, . . . , v_ik} of size k, where 1 ≤ i_l ≤ n and 1 ≤ l ≤ k. The likelihood that all members of {v_i1, . . . , v_ik} belong to A is denoted by λ_1(v_i1, . . . , v_ik) and that they belong to B by λ_0(v_i1, . . . , v_ik). The term 'weight' is used interchangeably with these likelihood measures in this paper. As already stated, the objective of this work is to propose an algorithm to label the datapoints v_i, 1 ≤ i ≤ n, based mainly on the information provided in terms of small groups of them. We may or may not have some information about individual samples v_i or pairs of samples, but the focus is on how to utilize the knowledge we have for the subsets. To this end, the input to the algorithm is a set of small groups or subsets of data samples along with their likelihood values. Let V^k be the set of all groups {v_i1, . . . , v_ik} that satisfy a certain condition². To establish a neighborhood N_i^k for any datapoint v_i, we define v_j to be a neighbor of v_i if both v_i and v_j are members of some group V^k ∈ V^k:

    N_i^k = { v_j | j ≠ i and ∃ V^k : {v_i, v_j} ⊆ V^k }.    (1)
For inputs with groups of different sizes k_1, k_2, . . ., we can similarly define a neighborhood system for each subset size.
² For example, a condition check retains only groups with weights larger (or errors less) than a threshold δ^k.
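A minimal sketch of the neighborhood construction of Eq. 1, given the retained groups as tuples of datapoint indices (function and variable names are illustrative):

```python
from collections import defaultdict

def build_neighborhoods(groups):
    # groups: iterable of k-tuples of datapoint indices that passed the
    # threshold check; returns N_i^k as a dict {i: set of neighbor indices}.
    neighbors = defaultdict(set)
    for group in groups:
        for i in group:
            for j in group:
                if j != i:
                    neighbors[i].add(j)
    return neighbors

# Two triplets sharing datapoint 1:
print(build_neighborhoods([(1, 2, 3), (1, 4, 5)])[1])  # {2, 3, 4, 5}
```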
Now, we can define a Markov Network over the label variables x_i, assuming the Markov property that the value of x_i depends on x_j only if v_j is a neighbor of v_i [4]. Any subset V^k ∈ V^k defines a clique of size k in the neighborhood system. Also, let X^k denote the labels of the members of V^k, i.e., X^k = {x_i1, . . . , x_ik}, and let X denote the labels of all n datapoints {x_1, . . . , x_n}. We assume a data-dependent network, namely a Conditional Random Field (CRF), over X. It is well known that the probability of any assignment p(X | V) depends on what is known as the Gibbs energy function E(X | V) [4]. The energy function E(X) is the summation of potential functions defined on cliques. Let E^k(X^k | V^k) denote an appropriately defined clique potential function for the clique V^k = {v_i1, . . . , v_ik} of size k. With cliques of different sizes, the Gibbs energy function equals the sum of the energy functions over all clique sizes [4]:

    E(X | V) = Σ_{k=1}^{K} Σ_{V^k ∈ V^k} E^k(X^k | V^k).    (2)
The optimal assignment X = {x_1, . . . , x_n} should minimize this Gibbs energy function. The next two sections describe the form of the potential function E^k(X^k | V^k) proposed in this paper to solve the specific problems using higher order information, and attempt to convince the reader of its necessity by a motivating toy example.
2.1 Proposed Clique Function
In our present scenario, we have weights associated with each data group that imply the likelihood that all members of the group belong to either of the two categories. We need a clique potential function whose value increases (or, to be precise, does not decrease) with the number of disagreements among the values of x_il, 1 ≤ l ≤ k, within a clique. This characteristic of a clique potential function encourages all the datapoints to acquire the same label when we try to minimize the Gibbs energy function in Equation 2. Let η_c be³ the number of variables within the subset X^k labeled as class c. Furthermore, denote by λ_1(V^k) and λ_0(V^k) the likelihood measures that all v_il ∈ V^k belong to A and B respectively. We define a clique potential function E^k(·) for clique V^k as a linear combination of functions of η_0 and η_1:

    E^k(X^k | V^k) = β_1 λ_1(V^k) g_1(k − η_1) + β_0 λ_0(V^k) g_0(k − η_0).    (3)
In this definition, β_1 and β_0 are two nonnegative parameters. The two functions g_1(·) and g_0(·) quantify how sensitive the clique function is to the number of members labeled as the opposite class. Ideally they should be monotonically non-decreasing, so that the more variables are labeled as the opposite class 1 − c, the higher the penalty g_c(·) becomes. With a high likelihood λ_1(V^k) for category A relative to that for B, the clique potential would be prone to increase η_1 and tolerate the small penalty λ_0(V^k) g_0(k − η_0), since λ_0(V^k) is small. In cases where the information λ_0(V^k) is not available, we assume that the λ_1(V^k) are normalized to [0, 1] and set λ_0(V^k) = 1 − λ_1(V^k).
³ The precise notation should be η_c(X^k); we omit X^k for simplicity.
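A minimal sketch of the clique potential of Eq. 3, instantiated with the linear g_c of Section 3.1 (the β weights and the l_c, h_c endpoints shown are illustrative defaults):

```python
def g_linear(k, eta_c, l_c=0.0, h_c=1.0):
    # Linear penalty (Eq. 5): grows from l_c to h_c with k - eta_c.
    return l_c + (h_c - l_c) * (k - eta_c) / k

def clique_potential(labels, lam1, lam0, beta1=1.0, beta0=1.0):
    # Eq. 3: penalize label disagreement inside a clique, weighted by the
    # groupwise likelihoods lam1 (class A / label 1) and lam0 (class B / label 0).
    k = len(labels)
    eta1 = sum(labels)      # members labeled 1
    eta0 = k - eta1         # members labeled 0
    return beta1 * lam1 * g_linear(k, eta1) + beta0 * lam0 * g_linear(k, eta0)

# A confident "all in A" triplet agrees better with (1, 1, 1) than with (1, 1, 0):
print(clique_potential((1, 1, 1), lam1=0.9, lam0=0.1))  # 0.1
print(clique_potential((1, 1, 0), lam1=0.9, lam0=0.1))  # approx. 0.3667
```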
Table 1. Pointwise costs for the toy example

Cost      v1    v2    v3    v4     v5
E^1(1)    0.1   0.1   0.1   0.45   0.45
E^1(0)    0.9   0.9   0.9   0.55   0.55

Table 2. Groupwise weights for the toy example

Weight    {v1, v2, v3}   {v1, v4, v5}   {v2, v4, v5}
λ1        0.9            0.2            0.2
λ0        0.1            0.8            0.8

2.2 Motivation for a New Clique Function
Now, we wish to handle situations where the higher order weights give us decisive information (e.g., geometry, conformity to a model), while pointwise weights are either absent or unable to offer a complete picture of the situation. Let us first examine whether or not the clique functions of existing works can solve the problems that we are discussing. The authors of [9] proposed a robust higher order submodular potential that was shown to produce excellent results for image segmentation using higher order cliques [6]. Recall that η_0 and η_1 denote the numbers of datapoints in any subset V^k that take labels 0 and 1, respectively. The higher order clique function defined in [9] is the truncated minimum of a non-decreasing concave function F_c(·) of k − η_c, where c ∈ {0, 1} in our context:

    E^k_smooth(X^k | V^k) = min{ min_{c ∈ {0,1}} F_c(k − η_c), γ_max }.    (4)
However, the paper [9] later concentrates on a simpler form of F_c where it varies linearly between [γ_c, γ_max], with c ∈ {0, 1}. In what follows, we provide an example scenario where this potential cannot generate a non-trivial labeling. Toy Example: Let us assume a toy dataset with 5 samples that we wish to label as 1 or 0 (also class A or B). The energy to be minimized comprises only pointwise and triplet-wise costs. The costs of assigning any individual sample to a specific class are listed in Table 1. The weights of three subsets of size k = 3 (as defined in the first paragraph of Section 2) are given in Table 2. These weights will be utilized to compute the cost defined in Equation 4 (which is different from the cost suggested by the proposed clique function defined in Section 2.1). The pointwise costs in Table 1 indicate that the first three samples v1, v2 and v3 should be labeled as category A, since their E^1(1) costs are much lower than E^1(0); however, the E^1(·) costs are not decisive for the labels of v4 and v5. Table 2, on the other hand, tells us that v1, v2 and v3 should be put in category A since the weight λ1 is much larger than λ0, but the datapoints v4 and v5 should not be labeled the same as v1 and v2 since the λ1 weights for those triples are much lower than the corresponding λ0. Observe, first of all, that without the pointwise penalties, the clique function defined in Equation 4 will always produce a trivial labeling, i.e., it will assign all datapoints to c* = arg min_{c ∈ {0,1}} F_c. For the linear form of F_c, the label will be the one corresponding to the minimum γ_c. The min operator over F_c plays the pivotal role in deciding the labels in this case; therefore, the result would not change even if we used a non-linear F_c.
The clique function E^k_smooth of [9] generates a trivial solution even when the pointwise costs are incorporated. Let us use the linear form⁴ F_c = γ_c + ((γ_max − γ_c)/k)(k − η_c) with parameters γ_c = 0 and γ_max = λ_1. The resulting labels will be all ones: 1, 1, 1, 1, 1. Due to the min operators, the result does not change if we use a non-linear F_c, a constant γ_max = 1, or γ_max = λ_0 for F_0 and γ_max = λ_1 for F_1. Changing the parameter values would not produce a non-trivial solution with this clique function either. If we apply the proposed clique function E^k with g_c(k − η_c) = (k − η_c)/k to the toy example, the resulting labels are 1, 1, 1, 0, 0 for the 5 datapoints (with or without the pointwise penalties). This labeling perfectly conforms with the pointwise costs and triple likelihoods that we have in Tables 1 and 2. An additive combination of the two functions g_1 and g_0 avoids trivial solutions for the problem we are dealing with. We will show examples of practical scenarios where the proposed clique is advantageous over the smoothness clique functions of [6,9] in the results section (Section 4.2).
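Under the stated choice g_c(k − η_c) = (k − η_c)/k (and, purely for illustration, β_1 = β_0 = 1), a brute-force check over all 32 labelings of the toy example reproduces the reported result:

```python
from itertools import product

E1 = {1: [0.1, 0.1, 0.1, 0.45, 0.45],   # pointwise cost of label 1 (Table 1)
      0: [0.9, 0.9, 0.9, 0.55, 0.55]}   # pointwise cost of label 0
triples = [((0, 1, 2), 0.9, 0.1),       # ({v1,v2,v3}, lambda1, lambda0), Table 2
           ((0, 3, 4), 0.2, 0.8),
           ((1, 3, 4), 0.2, 0.8)]

def energy(x):
    e = sum(E1[x[i]][i] for i in range(5))             # pointwise terms
    for members, lam1, lam0 in triples:                # proposed triplet terms, Eq. 3
        k = len(members)
        eta1 = sum(x[i] for i in members)
        e += lam1 * (k - eta1) / k + lam0 * eta1 / k   # g_c(k - eta_c) = (k - eta_c)/k
    return e

print(min(product([0, 1], repeat=5), key=energy))      # (1, 1, 1, 0, 0)
```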
3 MRF Inference
The inference algorithm to be used to minimize the energy function E(X) depends on the choice of the g_1(·) and g_0(·) functions. The following two subsections describe two algorithms that can solve the inference problem. These optimization methods are well known in the Markov Network literature.
3.1 Special Case: Linear g_c(·)
The linear form of g_c(·), c ∈ {0, 1}, increases from l_c to h_c in proportion to k − η_c:

    g_c(k − η_c) = l_c + ((h_c − l_c)/k)(k − η_c).    (5)
Intuitively, an increase of l_c or a decrease of h_c will lower the number of datapoints labeled as category c, since the former introduces a penalty for assigning label c to any sample in a subset and the latter reduces the penalty for assigning 1 − c. It is straightforward to see that the optimal value of x_i is simply the label c for which the accumulated penalty of the opposite class, i.e., the summation of the likelihoods of class 1 − c weighted by the slope of g_{1−c}(·), is minimum (the derivation is omitted due to space constraints):

    x_i* = arg min_{c ∈ {0,1}} Σ_{V^k ∈ V^k ∧ v_i ∈ V^k} β_{1−c} λ_{1−c}(V^k) (h_{1−c} − l_{1−c})/k.    (6)
This solution can be computed in O(n + |V^k|), with one pass over all the subset weights and another pass over all the datapoints. For multiple values of k, the corresponding weights for each tuple size are simply added. We used this simple inference algorithm in all our experiments; we do not need to apply other standard methods, like the max-flow min-cut algorithm, for this simple decision, although they are capable of solving it.
⁴ This is how the authors of [9] defined γ_max in [6].
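A minimal sketch of this two-pass linear-case inference: with linear g_c each clique's penalty decomposes over its members (assigning label c to a member incurs the opposite class's weighted likelihood), so every datapoint independently takes the label with the larger accumulated own-class score. The parameter defaults are illustrative:

```python
def infer_labels(n, groups, h=(1.0, 1.0), l=(0.0, 0.0), beta=(1.0, 1.0)):
    # groups: list of (members, lambda0, lambda1), members a tuple of indices.
    # First pass: accumulate, per datapoint and class c, the weighted
    # likelihood beta_c * lambda_c * (h_c - l_c) / k from every containing group.
    scores = [[0.0, 0.0] for _ in range(n)]
    for members, lam0, lam1 in groups:
        k = len(members)
        for c, lam in ((0, lam0), (1, lam1)):
            w = beta[c] * lam * (h[c] - l[c]) / k
            for i in members:
                scores[i][c] += w
    # Second pass: labeling i as c leaves only the opposite-class penalty,
    # so pick the class with the larger accumulated own-class score.
    return [1 if s[1] > s[0] else 0 for s in scores]

# Toy example of Section 2.2 (triplet weights only): yields [1, 1, 1, 0, 0].
toy = [((0, 1, 2), 0.1, 0.9), ((0, 3, 4), 0.8, 0.2), ((1, 3, 4), 0.8, 0.2)]
print(infer_labels(5, toy))
```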
3.2 General Case: Nonlinear g_c(·)
For nonlinear g_c(·), c ∈ {0, 1}, the likelihood weights cannot be distributed to the individual datapoints. Problems with nonlinear g_c(·) can be solved using a linear programming (LP) formulation suggested in [10][7]. A linear programming relaxation for MRF inference was first suggested by Schlesinger [11]. Since then, several algorithms have been proposed to solve this problem for pairwise and higher order random field models; see [12][13] and references therein to learn more about the various inference algorithms available in the literature. It is not yet clear how a general higher order clique function can be reduced efficiently to pairwise interactions (see [14,15]) in order to apply a graph-cut algorithm. Furthermore, there has been no study demonstrating that graph-cut algorithms can work exclusively with higher order information (i.e., without pointwise factors). Therefore, graph-cut algorithms may not be an option for the types of problems we are dealing with. Other algorithms, like sum-product belief propagation [16] or more efficient variants of it, can also be used for the proposed method. Finally, it is worth mentioning that, to theoretically analyze the tractability of inference, it is useful to investigate the submodularity [15] of the proposed clique function, especially which properties of the g_c functions make the clique function submodular. Submodularity of the proposed clique function guarantees the existence of polynomially computable solutions for inference. However, we will not analyze the potential function theoretically beyond this point in this paper, since we will be utilizing the linear form of g_c to demonstrate novel applications of graphical models in vision. In practice, Markov Network inference algorithms can generate satisfactory results with non-submodular functions as well.
4 Experiments and Results
4.1 Model Estimation
One important application of the proposed method is model estimation. Suppose that we have n datapoints v_i, i = 1, . . . , n, in some feature space, part of which are generated from some model that we wish to estimate. We sample T^k subsets of size k, fit the candidate model to these subsets and compute the model estimation errors. The value of k is kept larger than or equal to s, the minimum number of points required to fit a model (e.g. s = 2 to fit a line). The subsets producing an error less than a problem-specific threshold δ^k constitute the set V^k. The estimation error ε of any member V^k ∈ V^k is transformed into a likelihood measure by a suitable transformation, e.g., λ(V^k) = 1 − ε if we know 0 ≤ ε ≤ 1, or λ(V^k) = exp(−ε/σ). It is important to mention here that, in model estimation, only higher order information is available; we do not have any knowledge about how likely each individual sample is to follow a certain model. For linear penalty functions, the inference algorithm for the proposed clique potentials (Equation 6) breaks the higher order costs into pointwise costs.
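A minimal sketch of how the groupwise likelihoods could be generated for the synthetic line-fitting experiment that follows; the least-squares fit, the threshold δ^k and the λ = exp(−ε/σ) transform are illustrative choices consistent with the text (near-vertical lines are not handled by this simple parameterization):

```python
import random
import numpy as np

def sample_line_groups(points, k=3, T=2000, delta=0.5, sigma=0.25, seed=0):
    # points: (n, 2) array of 2-D samples. Sample T subsets of size k, fit a
    # line y = a*x + b to each subset, and convert the mean residual into a
    # groupwise likelihood lambda1 that all members lie on the model.
    rng = random.Random(seed)
    groups = []
    for _ in range(T):
        members = rng.sample(range(len(points)), k)
        x, y = points[members, 0], points[members, 1]
        a, b = np.polyfit(x, y, 1)                 # least-squares line fit
        err = float(np.mean(np.abs(y - (a * x + b))))
        if err < delta:                            # keep well-fitting subsets only
            lam1 = float(np.exp(-err / sigma))
            groups.append((tuple(members), 1.0 - lam1, lam1))
    return groups                                  # (members, lambda0, lambda1)
```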
Fig. 1. Performance on synthetic data with varying parameter values (displayed on top). [Four panels: the input data and three results obtained with (h1, l1, h0, l0) set to (1.20, 0.00, 1.00, 0.00), (1.00, 0.10, 1.00, 0.00) and (1.00, 0.05, 1.05, 0.00); the reported precision/recall values are 70.97/98.51, 75.00/89.55 and 78.05/95.52.]
Reducing the higher order costs into pointwise ones is not equivalent to using unary potentials only, because we do not have a means to compute such unary potentials directly. Synthetic data: We plotted a line submerged in 55% noise samples. For the proposed method, we sampled T^k = 2000 subsets of size k = 3 and used δ^k = 0.5, linear g_c(·), and [h1, l1, h0, l0] = [1.0, 0.2, 1.0, 0]. Figure 1 shows the input and output data with varying parameters and the corresponding precision-recall values. Fundamental matrix computation: Given two images of the same scene from different viewpoints, the objective is to find the point matches that conform with the camera model expressed by the fundamental matrix. Four pairs of images from the Corridor, Valbonne, Merton II and Library datasets of the standard Oxford database⁵ were used in this experiment. We selected the two images with the largest variation in viewpoint, usually the first and last images (see Table 3). For the proposed method, each match is considered a datapoint. We sampled 2500 subsets of size k = 8 (we know s = 7 in this case). The other parameter values are threshold δ^k = 1 and linear parameters [h1, l1, h0, l0] = [1.01, 0.02, 1.0, 0] for the clique function. For RANSAC, we sampled a subset of size 7 at most 2500 times and used δ_ransac = 0.001 as the distance threshold (which produces the best result). We also compare the performance with three other variants of RANSAC, namely MLESAC, MAPSAC and LO-RANSAC, which were shown to compute a higher number of inliers in the survey paper [2]. The maximum number of iterations was kept the same as that of RANSAC and the parameters of the original implementations were retained. Table 3 shows the image names (with the ratio of true inliers) and the result statistics, such as the fraction of missed inliers (ratio of missed examples over inliers only) and the ratio of false positive inliers (ratio of false positives over all matches), over 100 runs for all the methods. As we can see, in almost all the cases, the proposed labeling method generates lower false positive rates than all the variants of RANSAC, with similar or close miss rates. In model estimation, lower false positive rates are desirable even at the price of a small increase in the miss rate, since fewer impurities will influence the computation of the model.
⁵ http://www.robots.ox.ac.uk/~vgg/data/data-mview.html
Table 3. Performance comparison for model estimation. Each column head shows the name and indices of the images used and the ratio of correct over incorrect matches. Each row shows the mean and standard deviation of the missed inliers and false positives (FP) produced by the corresponding method.

Method            Corridor:{000, 010}  Valbonne:{000, 010}  Merton:{002, 003}  Library:{002, 003}
                  50/150               30/90                50/150             50/150
RANSAC     Miss   0.07 ± 0.06          0.13 ± 0.06          0.01 ± 0.02        0.02 ± 0.01
           FA     0.10 ± 0.03          0.13 ± 0.02          0.38 ± 0.01        0.25 ± 0.01
MLESAC     Miss   0.01 ± 0.03          0.13 ± 0.04          0.00 ± 0.00        0.01 ± 0.01
           FA     0.07 ± 0.02          0.11 ± 0.03          0.33 ± 0.00        0.17 ± 0.01
MAPSAC     Miss   0.04 ± 0.04          0.19 ± 0.03          0.00 ± 0.00        0.02 ± 0.01
           FA     0.05 ± 0.02          0.09 ± 0.01          0.30 ± 0.00        0.15 ± 0.01
LO-RANSAC  Miss   0.17 ± 0.07          0.19 ± 0.03          0.03 ± 0.03        0.01 ± 0.01
           FA     0.11 ± 0.04          0.10 ± 0.03          0.37 ± 0.04        0.22 ± 0.01
Proposed   Miss   0.09 ± 0.02          0.16 ± 0.02          0.14 ± 0.02        0.05 ± 0.01
           FA     0.06 ± 0.01          0.05 ± 0.01          0.25 ± 0.01        0.13 ± 0.01
Fig. 2. Worst case qualitative performances for detecting valid correspondences between two images, showing the largest miss+FP cases. Yellow (red) lines: matching point pairs missed (falsely detected) by the method. [Panels: proposed method, largest # miss = 12; MAPSAC, largest # miss = 22; proposed method, largest # FA = 13; MAPSAC, largest # FA = 20.]
RANSAC and its variants decide based on the performance of the most recently sampled V^k, replace the set of inliers if needed, and discard the estimation results from all previous subsets. The proposed method aggregates the results of all past model estimations in order to obtain a better decision about which datapoints in the set should be labeled inliers. In Figure 2, we show a qualitative comparison of worst cases. The results of the proposed method are compared with MAPSAC, chosen because of its superior performance w.r.t. the other variants, on the image pairs where each method produced the largest number of missed correct matches and false detections (miss+FP). This figure shows that, in the worst case, the proposed method misses fewer correct matches and generates fewer false alarms than MAPSAC. In our experiments, we observed the same pattern for all the images.
4.2 Object Localization
Part-based representation of objects has gained much attention in recent years in computer vision. Simple object part detectors produce many detections outside the object region. We apply the proposed grouping method to localize the object in an image by selecting the correct object parts among these detected parts. The candidate locations are spatially clustered Kadir-Brady salient
region centers [17]. Each candidate location is a datapoint for the problem, and the algorithm should detect which of these datapoints belong to the object. We use the Geometric Blur descriptor [18] for these datapoints to create a codebook or bag of features (in a reduced dimensional feature space obtained through PCA). We sample size-k subsets {v_i1, . . . , v_ik} of datapoints and represent them using parameters of the geometric relation present among the members of the subset. Let π be a permutation of i_1, . . . , i_k which orders v_π(l), 1 ≤ l ≤ k, in increasing x-coordinate; let ζ = [ζ_l]_{l=1}^k, where ζ_l is a code that represents v_π(l); and let g be a vector of geometric parameters for the relation present among the v_π(l). Now, we wish to derive a probability for g, π given the object model description Θ_g^o, Θ_a^o, corresponding to the geometric and appearance parameters of the object. The probability p(g, π | Θ_g^o, Θ_a^o) can be expressed as a joint probability of g and π occurring together given the model:

    p(g, π | Θ_g^o, Θ_a^o) = Σ_ζ p(g, π | ζ, Θ_g^o, Θ_a^o) p(ζ | Θ_a^o) = Σ_ζ p(g | ζ, Θ_g^o) p(π | ζ, Θ_a^o) p(ζ | Θ_a^o).
We assume independence between the geometric parameters g and the relative positions π given the ordered code representation ζ, among the π_l, l = 1, . . . , k, given ζ_l, and among the ζ_l themselves. Therefore, the above expression can be simplified as follows:

    p(π | ζ, Θ_a^o) = Π_{l=1}^k p(π_l | ζ_l, Θ_a^o)   and   p(ζ | Θ_a^o) = Π_{l=1}^k p(ζ_l | Θ_a^o).    (7)
The probability p(g | ζ, Θ_g^o) for the geometric parameters is estimated by Kernel Density Estimation (KDE) with a diagonal bandwidth matrix. The probability p(π_l | ζ_l, Θ_a^o) is modeled using the feature descriptors φ(v_πl) and φ(ζ_l) for v_πl and ζ_l respectively, i.e., p(π_l | ζ_l, Θ_a^o) ∼ N(φ(ζ_l), 0.2I), where I is the identity matrix. The priors p(ζ_l | Θ_a^o) on the codes are computed empirically by counting the number of times ζ_l appears on the object. An analogous model is learned for background subsets of parts. For each test image, subsets of size k are collected in the same way. The likelihood weight for the object class A is λ_1(V^k) = p(g, π | Θ_g^o, Θ_a^o) and that for the background class B is λ_0(V^k) = p(g, π | Θ_g^b, Θ_a^b), where Θ_g^b, Θ_a^b describe the background model. A group will have a large likelihood weight λ_1(V^k) if most of the codes representing the object parts have high prior probabilities and the geometric arrangement among them also has a high likelihood value given the object model. Given these weights, we run the proposed inference algorithm with linear g_c(·), c ∈ {0, 1}, to decide which parts v_i should have x_i = 1, i.e., which parts should belong to the object. The datasets used for this experiment are the Caltech Cars (rear, with 126 images), Motorbikes and Airplanes images. Each dataset is split into two subsets used for training and testing. The training datasets are used to learn the codebook and the object model probabilities. For both object and background parts, we identify the 3 nearest neighbors of each part and sample subsets of size k = 3 out of the part and its neighbors. The lengths of the sides of the triangle generated by a subset of size k = 3 (normalized by the perimeter) are used to encode the geometrical information among the members of the group, in addition to the internal angles.
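A minimal sketch of the geometric parameters used for the k = 3 subsets: the perimeter-normalized side lengths plus the internal angles of the triangle, after ordering the parts by x-coordinate as described above (degenerate, collinear configurations are not handled; names are illustrative):

```python
import math

def triangle_geometry(p1, p2, p3):
    # p1, p2, p3: 2-D part locations as (x, y) tuples.
    a, b, c = sorted([p1, p2, p3], key=lambda p: p[0])       # order by x-coordinate
    dist = lambda u, v: math.hypot(u[0] - v[0], u[1] - v[1])
    sides = [dist(a, b), dist(b, c), dist(c, a)]
    perimeter = sum(sides)
    norm_sides = [s / perimeter for s in sides]              # normalized side lengths
    angles = []
    for i in range(3):                                       # internal angles (law of cosines)
        s_opp, s1, s2 = sides[i], sides[(i + 1) % 3], sides[(i + 2) % 3]
        cos_a = max(-1.0, min(1.0, (s1**2 + s2**2 - s_opp**2) / (2 * s1 * s2)))
        angles.append(math.acos(cos_a))
    return norm_sides, angles
```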
Fig. 3. Object localization rates for cars (left), motorbikes (middle) and airplanes (right) images. [Each plot shows the average hit rate against the average false positives per image; legend entries: Code prior, Proposed-tri, [14].]
We compared the results with a naive approach where the empirical probability of the code representing each part is used to determine its label. That is, the label x_i of any v_i is set to 1 if p(ζ_i | Θ_a^o) > p(ζ_i | Θ_a^b), where ζ_i is the code representing v_i, and set to 0 otherwise. This method is referred to as the Code Prior procedure hereafter. Without an explicit model description, it is easy to see that RANSAC cannot be applied here. Furthermore, as we discussed before, the algorithms for higher order MRF inference in [9,5,6] cannot be applied to this object model without pointwise costs. We utilized the empirical code probability in order to employ the technique of [9] with the graph-cut method with swap moves in the present scenario with binary labels and only point-wise and triple-wise weights. Each location v_i is considered a node in the graph, and its point-wise cost is inversely proportional to the corresponding p(ζ_i | Θ_a^o) and p(ζ_i | Θ_a^b). The max weight for a 3-way clique, denoted by γ_max in [6,9], is set to Σ_ζ p(g | ζ, Θ_g^o) p(π | ζ, Θ_a^o) for the object class and Σ_ζ p(g | ζ, Θ_g^b) p(π | ζ, Θ_a^b) for the background class. This conforms with how the authors of [9] define γ_max to be proportional to a goodness measure of the corresponding clique in Equation 12 of [6]. The higher order terms of [9] are distributed to the members of the clique via an auxiliary variable; this is different from the manner in which we break up the higher order terms (compare Equation 6 of this article with Theorem 1 and Figure 3 of [9]). We used the parameters that produced the best results both for the proposed method and for the method in [9]. For quantitative comparison, we plot the average number of images with a hit against the fraction of false detections allowed per image. If at least 5 of the detected object parts fall within the object boundary of an image, we count it as one hit. The results are shown in Figure 3. It is quite natural that even using the code prior probability and no geometric information one can achieve good localization performance on the Motorbikes dataset, as there are many images in this dataset where the object bounding box itself occupies more than 95% of the image area. For the other two datasets, where the code prior probabilities are not sufficient to identify the object parts, the proposed method attains a higher hit rate with lower false positives than both the Code Prior approach and that of [9]. As discussed before, higher order information is not used for discrimination in [9]; it is rather used as a smoothness constraint. On the other hand, the proposed method uses the higher order information of a group for discrimination. We show some qualitative results in Figure 4.
Fig. 4. Qualitative object localization results: from top to bottom rows, the results for the Code Prior approach, the higher order MRF of [9], and the proposed triangle-based method, respectively. The circles (filled with white) indicate the initially detected (spatially clustered) locations. The red * indicate the object parts detected by the respective method.
Notice the zero detections of the method of [9] (middle row of Figure 4), just as we anticipated earlier (see Section 2.2).
5 Conclusion
This paper proposes a novel method for model estimation using probabilistic graphical models. Our method is able to compute both explicit (analytical) and implicit models defined over a whole object or over small parts. The problem of model estimation is formulated as a labeling problem with higher order information, where the higher order information is the conformity of small subsets of datapoints to the model to be estimated. The labeling problem is addressed by an inference algorithm of a probabilistic graphical model. We believe the discoveries in this paper will promote a new direction of research for model estimation. At the same time, this study will also motivate novel applications of Markov Networks for labeling (or clustering) with higher order information.
References
1. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 381–395 (1981)
2. Choi, S., Kim, L., Yu, W.: Performance evaluation of RANSAC family. In: BMVC (2009)
3. Workshop: 25 years of RANSAC, in conjunction with CVPR 2006, http://cmp.felk.cvut.cz/ransac-cvpr2006/
4. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)
5. Kohli, P., Kumar, M., Torr, P.: P3 and beyond: solving energies with higher order cliques. In: CVPR (2007)
6. Kohli, P., Ladicky, L., Torr, P.: Robust higher order potentials for enforcing label consistency. In: CVPR (2008)
7. Komodakis, N., Paragios, N.: Beyond pairwise energies: Efficient optimization for higher-order MRFs. In: CVPR (2009)
8. Lan, X., Roth, S., Huttenlocher, D., Black, M.J.: Efficient belief propagation with learned higher-order markov random fields. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part II. LNCS, vol. 3952, pp. 269–282. Springer, Heidelberg (2006)
9. Kohli, P., Ladicky, L., Torr, P.: Graph cuts for minimizing higher order potentials. Technical report, Oxford Brookes University (2008)
10. Cooper, M.C.: Minimization of locally defined submodular functions by optimal soft arc consistency. Constraints 13, 437–458 (2008)
11. Schlesinger, M.I.: Syntactic analysis of two-dimensional visual signals in the presence of noise. Cybernetics and Systems Analysis 12, 612–628 (1976)
12. Ishikawa, H.: Higher order clique reduction in binary graph cut. In: CVPR (2009)
13. Werner, T.: A linear programming approach to max-sum problem: A review. PAMI 29, 1165–1179 (2007)
14. Freedman, D., Drineas, P.: Energy minimization via graph cuts: Settling what is possible. In: CVPR (2005)
15. Živný, S., Jeavons, P.G.: Which submodular functions are expressible using binary submodular functions? Research Report RR-08-08, Oxford University Computing Lab, UK (2008)
16. Kschischang, F.R., Frey, B.J., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory 47, 498–519 (1998)
17. Kadir, T., Brady, M.: Saliency, scale and image description. IJCV 45, 83–105 (2001)
18. Berg, A., Malik, J.: Geometric blur for template matching. In: CVPR (2001)
Interactive Object Graphs for Debuggers with Improved Visualization, Inspection and Configuration Features Anthony Savidis1,2 and Nikos Koutsopoulos1 1
Foundation for Research and Technology – Hellas, Institute of Computer Science, Science and Technology Park of Crete, Heraklion, Crete, GR-71110 Greece {as,koutsop}@ics.forth.gr 2 University of Crete, Department of Computer Science
Abstract. Debugging as a process involves the examination of the runtime state of objects in order to identify potential defects and the way they are actually propagated among objects (infection). Interactive tools improved the overall conduct of the process by enabling users to track down state faults more efficiently and effectively. But as systems grow, the runtime state of programs explodes to encompass a huge number of objects. The latter requires state inspection following runtime object associations, thus involving graph views. Existing graph visualizers are not popular because they are mostly visualization rather than interaction oriented, implementing general-purpose graph drawing algorithms. The latter explains why prominent development environments still adopt traditional tree views. We introduce a debugging assistant with a visualization technique designed to better fit the task of defect detection in runtime object networks, also supporting advanced inspection and configuration features. Keywords: Debugger User-Interfaces, Interactive Object Graphs, Graph Visualization, Graph Navigation.
1 Introduction Debugging is the systematic process of detecting and fixing bugs within computer programs and can be summarized (Zeller, 2005) by the steps outlined in Fig. 1. The examination of the runtime state requires inspection of object contents, associations and dependencies.
Fig. 1. The overall structure of the debugging process. [Bug detection: 1. Reproduce the error and simplify the test case; 2. Examine the runtime state and locate the error. Bug fixing: 3. Make a repair plan and apply the required modifications; 4. Verify that the error has been eliminated.]
The latter is a difficult task, even for small-scale applications, where traditional graph visualizations proved to be rather ineffective. The remark is justified by the fact
that most popular commercial IDEs like Visual Studio (Microsoft) and IntelliJ IDEA (JetBrains) do not provide them, while open source IDEs like Eclipse and NetBeans have a couple of relevant third-party plug-ins which are rarely used. Still, during debugging, objects remain the primary information unit for gaining insights (Yi et al., 2008) on how errors infect the runtime state. Interestingly, while the notion of visualization generally receives positive attention, relevant implementations have failed to essentially improve the state examination process. We argue that this is due to the primary focus of present tools on visualization, adopting general-purpose rendering algorithms that are the outcome of graph drawing research while lacking other sophisticated interactive features. More specifically, general graph drawing algorithms aim to better support supervision and pattern matching tasks through the display of alternative clustering layouts. This can be valuable when visualizing static features like class hierarchies, function dependencies and recursive data structures. However, during the debugging process the emphasis is shifted towards runtime state analysis, involving primarily state exploration and comparison tasks and requiring detailed and advanced inspection facilities. We argue that the inability to systematically address this key issue explains the failure of object graph visualizers in the context of general-purpose debugging. As we discuss later, most existing tools are no more than mere implementations of graph drawing algorithms. Our primary goal is to provide interactive facilities which improve the runtime state examination process for defect detection. This goal is further elaborated into three primary design requirements: (i) a visualization style inspired by a social networking metaphor, enabling users to easily identify who deploys an object (its clients) and whom the object deploys (its servers); (ii) inspection features to easily review object contents and associations and to search content patterns (currently regular expressions only); and (iii) interactively configurable levels of information detail, supporting off-line inspection and multiple concurrent views. The reported work (i-views) has been implemented as a debugger plug-in on top of the Sparrow IDE for the Delta programming language, which is publicly available (Savidis, 2010). This language has been chosen for ease of implementation, since it offers an XML-based protocol for extracting object contents during runtime (Savidis & Lilis, 2009). Overall, our results may be applied to any other debugging tool.
2 Related Work Graphs have been widely deployed in the context of object-oriented visualizations (Lange & Nakamura, 1997) for various purposes, besides debugging, such as: tracing of dynamic object-allocation characteristics, revealing static ownership-based object relationships (Hill et al., 2002), tracking event generation in event-based systems (Kranzlmuller et al., 1996), and investigating object constraints (Muller, 2000). Tools offering interactive object graphs exclusively for the debugging process exist for various languages. Memory visualization tools such as those by (Aftandilian et al.,
2010), (De Pauw & Sevitsky, 1999) and (Reiss, 2009) are essentially memory graph visualizers targeted at displaying memory usage by programs, enabling programmers to detect memory leaks or identify memory corruption patterns. They are low-level, since their focus is on memory maps and memory analysis, while fundamental program notions like objects and associations (clients and servers) are not handled. A tool working on objects is DDD (GNU, 2009); however, it allows incremental expansion only for referent (outgoing / server) structures. Compared with previous work, it is clear that all existing tools focus merely on graph drawing quality rather than on optimizing the primary task, namely object state inspection for defect detection. As a result, while they tend to offer attractive visualizations overall, these are not perfectly aligned with the typical activity patterns programmers actually perform during debugging. As an example, none of the existing tools allows users to directly identify the clients and servers of an object, information that is very critical when examining error propagation patterns. In our approach, this has been treated as a primary requirement and played a key role in designing a layer-based graph visualization style. Additionally, we treated inspection as a genuine data mining process by providing users with facilities not met in existing tools, including: bookmarking, text search, lens tools, concurrent views, off-line inspection, dropping interactively examined objects, and configurable visualization parameters in real-time.
3 Task Details Our graph visualization approach does not adopt a generic graph drawing algorithm (Battista et al., 1999), like those provided by GraphViz (Gansner & North, 1999), but is explicitly designed for defect detection in object networks. We elaborate on the design details of our method, showing that it is primarily task-oriented, rather than supervision-oriented or comprehension-oriented as with most graph visualizers. More specifically, the bug detection process is initiated from a starting object on which the undesirable symptoms are initially observed. Then, an iterative process is applied with essentially two categories of analysis activities, to detect state corruption and to identify malicious code (see Fig. 2). In this context, programmers will usually have to further inspect objects by seeking referent objects (outgoing links) from a starting object. Such referent inspection is well supported by existing visualizers like DDD (GNU, 2010) or HeapViz (Aftandilian et al., 2010). However, it is also supported by common tree views (like those of Fig. 3), which, due to their ease of use, remain at present the most preferred inspection tool. Once the type and level of state corruption is verified on an object, the analysis proceeds so as to identify the offensive code. For the latter, all use sites of infected objects must be investigated, since they constitute potentially malicious code. For this step, referrer objects (incoming links) should be manually traced. Since many infected objects may coexist, it is crucial to allow programmers to quickly switch back and forth between referent and referrer inspection for different objects.
Fig. 2. Flowchart of the bug detection task. [Flowchart nodes: start; object in which symptom initially observed; examine object for state corruption (corrupted?); seek referents (server objects) / seek referrers (client objects); for each server, study invoked / invoker methods for malicious code (malicious?); roll back to previous examined object; bug detected, proceed with fixing; end.]
This step is the most critical and most demanding part of debugging, known as the bug finding process. Only when the offensive statement is eventually found does the defect detection step complete and the bug fixing process begin. Based on the previous remarks, we proceed with the design details of our object graph visualizer.
Fig. 3. Typical debugger tree views
4 Design Approach To better support defect detection in object networks, visualizers should support referrer and referent inspection, since these directly affect the generation and propagation of defects. In particular (see Fig. 4), objects are infected if a method with an offensive statement is invoked by clients (step 1).
Fig. 4. Defect generation and propagation in object networks. [Diagram labels: 1. client invocation of malicious method; 2. corruption passed to servers (referents, some are servers); 3. corruption passed to clients (referrers, some are clients); infected object.]
Such an infection is propagated when servers (referents) are used by an infected object (step 2), or when clients (referrers) use an infected object (step 3). Based on the characteristics of the debugging process, we focused on a visualization method allowing improved inspection practices for detecting object anomalies. We observed that, given an object, all its direct referrers and referents are essentially its close runtime peers, semantically playing as either clients or servers. If such an inner social circle for an object is directly traceable, it becomes easier to examine the actual level of infection. Following this concept, given a starting object a and a maximum social distance n, we introduce layers of social peers by the rules: (i) all direct referrers and referents of a are put in layer 1; and (ii) layer i+1 contains the direct referrers and referents of objects from layer i that are not included in layer i or layer i−1; if layer i+1 becomes empty, the process terminates.
Fig. 5. A typical digraph and a layered graph for node (1). [Left: a digraph over nodes 1–7; right: the same nodes arranged in layers 0, 1 and 2 relative to node 1.]
An example is provided in Fig. 5, showing how a layered graph is obtained from a typical directed graph for a given starting node. In a layered graph the runtime peers of an object fall either in its own layer or in a neighboring one. Every layer encompasses the client and server objects of its previous and next layers. By tracing layers, clients of clients or servers of servers are easily tracked down. Typically, from a starting node, the number of subsequent layers to visit during inspection is small, while the number of objects within layers may get very large. It should be noted that the layered view is not designed to offer a generally more attractive, or easier to assimilate, image of the graph. It is a task-specific visualization technique which, for a mission other than debugging, may prove to be suboptimal.
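A minimal sketch of the layering rules: a breadth-first traversal over the reference graph that treats referents (outgoing links) and referrers (incoming links) alike, starting from the focus object. The example graph below is hypothetical, not the exact digraph of Fig. 5.

```python
def build_layers(start, referents, referrers):
    # referents[o]: objects o points to (servers); referrers[o]: objects
    # pointing to o (clients). Layer 0 holds the starting object; layer i+1
    # holds direct peers of layer i not already placed in an earlier layer.
    layers = [[start]]
    seen = {start}
    while True:
        next_layer = []
        for obj in layers[-1]:
            for peer in list(referents.get(obj, [])) + list(referrers.get(obj, [])):
                if peer not in seen:
                    seen.add(peer)
                    next_layer.append(peer)
        if not next_layer:
            return layers
        layers.append(next_layer)

# Hypothetical object graph given as referent (outgoing) adjacency lists.
referents = {1: [2, 3], 2: [5], 3: [4], 4: [6, 7], 5: [], 6: [], 7: []}
referrers = {o: [p for p, outs in referents.items() if o in outs] for o in referents}
print(build_layers(1, referents, referrers))  # [[1], [2, 3], [5, 4], [6, 7]]
```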
5 Interactive Visualizer The architecture of the visualizer is provided in Fig. 6 and a snapshot is shown in Fig. 7 (objects are shown with full contents as tables with slot – value pairs). Although the default view is crowded with crossing / overlapping edges, we will briefly discuss how this is handled via the large repertoire of interactive configuration and inspection features allowing programmers to inspect object contents and identify potential state faults. We discuss the key features below.
Fig. 6. Software architecture of the visualizer
Fig. 7. Two parallel inspection sessions with different configurations
5.1 Adjustable Lens View The main view offers zooming and resizing to enable inspecting the object graph at different view scales and with varying view sizes. Similarly, the lens view scale and size (see Fig. 8) can be adjusted separately. The latter, combined with the main view, offers two independent levels of detail for the inspection process. One of the most typical uses of the lens view during the inspection phase relates to the snapshot of Fig. 8: (i) the main view is chosen with a high zoom-out factor, enabling the user to supervise the
largest part of the graph, although with a low level of information detail; (ii) the lens view is configured to combine a large view size (window) with a high zoom-in factor to offer an increased level of detail; and (iii) the lens is dragged across the main view to inspect object contents and their respective associations.
Fig. 8. The lens view in various (adjustable) view scales

Fig. 9. Content tooltips, text search combined with lens view and goto source. [Annotated screenshots: contents tooltip, search dialogue, goto source facility, search combined with lens view, search match found.]
5.2 Content Tooltips and Text Search As we discuss later, the visualization can be configured in various ways, such as displaying full object contents (all slots shown) or only the object reference identifier. The latter, while making the resulting graphs visually less crowded, also reduces the conveyed information content. For this purpose, content tooltips allow the user to quickly view object contents (see Fig. 9). Also, under a zoom-out factor (the value is configurable, the default is 30%), the tooltips remain active even if full contents are shown. Another feature which, amongst others, enables users to immediately spot objects with content-corruption patterns is text search (supporting regular expressions too). As shown in Fig. 9 (right part), with every match the tool focuses on the target object and highlights the matching slots; the search can be combined with lens views. 5.3 Bookmarking and Path Highlighting During the inspection process, content analysis and comparison between different objects is frequently performed. The goal is to identify cases where object contents disobey the patterns expected in a correct program execution. For this task, in existing debugging tools, programmers have to apply tedious heuristics.
Fig. 10. Various interaction configuration features in the object graph. [Annotated screenshots: path highlighting and bookmarks; selected contents shown; path highlighting control menu; content tooltips on hidden contents; inner edges hidden; selectively rejecting (can undo) examined nodes from the graph.]
For example, they manually copy contents into an editor, trace another target object in the debugger with which they need to compare, and then switch back and forth between the editor and the debugger inspection tool to identify differences or commonalities. To facilitate such activities, we introduced bookmarks in the graph viewer which record the focus object, view origin, and zooming factor. As a result, when switching context to a bookmark, the view state is restored exactly as it was at the time the bookmark was set. Bookmarked objects are indicated with an extra marker. Another useful facility for debugging is highlighting all client and server reference paths recursively (i.e. clients of clients and servers of servers) for a given object (see Fig. 10, upper part). This provides a clear picture of the runtime interactions of an object at a given time. Additionally, the object slots involved in creating a reference path are highlighted as well. The interactive configuration features of the graph viewer have been designed to enable switching between different levels of information detail. A few examples are provided under Fig. 10 (middle and bottom parts), showing the outcome of applying configuration features. They mostly concern the way either objects (vertices) or their associations (edges) are drawn. When combined with other inspection features, typical process patterns during debugging are supported. For example, consider path highlighting combined with the ability to hide the contents of all objects and the option of selective content expansion on individual objects. This combination allows programmers to focus on a specific client-server dependency path, commonly studied when likely corruption patterns are detected. This way, programmers avoid the information overload caused when objects irrelevant to the current investigation context are fully expanded. Additionally, it is possible to drop (with undo support) from the graph any object that is considered of no interest during inspection. This allows programmers to incrementally simplify the examined graph, eventually keeping only the objects assumed to be candidates for state infection.
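A minimal sketch of the bookmark record described above (class, field and method names are hypothetical illustrations, not the tool's actual API):

```python
from dataclasses import dataclass

@dataclass
class Bookmark:
    focus_object_id: str          # the object that had the focus
    view_origin: tuple            # (x, y) origin of the visible graph area
    zoom_factor: float            # zoom level when the bookmark was set

def restore(view, bookmark):
    # Hypothetical view object: restoring a bookmark re-applies the recorded
    # origin and zoom, then re-focuses the bookmarked object.
    view.set_origin(*bookmark.view_origin)
    view.set_zoom(bookmark.zoom_factor)
    view.focus(bookmark.focus_object_id)
```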
6 Summary and Conclusions
We have discussed a debugging assistant offering interactive object graphs, putting emphasis on improved visualization, inspection (navigation), and configuration facilities. The design of the tool focused on the optimal support of the primary task rather than on general-purpose graph drawing as such. This has led to a considerable number of interactive features, each with a distinctive role in object state inspection, not found in existing tools. As explained earlier, our visualization approach has also been a spin-off of the task analysis process, which emphasized effective and efficient inspection rather than overall comprehension. The latter is a novel view regarding the utility of graphical debugging aids. In particular, we observed that programmers study object paths almost exclusively based on infection criteria. In practice, views like dynamic object topology and linkage patterns were of less interest during debugging. They seemed appropriate for the design stages of a subsystem, to assimilate its runtime behaviour, but not during the bug-finding process.
Our evaluation trials have shown that programmers tend to spend more time on detailed inspection of object contents, on identifying corruption patterns, and on tracking down referrers and referents. They reported that they prefer alternative drawing tools mostly for static aspects such as class hierarchies and module dependencies. In conclusion, we believe our work offers a novel insight into the design of interactive graph visualizations for debuggers. Further systematic analysis and support of debugging activities may in the future result in more advanced facilities, effectively leading to more usable and useful interactive debugging environments.
GPU-Based Ray Casting of Stacked Out-of-Core Height Fields Christopher Lux and Bernd Fröhlich Bauhaus-Universität Weimar
Abstract. We developed a ray casting-based rendering system for the visualization of geological subsurface models consisting of multiple highly detailed height fields. Based on a shared out-of-core data management system, we virtualize the access to the height fields, allowing us to treat the individual surfaces at different local levels of detail. The visualization of an entire stack of height-field surfaces is accomplished in a single rendering pass using a two-level acceleration structure for efficient ray intersection computations. This structure combines a minimum-maximum quadtree for empty-space skipping and a sorted list of depth intervals to restrict ray intersection searches to relevant height fields and depth ranges. We demonstrate that our system is able to render multiple height fields consisting of hundreds of millions of points in real-time.
1 Introduction
The oil and gas industry is continuously improving the seismic coverage of subsurface regions in existing and newly developed oil fields. During this process, very large volumetric seismic surveys are generated using the principles of reflection seismology. The resulting volumetric data sets represent the magnitude of seismic wave reflections in the earth's subsurface. Geologists and geophysicists use these seismic volumes to create a geological model of the most important subsurface structures in order to identify potential hydrocarbon reservoirs. Horizons are a fundamental part of the geological model, representing the interface between layers of different materials in the ground. The ever-increasing size of seismic surveys generates extremely large horizon geometries composed of hundreds of millions of points (Figure 1). Traditional rasterization methods cannot render such data sets in real time. While horizon surfaces are commonly represented as height fields, general terrain rendering approaches [1] have not been adapted to deal with the specific structure of multiple horizon layers. We developed an efficient ray casting-based rendering system for the out-of-core visualization of sets of large stacked horizons. Our method employs a multi-resolution data representation and makes use of a minimum-maximum quadtree over the tiled horizon height fields to speed up the ray traversal. Sorted height intervals in the quadtree cells restrict the intersection searches to relevant horizons and depth ranges. We virtualize the access to the underlying horizon height fields such that each horizon can be locally treated at different levels of detail (e.g., occluded horizon parts are represented at a much lower resolution).
Fig. 1. This figure shows three different horizons extracted from a common seismic survey. While the uppermost horizon (red) is spatially isolated, the lower horizons (green and blue) are partially overlapping. The original resolution of each horizon is 6870 × 14300 points.
Our out-of-core data management system supports geological models consisting of multiple large height-field data sets, potentially exceeding the size of the graphics memory or even the main memory. A feedback mechanism is employed during rendering which directly generates level-of-detail information for updating the cut through the multi-resolution hierarchies on a frame-to-frame basis. Our work is motivated by the observation that with increasing screen resolutions and screen-space errors below one pixel, the geometry throughput of current GPUs is becoming a major performance bottleneck, and ray casting techniques can provide comparable performance for large data sets [2,3]. However, none of the existing ray casting-based rendering systems is capable of efficiently visualizing the complete stacks of highly detailed, mutually occluding and overlapping horizons contained in geological models. The main contributions of our out-of-core stacked height-field ray casting system include:
• A two-level acceleration structure for fast ray traversal and efficient ray intersection computations with stacked height fields.
• A single-pass rendering approach, which also generates level-of-detail feedback for updating the cuts through the multi-resolution representations of the height fields.
• A two-level out-of-core data management system, which provides virtualized access to multi-resolution height-field data.
Our implementation demonstrates that we can render multiple horizons consisting of hundreds of millions of points in real time on a single GPU. The ray casting approach also facilitates the integration with volume ray casting, a highly desirable property in visualization systems for subsurface data.
2 Related Work
The direct visualization of height-field surfaces using ray casting-based methods is a very active and well-explored field of computer graphics research. Early CPU-based methods, primarily targeted at terrain rendering applications, were based on a 2D-line rasterization of the projection of the ray into the two-dimensional height-field domain to determine the cells relevant for the actual intersection tests [4,5,6,7]. A hierarchical ray casting approach based on a pyramidal data structure was proposed by Cohen and Shaked [5]. They accelerate the ray traversal by employing a maximum-quadtree, storing the maximum height-displacement value of the areas covered by its nodes in order to identify larger portions of the height field not intersected by the ray. Enabled by the rapid evolution of powerful programmable graphics hardware, texture-based ray casting methods implemented directly on the GPU were introduced. The first published approach by Qu et al. [7] still uses a line rasterization approach similar to the traditional CPU-based approaches. The relief-mapping [8] methods pioneered by Policarpo et al. [9,10] employ a parametric ray description combined with an initial uniform linear search which is refined by an eventual binary search restricted to the found intersection interval. Considering that the initial uniform stepping along the ray may miss high-frequency details in the height field, these methods are considered approximate. While these first GPU-based ray casting algorithms did not utilize any kind of acceleration structure, later publications proposed different approaches. Donnelly [11] described the use of distance functions encoded in 3D textures for empty-space skipping. The drastically increased texture memory requirements make this technique infeasible for the visualization of large horizon height fields. Later methods proposed by Dummer [12] and Policarpo et al. [10] exhibit drastically reduced memory requirements. They calculate cone ratios for each cell to describe the empty space regions above the height field, allowing fast search convergence during the ray traversal. The very high pre-computation times of these techniques only allow for the handling of quite small height-field data sets. This problem is addressed in the subsequent publications by Oh et al. [13] and Tevs et al. [14]. They build upon the traversal of a maximum-quadtree data structure on the GPU akin to the algorithm presented by Cohen and Shaked [5]. The idea of encoding the quadtree in the mipmap hierarchy of the height field results in very moderate memory requirements. Visualizing a large height-field data set requires the use of level-of-detail and multi-resolution techniques to balance between rendering speed and memory requirements. Hierarchical multi-resolution data representations are traditionally applied to the visualization of large volumetric data sets [15]. Dick et al. [3] are able to handle arbitrarily large terrain data sets employing a multi-resolution quadtree representation of the tiled terrain elevation map. The virtualization of multi-resolution data representations enables rendering algorithms to be implemented such that they are mostly unaware of the underlying multi-resolution representation of the data set. Kraus and Ertl [16] describe how to use a texture atlas to store individual image sub-tiles in a single
texture resource. They use an index texture for the translation of the spatial data sampling coordinates to the texture atlas cell containing the corresponding data. Generalized out-of-core texture data virtualization methods were introduced as a result [17]. Our multi-resolution height-field data virtualization is based on a quadtree representation of the height-field data sets. The selected quadtree cuts are stored in a 2D texture atlas. Additionally, we use a compact serialization of the quadtree cut similar to [18] for the translation of the virtual texture sampling coordinates to the texture atlas cell containing the corresponding data. Furthermore, to update the levels of detail represented through the multi-resolution hierarchies, we employ a feedback mechanism during rendering similar to the mechanism described by Hollemeersch et al. [19]. They require a separate rendering pass to write the required level-of-detail feedback information to a lower-resolution target for fast evaluation. With costly rendering approaches such as ray casting, this would incur a serious rendering overhead. Our system generates the feedback information directly during rendering while still allowing the actual screen resolution to be sub-sampled. Policarpo et al. [10] introduced a method for handling a fixed number of height fields without any acceleration structure or virtualization in a single rendering pass. They simply move along the ray using a fixed sampling step and intersect all the height fields at once using vector operations on the GPU, which works efficiently for at most four height fields encoded in a single texture resource. In contrast, we handle each height-field layer as an individual data resource. This allows us to represent different parts of a horizon surface at different local levels of detail (e.g., occluded parts of individual horizons are represented at a much lower resolution). Furthermore, we employ a minimum-maximum quadtree over the tiled horizon height fields to speed up the ray traversal and use sorted intersection intervals for the individual horizons to restrict the actual intersection searches.
3 Out-of-Core Data Virtualization
This section describes our out-of-core data virtualization and rendering system. We begin with a brief overview of the basic system architecture and the relationships of its most important components, followed by a more detailed description of our resource management and level-of-detail feedback mechanism.
3.1 System Architecture
The foundation of our system is an efficient out-of-core texture management system which is designed to enable the handling of geological models consisting of multiple large horizons. The height fields describing the horizon surfaces are managed as two-dimensional single-channel texture resources by our system. The main parts of the system include the page cache, the page atlas and the level-of-detail feedback mechanism (Figure 2).
Fig. 2. This figure shows an overview of the out-of-core data virtualization and rendering system. Two large memory resources are maintained, one on the CPU and one on the GPU: the page cache and the page atlas. The image data is maintained as quadtree representations using a compact serialization scheme on the GPU to provide efficient translation of virtual to physical texture coordinates in the page atlas.
Visualizing height-field data potentially exceeding available graphics and system memory resources requires the use of level-of-detail and multi-resolution techniques to balance between visualization quality and memory requirements. Hierarchical multi-resolution data representations are traditionally applied for the visualization of large volumetric data sets [15]. Similar to these approaches, we use a quadtree as the underlying data structure for the multi-resolution representation of the two-dimensional height-field data. All nodes in the quadtree are represented by tiles or pages of the same fixed size, which act as the basic paging unit throughout our memory management system. For each height field we compute and continuously update a cut through its multi-resolution hierarchy using feedback information gathered during rendering (Section 3.2). The leaf nodes of these cuts define the actual working set of height-field pages on the GPU. The cut updates are incrementally performed using a greedy-style split-and-collapse algorithm, which considers a fixed texture memory budget [20]. During the update operation, only data currently resident in the main memory page cache is used and unavailable pages are requested to be asynchronously loaded and decoded from the out-of-core page pool. This approach prevents stalling of the update and rendering process due to slow transfers from external page sources. The height-field data is accessed during rendering through two resources on the GPU: a single shared large page atlas texture of a fixed size containing the pages of the current working sets of all height fields, and a set of small textures representing a serialization of the current quadtree cuts for each height field. The quadtree cuts are used for the translation of virtual texture coordinates to the physical sampling location in the shared page atlas.
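A minimal C++ sketch of such a greedy split pass under a fixed page budget is given below; this is our own simplification (the priority values, the budget accounting, and the callback signatures are assumptions, not the authors' implementation), and the complementary collapse pass is only hinted at in a comment.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Simplified sketch of a greedy split pass of a quadtree cut under a page budget.
struct CutNode {
    uint32_t nodeId;            // quadtree node identifier
    float    priority;          // e.g. screen coverage reported by the feedback pass
    bool     childrenResident;  // are all four child pages decoded in the CPU cache?
};

std::vector<CutNode> updateCut(std::vector<CutNode> cut, std::size_t pageBudget,
    const std::function<std::vector<CutNode>(const CutNode&)>& splitNode,
    const std::function<void(uint32_t)>& requestAsyncLoad)
{
    // Visit the most important nodes first.
    std::sort(cut.begin(), cut.end(),
              [](const CutNode& a, const CutNode& b) { return a.priority > b.priority; });

    std::vector<CutNode> next;
    std::size_t pages = cut.size();       // each split replaces one page by four
    for (const CutNode& n : cut) {
        if (pages + 3 <= pageBudget && n.childrenResident) {
            std::vector<CutNode> children = splitNode(n);
            next.insert(next.end(), children.begin(), children.end());
            pages += 3;
        } else {
            if (pages + 3 <= pageBudget && !n.childrenResident)
                requestAsyncLoad(n.nodeId);  // never stall: fetch children asynchronously
            next.push_back(n);               // keep the node at its current level for now
        }
    }
    // A complete implementation would also collapse the least important sibling
    // groups whenever the feedback shrinks or the budget is exceeded.
    return next;
}
```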
Fig. 3. This figure shows the results of the level-of-detail feedback mechanism during rendering. (a) Visualization of the determined page identifiers. (b) Final rendering based on our height-field ray casting.
3.2 Level-of-Detail Feedback Generation
Two basic approaches exist for generating view-dependent information concerning the required levels of detail of a multi-resolution representation: analytical methods and direct feedback mechanisms. Analytical methods try to determine the required levels of detail through view-dependent or data-dependent heuristics. These methods usually have problems taking occlusions in the data sets into account without the application of sophisticated occlusion culling techniques [21]. In contrast, the direct feedback methods typically employ additional rendering passes to generate information about the required levels of detail for the current view in off-screen buffers. After transferring these buffers to the system memory, their contents are analyzed and the derived information is used to inform the respective multi-resolution update methods. These additional rendering passes usually use only small off-screen buffers, which are a fraction of the size of the actual view port, to minimize read-back latency and processing time. Our system employs a direct screen-space feedback mechanism similar to that of Hollemeersch et al. [19]. The mechanism works in three stages. First, during rendering, identifiers for the actually required pages are determined for each pixel and saved into an off-screen buffer (Figure 3). Then, this buffer is used in an evaluation step directly on the GPU to generate a list of required pages and their respective pixel coverage. Finally, this compact list is transferred to the CPU where the coverage information of the pages is used to prioritize the nodes of the quadtrees. The page identifiers written to the feedback buffer encode the page position in the quadtree and the height-field instance. The advantage of this approach is that a large part of the feedback buffer evaluation is performed in parallel directly on the GPU and the condensed list is generally orders of magnitude smaller than the actual feedback buffer, allowing for fast transfers to the system memory. Computing the level-of-detail information while rendering virtualized height fields is a straightforward process. However, the high run-time
complexity of the height-field ray casting makes a separate rendering pass infeasible for feedback generation. For this reason, we generate the feedback information directly during rendering. Usually this requires that the feedback buffer is bound as an additional rendering target to the current frame buffer (a multiple render target (MRT) setup). This would force the feedback buffer to be the same size as the used view port. Using the direct texture image access functionality provided by current graphics hardware (the OpenGL 4 EXT_shader_image_load_store extension), we are able to sub-sample the view port by directly writing the feedback information to a lower-resolution off-screen buffer, thereby avoiding the traditional MRT setup.
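For illustration, a CPU-side C++ sketch of the feedback evaluation is shown below; in the actual system this reduction runs in OpenCL directly on the GPU, and the page-identifier encoding assumed here (zero meaning "no page") is our own convention.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Each feedback-buffer texel holds a page identifier encoding the height-field
// instance and the page position in its quadtree; 0 is assumed to mean "no page".
struct PageCoverage { uint32_t pageId; uint32_t pixels; };

std::vector<PageCoverage> evaluateFeedback(const std::vector<uint32_t>& feedbackBuffer)
{
    std::unordered_map<uint32_t, uint32_t> histogram;
    for (uint32_t id : feedbackBuffer)
        if (id != 0u)
            ++histogram[id];               // count covered pixels per requested page

    // Condensed request list: orders of magnitude smaller than the buffer and
    // cheap to transfer; the coverage counts prioritize the cut-update step.
    std::vector<PageCoverage> requests;
    requests.reserve(histogram.size());
    for (const auto& [id, pixels] : histogram)
        requests.push_back({id, pixels});
    return requests;
}
```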
3.3 Height-Field Virtualization
Data virtualization refers to the abstraction of logical texture resources from the underlying data structures, effectively hiding the physical characteristics of the chosen memory layout. Our system stores the physical height-field tiles in the page atlas in graphics memory. In addition, we store a compact serialization of the quadtree cuts and their parent nodes up to the root node for virtual texture coordinate translation. The encoding of the quadtrees is similar to the octree encoding proposed by Lefebvre et al. [18]. Each node in this data structure holds information regarding where the corresponding page is located in the page atlas, including scaling information and child node links for inner nodes. This allows us to choose every level of detail currently available in the associated quadtree cut. Translating a virtual texture coordinate into a physical sample coordinate in the page atlas involves the following steps: first, the required level of detail is determined for the requested sampling location; then, the quadtree is traversed from the root node down to the appropriate level, resulting in the indirection information required to transform the initial virtual sampling position into the physical sampling position in the page atlas. If the appropriate level of detail is not available in the quadtree cut, the traversal stops at the finest currently available level and thus returns a lower-resolution version of the requested page. As an extension to this approach, tri-linear data filtering is implemented by storing the upper and lower bounds of the required level of detail for a virtual sampling position during the quadtree traversal and filtering the two resulting image samples accordingly. This way, aliasing artifacts can be effectively prevented. Using the quadtree serialization method for virtual texture coordinate translation requires a tree traversal, but maintains a very small memory footprint per height-field data set. Although the traversal routine benefits from good texture cache performance due to the small size and locality of the quadtree encoding, the cost of the tree traversal is not negligible. However, each ray typically samples the same height-field tile multiple times, and thus caching traversal information significantly reduces the average traversal costs.
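The translation can be sketched as follows in C++; in the real system the serialized cut resides in a small texture and this traversal runs in the shader, so the structure layout and names below are illustrative assumptions only.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One node of the serialized quadtree cut: atlas placement plus child links.
struct CutTreeNode {
    float atlasOffsetU, atlasOffsetV;   // page origin inside the atlas, in [0,1]
    float atlasScale;                   // page extent inside the atlas
    std::array<int32_t, 4> child;       // child indices, -1 if not part of the cut
};

// Translate virtual coordinates (u, v) in [0,1]^2 at a requested level of detail
// into physical sampling coordinates in the page atlas.
void translateVirtualCoord(const std::vector<CutTreeNode>& cut, float u, float v,
                           int requestedLevel, float& atlasU, float& atlasV)
{
    int node = 0;                        // root of the serialized cut
    float lu = u, lv = v;                // coordinates local to the current node
    for (int level = 0; level < requestedLevel; ++level) {
        int cx = lu < 0.5f ? 0 : 1;
        int cy = lv < 0.5f ? 0 : 1;
        int next = cut[node].child[cy * 2 + cx];
        if (next < 0) break;             // level not resident: fall back to a coarser page
        node = next;
        lu = lu * 2.0f - cx;             // re-normalize into the child's local domain
        lv = lv * 2.0f - cy;
    }
    atlasU = cut[node].atlasOffsetU + lu * cut[node].atlasScale;
    atlasV = cut[node].atlasOffsetV + lv * cut[node].atlasScale;
}
```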
Fig. 4. This figure shows the construction and traversal of the sorted depth interval list for a single height-field tile. The left image shows the original minimum-maximum intervals for three horizons. The sorted interval list with associated horizons is derived from the intersections of the source intervals. The right image shows the ray-intersection search in the successive intervals; the intersection is found for ray interval R2 within interval I3.
4 Ray Casting Stacked Height Fields
Horizon surfaces are derived from a common seismic volume. Thus, the entire stack of horizon surfaces can be defined as equally sized height fields within the lateral (x-y) domain of the volume. The access to each height field is virtualized using our texture virtualization system, which allows us to access the height values of different height fields at different levels of detail using the same global texture coordinates. Our rendering approach is based on ray casting. We use a two-level acceleration structure for efficient intersections of a ray with a set of stacked height fields. A global minimum-maximum quadtree represents the primary structure for fast empty-space skipping, and a sorted list of height intervals is associated with each quadtree node to restrict intersection searches to relevant horizons and depth ranges. We restrict the size of the resulting quadtree data structure by building the minimum-maximum hierarchy based on the tiled horizon height fields. In fact, only a cut through this global quadtree is required during rendering, which is the union of all the multi-resolution cuts of the individual height fields available on the GPU. The minimum-maximum ranges associated with each node are generated as the union of the individual ranges of the distinct height fields. However, we generate a sorted disjoint interval list for each quadtree node, which associates one or more horizons with each interval, as shown in Figure 4. The ray traversal of the minimum-maximum quadtree is performed in top-down order in a similar way to that described by Oh et al. and Tevs et al. [13,14]. The minimum-maximum intervals associated with each quadtree node are used to quickly discard nodes without possible intersections. When reaching a leaf node in the current cut through the global quadtree, the actual horizons
contained in the sorted horizon interval list are intersected. The evaluation order of the interval list depends on the ray direction. If the ray is ascending in the z-direction, the list is evaluated front-to-back, starting with the interval with the smallest minimum value, while it is evaluated back-to-front for descending rays. If an intersection is found during the evaluation of an interval, it represents the closest height-field intersection and the evaluation process can be terminated early. In the case of intervals associated with multiple horizons, we successively search all contained horizons for intersections in the appropriate order (Figure 4). Once a horizon intersection is found, we use the intersection point to restrict the subsequent searches in the remaining horizons associated with the interval. Because the quadtree nodes represent horizon tiles containing, e.g., 128 × 128 height values, we perform a linear search along the ray within a tile to find a potential intersection interval, followed by a binary search for the actual intersection point. Even though there is no restriction on the tile size used for the construction of the global minimum-maximum quadtree, using a tile size less than or equal to the tile size utilized by the virtualization of the height fields allows for optimizations during the intersection search. In that case, only a single query for the page atlas location of an associated height-field tile is required, since all data lookups during the intersection search in a tile can be taken directly from this single atlas page. Thus, the traversal costs in the serialized height-field cuts for localizing the atlas page are amortized over a complete search interval, significantly reducing the overhead of the virtualization approach.
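A simplified C++ sketch of this direction-dependent interval evaluation is given below; the interval representation and the intersectHorizon callback are our own assumptions, standing in for the linear-plus-binary intersection search performed on the GPU.

```cpp
#include <functional>
#include <optional>
#include <vector>

// Disjoint depth interval of a quadtree leaf and the horizons overlapping it.
struct DepthInterval {
    float zMin, zMax;
    std::vector<int> horizons;
};

struct Hit { int horizon; float t; };    // hit parameter t along the ray

std::optional<Hit> intersectTile(
    const std::vector<DepthInterval>& intervals, bool rayAscendingInZ,
    const std::function<std::optional<float>(int, float, float)>& intersectHorizon,
    float tEnter, float tExit)           // intersectHorizon(horizon, tEnter, tExit)
{
    std::optional<Hit> best;
    const int n = static_cast<int>(intervals.size());
    // Ascending rays meet the shallow intervals first, descending rays the deep ones.
    for (int i = 0; i < n; ++i) {
        const DepthInterval& iv = intervals[rayAscendingInZ ? i : n - 1 - i];
        for (int h : iv.horizons) {
            // An earlier hit restricts the search range for the remaining horizons.
            float tMax = best ? best->t : tExit;
            if (std::optional<float> t = intersectHorizon(h, tEnter, tMax))
                if (!best || *t < best->t) best = Hit{h, *t};
        }
        if (best) return best;           // closest hit found: skip the later intervals
    }
    return best;
}
```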
5 Results
We implemented the described rendering system using C++. OpenGL 4 and GLSL were used for the rendering-related aspects and OpenCL for the level-of-detail feedback evaluation, which is also performed directly on the GPU. All tests were performed on a 2.8 GHz Intel Core i7 workstation with 8 GiB RAM equipped with a single NVIDIA GeForce GTX 480 graphics board running Windows 7 x64. We tested our system with a data set containing a stack of three partly overlapping horizons, provided to us by a member of the oil and gas industry for public display (Figure 5). The dimensions of each height field defining a horizon surface are 6870 × 14300 points at 16-bit resolution per sample. The chosen page size for the virtualization of the height fields was 128², the page atlas size was 128 MiB, and the page cache size was restricted to 1 GiB. The rendering tests were performed using a view port resolution of 1680 × 1050. Our system is able to render scenes as shown in Figure 5 at interactive frame rates ranging from 20 Hz to 40 Hz, depending on the viewer position and zoom level. We found that the rendering performance of our ray casting system depends mainly on the screen projection size of the height fields and the chosen sampling rate. The memory transfers between the shared resources of the data virtualization system have limited influence on the rendering performance.
Fig. 5. Horizon stack containing three partially intersecting surfaces. The resolution of the underlying height fields is 6870 × 14300 points. The left image displays the number of iteration steps required to find the final intersection, with brighter colors representing larger iteration counts. The middle image shows the quadtree cuts selected during the level-of-detail evaluation. The right image shows the final rendering result using a default color map encoding the sample depth in the generating seismic survey.
The level-of-detail feedback evaluation on the GPU introduces a small processing overhead below 0.5 ms per rendering frame in our testing environment. We chose a feedback buffer size of one quarter of the actual view port resolution (420 × 262). We evaluated various tile sizes for the construction of the global minimum-maximum quadtree, ranging from 32² to 128². The smaller tile sizes exhibited small improvements of 3-5% in rendering performance under shallow viewing angles compared to the largest size. However, under steep viewing angles, the introduced traversal overhead of the larger quadtree data structure resulted in a slight degradation of performance of 5-8%. The larger potential for empty-space skipping with smaller tile sizes is canceled out by the complexity introduced by the depth interval list evaluation and the final intersection search in the height-field tiles. We found that using the same tile size for the global minimum-maximum quadtree as for the data virtualization results in the best compromise between data structure sizes on the GPU and rendering performance. Figure 5a shows that, using a 128² tile resolution, higher iteration counts occur in areas where the ray closely misses one horizon tile and intersects another one. Thus, on the finest level, a ray needs to take an average of about 64 steps to find an intersection inside a tile, assuming that such an intersection exists. We also experimented with binary searches for finding the horizon intervals in a quadtree node. However, the limited number of actual depth intervals in our current data sets made this approach slower than simply searching linearly for the first relevant interval. For a larger number of stacked horizons, the binary search for the first relevant interval should improve performance. Concerning the caching of traversal information for the height-field virtualization mechanism during the ray-intersection search in a height-field tile, we found that an implicit caching mechanism implemented for the virtual data lookups results in better run-time performance than an explicit mechanism transforming the ray intersection search to the local coordinate system of the height-field tile in the page atlas. This allows for a much clearer implementation of the ray casting algorithm, decoupled from the actual multi-resolution data representation.
6 Conclusions and Future Work
We presented a GPU-based ray casting system for the visualization of large stacked height-field data sets. Based on a shared out-of-core data management system, we virtualize the access to the height fields, allowing us to treat the individual surfaces at different local levels of detail. The multi-resolution data representations are updated using level-of-detail feedback information gathered directly during rendering. This provides a straightforward way to resolve occlusions between distinct surfaces without requiring additional occlusion culling techniques. The visualization of entire stacks of height-field surfaces is accomplished in a single rendering pass using a two-level acceleration structure for efficient ray intersection searches. This structure combines a minimum-maximum quadtree for empty-space skipping and a sorted list of depth intervals to restrict ray intersection searches to relevant height fields and depth ranges. The implementation shows that stacks of large height fields can be handled at interactive frame rates, without loss of visual fidelity, and with moderate memory requirements. The feedback information used to guide the update of the multi-resolution representations is currently based on a purely texture-space level-of-detail metric. A combination of this approach with screen-space or data-based error metrics for the tile-based level-of-detail estimation can further improve rendering quality. Our ultimate goal is a visualization system for subsurface data capable of interactively visualizing entire geological models. A highly desirable feature of such a system is the combined rendering of surface geometries and volume data. Typical geological models in the oil and gas domain can consist of a large number of highly detailed horizon surfaces and extremely large volume data sets. Currently, no infrastructure exists for the efficient out-of-core management and rendering of geometry and volume data. Our ray casting-based approach to large horizon rendering is an important step in this direction, since it facilitates the efficient integration with multi-resolution volume ray casting. Acknowledgments. This work was supported in part by the VRGeo Consortium and the German BMBF InnoProfile project "Intelligentes Lernen". The seismic data set portrayed in this work is courtesy of Landmark/Halliburton.
References 1. Pajarola, R., Gobbetti, E.: Survey on Semi-regular Multiresolution Models for Interactive Terrain Rendering. The Visual Computer 23, 583–605 (2007) 2. Dick, C., Schneider, J., Westermann, R.: Efficient Geometry Compression for GPU-based Decoding in Realtime Terrain Rendering. Computer Graphics Forum 28, 67–83 (2009) 3. Dick, C., Krüger, J., Westermann, R.: GPU Ray-Casting for Scalable Terrain Rendering. In: Proceedings of Eurographics 2009 - Areas Papers, Eurographics, pp. 43–50 (2009) 4. Musgrave, F.K.: Grid Tracing: Fast Ray Tracing for Height Fields. Technical Report RR-639, Yale University, Department of Computer Science (1988)
5. Cohen, D., Shaked, A.: Photo-Realistic Imaging of Digital Terrains. Computer Graphics Forum 12, 363–373 (1993) 6. Cohen-Or, D., Rich, E., Lerner, U., Shenkar, V.: A Real-Time Photo-Realistic Visual Flythrough. IEEE Transactions on Visualization and Computer Graphics 2, 255–265 (1996) 7. Qu, H., Qiu, F., Zhang, N., Kaufman, A., Wan, M.: Ray Tracing Height Fields. In: Proceedings of Computer Graphics International, pp. 202–207 (2003) 8. Oliveira, M.M., Bishop, G., McAllister, D.: Relief Texture Mapping. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2000), pp. 359–368. ACM, New York (2000) 9. Policarpo, F., Oliveira, M.M., Comba, J.L.D.: Real-time Relief Mapping on Arbitrary Polygonal Surfaces. In: Proceedings of the 2005 Symposium on Interactive 3D Graphics and Games, I3D 2005, pp. 155–162. ACM, New York (2005) 10. Policarpo, F., Oliveira, M.M.: Relief Mapping of Non-Height-Field Surface Details. In: Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games, pp. 55–62. ACM, New York (2006) 11. Donnelly, W.: Per-Pixel Displacement Mapping with Distance Functions. In: Pharr, M. (ed.) GPU Gems 2, pp. 123–136. Addison-Wesley, Reading (2005) 12. Dummer, J.: Cone Step Mapping: An Iterative Ray-Heightfield Intersection Algorithm (2006), http://www.lonesock.net/files/ConeStepMapping.pdf 13. Oh, K., Ki, H., Lee, C.H.: Pyramidal Displacement Mapping: a GPU based Artifacts-free Ray Tracing Through an Image Pyramid. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, VRST 2006, pp. 75–82. ACM, New York (2006) 14. Tevs, A., Ihrke, I., Seidel, H.P.: Maximum Mipmaps for Fast, Accurate, and Scalable Dynamic Height Field Rendering. In: Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games, pp. 183–190. ACM, New York (2008) 15. LaMar, E., Hamann, B., Joy, K.I.: Multiresolution Techniques for Interactive Texture-Based Volume Visualization. In: Proceedings of IEEE Visualization 1999, pp. 355–361. IEEE, Los Alamitos (1999) 16. Kraus, M., Ertl, T.: Adaptive Texture Maps. In: Proceedings of SIGGRAPH/EG Graphics Hardware Workshop 2002, Eurographics, pp. 7–15 (2002) 17. Lefebvre, S., Darbon, J., Neyret, F.: Unified Texture Management for Arbitrary Meshes. Technical Report RR-5210, INRIA (2004) 18. Lefebvre, S., Hornus, S., Neyret, F.: Octree Textures on the GPU. In: GPU Gems 2, pp. 595–613. Addison-Wesley, Reading (2005) 19. Hollemeersch, C., Pieters, B., Lambert, P., Van de Walle, R.: Accelerating Virtual Texturing Using CUDA. In: Engel, W. (ed.) GPU Pro: Advanced Rendering Techniques, pp. 623–641. A.K. Peters, Ltd, Wellesley (2010) 20. Carmona, R., Froehlich, B.: Error-controlled Real-Time Cut Updates for Multi-Resolution Volume Rendering. Computers & Graphics (in press, 2011) 21. Plate, J., Grundhöfer, A., Schmidt, B., Fröhlich, B.: Occlusion Culling for Sub-Surface Models in Geo-Scientific Applications. In: Joint Eurographics - IEEE TCVG Symposium on Visualization, pp. 267–272. IEEE, Los Alamitos (2004)
Multi-View Stereo Point Clouds Visualization Yi Gong and Yuan-Fang Wang Computer Science Department, University of California, Santa Barbara, CA 93106
Abstract. 3D reconstruction from image sequences using multi-view stereo (MVS) algorithms is an important research area in computer vision and has a multitude of applications. Due to their image-feature-based analysis, the 3D point clouds derived from such algorithms are irregularly distributed and can be sparse in plain surface areas. Noise and outliers also degrade the resulting 3D clouds. Recovering an accurate surface description from such cloud data thus requires sophisticated post-processing, which can be computationally expensive even for small datasets. For time-critical applications, plausible visualization is preferable. We present a fast and robust method for multi-view point splatting to visualize MVS point clouds. Elliptical surfels of adaptive sizes are used to better approximate the object surface, and view-independent textures are assigned to each surfel according to an MRF-based energy optimization. The experiments show that our method can create surfel models with textures from low-quality MVS data within seconds. Rendering results are plausible with a small time cost due to our view-independent texture mapping strategy.
1 Introduction
Research in 3D model reconstruction using multi-view stereo (MVS) algorithms has recently made significant strides in the computer vision community. As a result, 3D point cloud data are no longer derivable only from expensive and specialized devices like range scanners, but also from uncalibrated consumer-market digital cameras [1] or even community photo collections [2]. However, 3D point clouds recovered by these computer vision algorithms are not as ideal as those from specialized scanners. MVS algorithms use feature-based analysis under the Lambertian shading assumption and therefore have difficulty handling textureless and non-Lambertian surfaces. Combined with the inherent difficulties of solving an ill-posed, inverse 2D-to-3D problem, 3D point clouds reconstructed by MVS algorithms are often sparse, irregularly distributed, noisy, and usually contain many outliers (see Figure 1). Such datasets present challenges to the recovery of object surface structure and appearance. To improve the quality of the recovered 3D point clouds, post-processing steps are often employed to adjust the positions and orientations of 3D points based on photo discrepancy constraints [1]. But such methods can be computationally very expensive. For applications requiring fast 3D model reconstruction, e.g., creating avatars for game players, a long waiting time is unacceptable. Efficient
Fig. 1. An example of MVS data. Left: input images; right: output point clouds.
and robust visualization methods for such MVS 3D datasets are important for completing the pipeline of image-based 3D reconstruction. Although computer graphics researchers have explored point-based modeling and rendering techniques for a long time, most existing work is not applicable to the raw 3D point data derived from MVS algorithms because of the data quality issues. Furthermore, deriving consistent texture maps for MVS data raises the additional research issue of avoiding color mismatches and cracks when using color/texture information from multiple images. This paper aims at providing an efficient and generic solution for visualizing unpolished MVS point clouds. Our main contributions are: we propose a statistical method to select nearest neighbors in highly irregularly distributed point clouds; we apply adaptive-radius elliptical surfels to approximate object surfaces; and we introduce Markov Random Field (MRF) based multi-view texture mapping for point splatting. Our results show that the proposed methods can handle such challenging 3D datasets much better than existing ones.
2 Related Work
For relatively clean, regular, and dense point clouds, many existing algorithms have been developed to extract the geometric surface precisely. They can be divided into two categories: explicit and implicit methods. Explicit methods make use of the Voronoi diagram and Delaunay triangulation to connect points and form surface meshes [3, 4, 5]. As explicit methods require dense point clouds as inputs and are sensitive to noise and outliers, they are not suitable for processing the irregularly distributed 3D point data generated by MVS algorithms. Implicit methods do not connect points explicitly. Instead, they define an inside/outside function f() and take f = 0 as the surface. This function f() is determined by minimizing an energy expression related to the distances from the input points P to the surface defined by f(). Hoppe et al. first used the distance field to extract surfaces from point clouds [6]. Later, radial basis functions [7, 8, 9], moving least squares [10], level sets [11], and Poisson functions [12] were also introduced to solve for the implicit function. Implicit methods are relatively robust to noisy data, but require an additional triangulation step to transform implicit functions into polygon meshes. Moreover, they are sensitive to normal estimation errors, which is a common problem for MVS data due to their sparsity and irregularity. Finally, for complex scenes containing overlapping objects, implicit surface extraction
Fig. 2. Left: incorrect surface connection generated by implicit surface extraction; right: the result of our method
techniques will generate unexpected inter-object connections between unrelated points, as shown in Figure 2 (left). To simplify the visualization of 3D point cloud data, Pfister et al. [13] and Zwicker et al. [14] proposed the EWA point splatting method, which skips the surface extraction step and visualizes an object's surface directly with oriented disks, called surfels (surface elements). Surfels overlap with each other in 3D space to cover the gaps among the points. When surfels are projected onto the screen, pixels covered by multiple surfels blend their surface attributes, such as normals and textures, in a weighted manner. This topology-free method avoids incorrect connections between points and is thus very suitable for visualizing MVS data, for which grouping points into objects is especially difficult when close or overlapping objects are present. Goesele et al. proposed an ambient point cloud concept that visualizes difficult backgrounds with ambient points to reduce artifacts in view interpolation [15]. But their method focuses on the view interpolation effect and differs from our purpose. Before MVS algorithms became mature enough to enable automatic image-based reconstruction, quite a few researchers tried image-based rendering (IBR) methods to visualize 3D scenes based on multi-view images while completely discarding assumptions about the scene geometry [16]. But these techniques usually require a huge number of images and produce blurred results for new viewpoints due to their reliance on blending. Debevec et al. [17] introduced view-dependent texture mapping (VDTM) with a manually built 3D geometric model and far fewer input images to solve these problems. VDTM inherits the ability to reproduce view-dependent effects from pure IBR methods. But due to its visibility-based subdivision step, the model becomes so complex that it requires a large amount of memory to store and consumes more time to render than view-independent methods. Yang et al. combined VDTM with point splatting [18]. But their experiments were mainly conducted on synthetic/calibrated systems, which are cleaner and more precise than our input and are suitable for applying the VDTM algorithm directly. For MVS-reconstructed 3D points from uncalibrated cameras like ours, blending multi-view textures cannot eliminate the appearance incoherence among neighboring points and neighboring views caused by re-projection error and exposure differences, which are the most serious problems of MVS data.
Lempitsky et al. [19] presented a view-independent texture mapping approach that minimizes an energy expression incorporating both viewing-direction matching and texture coherence. The image selected to texture a mesh triangle should have its viewing direction close to the triangle's normal. Meanwhile, neighboring triangles are expected to sample their textures from the same image. They use MRF optimization to constrain the relationship between these two factors. We also use MRF optimization to select textures based on an energy combining surface orientation and coherence among neighbors. But point clouds do not have explicit connections like polygon meshes with which to build the MRF directly. Our strategy is to create a graph conservatively, based on the distances between points and their sizes. The details are explained in Section 3.3.
3 Our Method
3.1 Nearest Neighbors
Picking nearest neighbors is the first and key step for surface approximation from point clouds, because the surfels' orientations, sizes, and positions are mainly determined by information from their neighbors. As we mentioned above, the 3D points in MVS data are actually recovered from 2D features detected in the input images. The distribution of these image features is usually very sparse and uneven in featureless regions. Traditionally, each point is assigned a fixed radius for neighbor selection, but such a strategy is not ideal when applied to MVS data, because it may select either too few or too many neighbors. Fixing the number of nearest neighbors is not a good design either, as it may include some neighbors so far away that they contribute little or even incorrect information to the current point's local geometry. To collect an adequate and representative set of neighbors for normal estimation, we allow the number of nearest neighbors to vary within a range and decide whether to accept or reject a neighbor by maximizing the relative difference between the accepted group of neighbors and the discarded ones. Suppose we have n nearest neighbors sorted by distance in ascending order. We accept the first

k* = arg max_k (d_{k+1} − μ̂_k) / (σ̂_k / √k)    (1)

neighbors and reject the range from k* + 1 to n. In the equation, d_{k+1} is the distance of the (k+1)-th neighbor, and μ̂_k and σ̂_k are the sample mean and sample standard deviation of the distances of the first k nearest neighbors. The formula above is essentially reduced from Welch's t-test statistic [20]

t_{A,B} = (μ̂_A − μ̂_B) / √(σ̂_A²/n_A + σ̂_B²/n_B)    (2)

which serves as a discriminant measure between two sample groups, which in our case are A = {d_1, d_2, ..., d_k} and B = {d_{k+1}}, where the sample sizes are n_A = k and n_B = 1; the second sample has only one element and thus a degenerate mean μ̂_B = d_{k+1} and a zero variance σ̂_B² = 0. The effect of this algorithm is shown in Figure 3. The current point is marked in red; the remaining dots are its nearest neighbors, sorted by distance from it. From the score chart, the 18th neighbor clearly receives the highest score, i.e., k* equals 18. The remaining two points are rejected; they are marked in pink.

Fig. 3. An example of our neighbor selection. Left: a 3D point (in red) and its accepted (in blue) and rejected (in pink) neighbors; right: scores of the tested neighbors.
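The selection rule of Eq. (1) can be sketched in a few lines of C++; the running-sum formulation and the (k-1)-normalized sample variance are our own implementation choices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Given the distances to the n nearest neighbors in ascending order, return k*
// maximizing (d_{k+1} - mean_k) / (stddev_k / sqrt(k)), as in Eq. (1).
std::size_t selectNeighborCount(const std::vector<double>& d)
{
    std::size_t bestK = 1;
    double bestScore = -1.0, sum = 0.0, sumSq = 0.0;
    for (std::size_t k = 1; k + 1 <= d.size(); ++k) {
        sum   += d[k - 1];
        sumSq += d[k - 1] * d[k - 1];
        if (k < 2) continue;                                  // need >= 2 samples for a spread
        double mean = sum / k;
        double var  = (sumSq - k * mean * mean) / (k - 1);    // sample variance of d_1..d_k
        double sd   = std::sqrt(std::max(var, 1e-12));        // guard against zero spread
        double score = (d[k] - mean) / (sd / std::sqrt(static_cast<double>(k)));
        if (score > bestScore) { bestScore = score; bestK = k; }
    }
    return bestK;   // accept the first bestK neighbors, reject the rest
}
```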
3.2 Adaptive-Size Elliptical Surfels
After obtaining an estimate of the local neighborhood as above, we approximate the object surface by oriented surfels. The normal of each surfel is available through Principal Component Analysis (PCA) [6], which is relatively robust for MVS data. But the size and shape need a more sophisticated approach to better match the local neighborhood geometry. Due to the irregular point distribution, each surfel's size needs to adapt to the local cloud density to cover the gaps among the points. Most existing point-based rendering algorithms use circular disks as the rendering primitives, which does not work well with MVS data. For example, for points on the ridge of a sharp geometry (i.e., an edge), large surfels will jut out from the approximated surface and look abrupt and weird. Thus, the surfel sizes should also relate to the local curvatures in different directions. For flat surface patches that have a small absolute value of surface curvature, a surfel can extend its area in any direction until it covers the gap in that direction or reaches the maximum radius we set. For highly curved patches, the surfel radius should be small. For patches flat in some directions and highly curved in others, we should assign anisotropic radii to the surfels. From differential geometry, we know that the largest difference between two surface curvatures at a point occurs in two perpendicular directions on the tangent plane, i.e., the principal directions [21]. Therefore, an elliptical surfel with two axes that can be adjusted independently perfectly meets our needs. To estimate the local curvature, we fit a least-squares quadratic surface over the given point and its nearest neighbors. Then the principal curvatures can be derived from the Gaussian curvature and the mean curvature [21] calculated from the surface equation. As for the principal directions, we adopt Che et al.'s
method [22]. The radii along the two axes are assigned the value min(1/κ, (1/k) Σ_{i=1}^{k} d_i), where κ is the curvature of the current point, whose reciprocal equals the radius of the tangent circle in the given direction at that point. The second term is an estimate of the sparsity of the current point's neighborhood, namely the average distance of the first k neighbors to the current point. In our implementation, we have used k = 5 based on experiments. We also set minimum and maximum limits on the surfel radius to avoid extreme cases. With our algorithm, we can fill gaps among points better and avoid improperly extended surfels.
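A small C++ sketch of this per-axis radius assignment follows; the parameter names, the clamping bounds, and the handling of near-zero curvature are our own assumptions, while k = 5 mirrors the value used in the text.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Radius along one principal direction: min(1/|kappa|, average of the first k
// neighbor distances), clamped to [rMin, rMax] to avoid extreme cases.
double surfelRadius(double kappa, const std::vector<double>& neighborDist,
                    std::size_t k = 5, double rMin = 0.01, double rMax = 1.0)
{
    std::size_t m = std::min(k, neighborDist.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < m; ++i) sum += neighborDist[i];
    double sparsity = m > 0 ? sum / m : rMax;                  // local sparsity estimate
    double curvatureBound = std::abs(kappa) > 1e-9 ? 1.0 / std::abs(kappa) : rMax;
    return std::clamp(std::min(curvatureBound, sparsity), rMin, rMax);
}
```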
3.3 Multi-View Texture Mapping
Texture coherence among neighboring points is very important for MVS data visualization because of the non-negligible re-projection error of the reconstructed 3D points and the exposure differences among the input views. Seams and mismatches are inevitable if neighboring surface points take different images as their textures. As a result, simply choosing the image with the closest orientation as the texture source for each surfel does not lead to a satisfactory result (Figure 4). Therefore, we need to strike a balance between similarity in viewing direction and coherence of the texture source among neighboring surfels. Since this coherence only involves neighboring surfels, which conforms to the Markovian property, we can define this problem as an energy minimization over an MRF. In contrast to Lempitsky and Ivanov's method [19], which applies MRF optimization to the texture mapping of polygon meshes, we do not have an explicit neighbor graph. Our points' neighbor relationships are vague and implicit because our primitives have no explicit connections and the surfel radii are adaptive. We judge two points p1 and p2 to be neighbors if their distance is smaller than the sum of their long axes. This is a conservative estimate, but as we want the color to transfer smoothly, making close points share similar texture sources is not a drawback. According to this rule, we add edges between surfel pairs to create the MRF. Its energy function has two parts: a data energy, which relates only to the given point's visibility in a certain image (already known in typical MVS output) and to how close the camera's viewing direction is to the point's normal; and a smoothness energy, which penalizes a surfel that does not use the same texture as its neighbors:

E(p) = data(p, camera_i) + λ · smoothness(p, neighbor_of(p))    (3)

in which λ is a regularization parameter balancing the two terms. We use graph cuts [23, 24, 25] to solve the MRF optimization, and it works well.
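For illustration, the energy of Eq. (3) for a candidate labeling can be written as in the C++ sketch below; the data-cost table and the Potts-style smoothness penalty are simplifying assumptions, and in practice the labeling is found with alpha-expansion graph cuts [23, 24, 25] rather than by evaluating candidate labelings exhaustively.

```cpp
#include <cstddef>
#include <vector>

// Two surfels judged to be neighbors (distance smaller than the sum of their long axes).
struct Edge { std::size_t a, b; };

// Energy of a candidate labeling: label[p] is the index of the image used to
// texture surfel p; dataCost[p][i] is assumed to combine the visibility of p in
// image i and the deviation of image i's viewing direction from p's normal.
double mrfEnergy(const std::vector<int>& label,
                 const std::vector<std::vector<double>>& dataCost,
                 const std::vector<Edge>& edges, double lambda)
{
    double e = 0.0;
    for (std::size_t p = 0; p < label.size(); ++p)
        e += dataCost[p][label[p]];                      // data term
    for (const Edge& ed : edges)
        if (label[ed.a] != label[ed.b])                  // smoothness term: neighbors
            e += lambda;                                 // should share texture sources
    return e;
}
```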
3.4 Point Splatting
Our rendering method is similar to Botsch et al.'s [26]. We first disable writing to the color buffer and enable writing to the depth buffer to render the surfels into the depth buffer without shading. Then depth writing is disabled and the color buffer is enabled. We shift all surfels a little closer to the camera to
Fig. 4. Multi-view texture mapping without (left) and with (right) MRF optimization
avoid z-fighting and render them with the assigned textures onto the screen. The colors and weights of covered pixels are accumulated by alpha blending in the frame buffer object, where weights are recorded in the alpha channel. Finally, each pixel’s color is normalized by its alpha value. This splatting process also helps us smooth the boundaries between surfels, as overlapping surfels will all contribute to the intersection area and blur the boundary naturally.
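The pass structure can be sketched with plain OpenGL state calls as below; the drawSurfels and normalizePass callbacks, the depth-offset value, and the framebuffer-object setup are placeholders of this sketch, not the authors' code.

```cpp
#include <GL/gl.h>

// Two-pass surfel splatting followed by per-pixel normalization (sketch).
void renderSplats(void (*drawSurfels)(float depthOffset), void (*normalizePass)())
{
    const float kEpsilon = 1e-3f;              // shift surfels slightly towards the camera

    // Pass 1: fill only the depth buffer, no shading.
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glEnable(GL_DEPTH_TEST);
    drawSurfels(0.0f);

    // Pass 2: accumulate textured colors and weights of the slightly shifted
    // surfels with additive blending; weights are written to the alpha channel.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);                     // keep the depth buffer from pass 1
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);
    drawSurfels(kEpsilon);

    // Pass 3: divide each pixel's accumulated color by its accumulated alpha.
    glDisable(GL_BLEND);
    normalizePass();
}
```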
4 Experiments
We test our algorithms on a computer configured with a 2.33 GHz Core 2 Quad CPU, 4 GB of DDR3 memory, and an Nvidia GeForce 210 video card with 512 MB of video memory. Our input is the 3D point clouds generated by the SfM algorithm [27]. Note that the algorithm uses photos from uncalibrated, consumer-market digital cameras with arbitrary photographing modes. Illumination and exposure are not controlled in these sequences, and adjacent images can exhibit large changes in color and brightness. The cameras' intrinsic and extrinsic parameters are recovered by a non-linear optimization using corresponding feature points in different images. Hence, the point clouds have errors and outliers. The data contain sparse but precisely recovered 3D points from SIFT features and dense but flawed points recovered by interpolation. When calculating normals and curvatures, we give high weight to the SIFT features and low weight to the interpolated points to reduce error. Though we only test performance on data sets from this MVS algorithm, our method is applicable to point clouds from other MVS algorithms too. 3D points generated by MVS algorithms have similar attributes and formats: 3D positions, colors, and 2D coordinates in the visible images. Camera parameters are also available from their respective recovery processes. So all the information we need is accessible in typical MVS algorithms. In our experiments, the operations of searching for nearest neighbors [28], calculating normals, and estimating principal curvatures are very fast. Optimizing texture selection with the MRF is the most time-consuming stage of our algorithm. We run 2 iterations of alpha expansion with graph cuts, which costs 14 to 18 seconds for 41k to 50k point models in our tests (Table 1). From our results, we
Table 1. Our test datasets and corresponding time costs

Model      # PTS   # Views  # MRF Edges  # Used Views  Time(fitting)  Time(texture)
Building   60,206  16       1,039,649    15            3.132s         15.253s
Logcabin   41,303  29       683,124      18            2.142s         13.844s
Nandhu     43,247  22       712,329      14            2.301s         17.697s
Tobias     49,044  21       809,571      14            2.366s         20.614s
Medusa     63,373  19       1,026,093    15            2.970s         33.159s
Fig. 5. Medusa model. (a) recovered normals; (b)(c)(d) results from different views.
can see that even without very precisely estimated normals, the MRF-based texture mapping strategy can still produce plausible visualization results, while selecting texture using only the best-matched viewing direction leads to color jumps and discontinuities (Figure 4). We also compare our result with Autodesk's Photofly [29], a state-of-the-art application for 3D reconstruction and visualization. Photofly has its own 3D reconstruction algorithm, so its input point clouds may differ slightly from ours, but it is still a good reference for the visualization part. From Figures 6 and 7, we can see the large difference between its rendering strategy and ours. Photofly blends multi-view textures to reduce the artifacts brought by low-quality MVS data, which blurs the picture heavily, whereas we render surfels crisply, without such blurring, after optimizing their orientation, shape, and textures.
Fig. 6. Tobias model. (a) recovered normals; (b)(c) results from different views; (d) Photofly’s [29] rendering result - in the hair area, the artifacts caused by multi-view images blending is very obvious.
Fig. 7. Logcabin model. Rendering results of (a) our method and (b) Photofly [29].
5 Conclusions
We have presented a novel algorithm to visualize challenging point clouds generated by MVS algorithms. Our algorithm uses a statistical metric to select representative neighbors for each 3D point and then approximates the point's neighborhood with an elliptical surfel of an adaptive size. Each surfel's orientation, shape, and size are calculated according to its neighborhood information. To remedy the imprecision of an approximated surface, we apply an MRF-optimized texture mapping strategy to select the most appropriate image as the texture source for each surfel while minimizing appearance incoherence among neighboring surfels. Our results show that the proposed algorithm handles low-quality 3D data well. Our future work includes improving the precision of the normal estimation by taking image information into consideration and exploring possible shape detection on MVS data to make the surface reconstruction more robust.
References
1. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multi-view stereopsis. In: Proc. CVPR 2007, pp. 1–8 (2007)
2. Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M.: Multi-view stereo for community photo collections. In: Proc. ICCV 2007, pp. 1–8 (2007)
3. Edelsbrunner, H., Mücke, E.P.: Three-dimensional alpha shapes. In: Proc. VVS 1992, pp. 75–82 (1992)
4. Amenta, N., Bern, M., Kamvysselis, M.: A new Voronoi-based surface reconstruction algorithm. In: Proc. SIGGRAPH 1998, pp. 415–421 (1998)
5. Amenta, N., Choi, S., Kolluri, R.K.: The power crust. In: Proc. SMA 2001, pp. 249–266 (2001)
6. Hoppe, H., DeRose, T., Duchamp, T., Halstead, M., Jin, H., McDonald, J., Schweitzer, J., Stuetzle, W.: Piecewise smooth surface reconstruction. In: Proc. SIGGRAPH 1994, pp. 295–302 (1994)
7. Turk, G., O'Brien, J.F.: Shape transformation using variational implicit functions. In: Proc. SIGGRAPH 1999, pp. 335–342 (1999)
8. Carr, J.C., Beatson, R.K., Cherrie, J.B., Mitchell, T.J., Fright, W.R., McCallum, B.C., Evans, T.R.: Reconstruction and representation of 3D objects with radial basis functions. In: Proc. SIGGRAPH 2001, pp. 67–76 (2001)
9. Ohtake, Y., Belyaev, A., Seidel, H.P.: Ridge-valley lines on meshes via implicit surface fitting. In: Proc. SIGGRAPH 2004, pp. 609–612 (2004)
10. Alexa, M., Behr, J., Cohen-Or, D., Fleishman, S., Levin, D., Silva, C.T.: Computing and rendering point set surfaces. IEEE Trans. Visual. Comput. Graph. 9, 3–15 (2003)
11. Zhao, H.K., Osher, S., Fedkiw, R.: Fast surface reconstruction using the level set method. In: Proc. VLSM 2001, pp. 194–201 (2001)
12. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proc. SGP 2006, pp. 61–70 (2006)
13. Pfister, H., Zwicker, M., van Baar, J., Gross, M.: Surfels: surface elements as rendering primitives. In: Proc. SIGGRAPH 2000, pp. 335–342 (2000)
14. Zwicker, M., Pfister, H., van Baar, J., Gross, M.: Surface splatting. In: Proc. SIGGRAPH 2001, pp. 371–378 (2001)
15. Goesele, M., Ackermann, J., Fuhrmann, S., Haubold, C., Klowsky, R., Steedly, D., Szeliski, R.: Ambient point clouds for view interpolation. In: Proc. SIGGRAPH 2010, pp. 95:1–95:6 (2010)
16. Shum, H.Y., Chan, S.C., Kang, S.B.: Image-Based Rendering. Springer, Heidelberg (2006)
17. Debevec, P.E., Taylor, C.J., Malik, J.: Modeling and rendering architecture from photographs: a hybrid geometry- and image-based approach. In: Proc. SIGGRAPH 1996, pp. 11–20 (1996)
18. Yang, R., Guinnip, D., Wang, L.: View-dependent textured splatting. The Visual Computer 22, 456–467 (2006)
19. Lempitsky, V., Ivanov, D.: Seamless mosaicing of image-based texture maps. In: Proc. CVPR 2007, pp. 1–6 (2007)
20. Welch, B.L.: The generalization of "Student's" problem when several different population variances are involved. Biometrika 34, 28–35 (1947)
21. Struik, D.J.: Lectures on Classical Differential Geometry. Addison-Wesley, Reading (1950)
22. Che, W., Paul, J.C., Zhang, X.: Lines of curvature and umbilical points for implicit surfaces. Computer Aided Geometric Design 24, 395–409 (2007)
23. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23, 1222–1239 (2001)
24. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1124–1137 (2004)
25. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Anal. Mach. Intell. 26, 147–159 (2004)
26. Botsch, M., Hornung, A., Zwicker, M., Kobbelt, L.: High-quality surface splatting on today's GPUs. In: Proc. PBG 2005, pp. 17–141 (2005)
27. Chen, C.I., Sargent, D., Tsai, C.M., Wang, Y.F., Koppel, D.: Stabilizing stereo correspondence computation using Delaunay triangulation and planar homography. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Porikli, F., Peters, J., Klosowski, J., Arns, L., Chun, Y.K., Rhyne, T.-M., Monroe, L. (eds.) ISVC 2008, Part I. LNCS, vol. 5358, pp. 836–845. Springer, Heidelberg (2008)
28. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.: An optimal algorithm for approximate nearest neighbor searching. J. ACM 45, 891–923 (1998)
29. Autodesk Photofly, http://labs.autodesk.com/technologies/photofly/
Depth Map Enhancement Using Adaptive Steering Kernel Regression Based on Distance Transform Sung-Yeol Kim, Woon Cho, Andreas Koschan, and Mongi A. Abidi Imaging, Robotics, and Intelligent System Lab, The University of Tennessee, Knoxville, TN 37996, USA
Abstract. In this paper, we present a method to enhance noisy depth maps using adaptive steering kernel regression based on distance transform. Data-adaptive kernel regression filters are widely used for image denoising by considering spatial and photometric properties of pixel data. In order to reduce noise in depth maps more efficiently, we adaptively refine the steering kernel regression function according to local region structures, i.e., flat and textured areas. In this work, we first generate two distance transform maps from the depth map and its corresponding color image. Then, the steering kernel is modified by a newly-designed weighting function directly related to the joint distance transform. The weighting function expands the steering kernel in flat areas and shrinks it in textured areas toward local edges in the depth map. Finally, we filter the noise in the depth map with the refined steering kernel regression function. Experimental results show that our method outperforms the competing methods in objective and subjective comparisons for depth map enhancement.
1 Introduction

We can now capture more accurate per-pixel depth data of a real scene using active range sensors [1]. The range information is usually represented by 8-bit or 16-bit data images called depth maps. In many applications, the depth map obtained from active range sensors is used with its corresponding color image as a pair [2]. In general, the resolution of the depth map is lower than that of its color image due to many challenges in real-time distance measurement. In addition, the quality of the depth map is degraded by noise during depth data acquisition. Consequently, in industry, the practical use of the active range sensor is limited to applications involving foreground extraction and motion tracking [3]. Data-adaptive kernel regression filters have been widely used for image denoising. One of them is the well-known bilateral filter [4]. The bilateral filter denoises an image using a kernel regression function derived from the differences of pixel positions and intensities between a pixel of interest and its neighboring pixels. A joint bilateral filter [5, 6] is also used to enhance an image using a guidance image. The guidance image has a different image modality than the target image. The kernel of the joint bilateral filter is derived from differences of intensities in the guidance image. Recently, an interesting data-adaptive filter called the "steering kernel filter" has been introduced [7]. The characteristic of the steering kernel filter is that its kernel is designed more adaptively
by considering local region structures. The kernel of this filter is constructed from gradient differences between a pixel of interest and its neighboring pixels. In this paper, we improve the quality of a noisy depth map using a steering kernel filter. In order to enhance the noisy depth map efficiently, we refine the steering kernel regression function adaptively under the following assumptions: the more spread out the kernel is in a flat area, the more efficient the image denoising will be in that flat area, and the more shrunken the kernel is in a textured area, the more efficient the image denoising will be in that textured area. For steering kernel refinement, the original steering kernel is modified by a newly-designed weighting function directly related to the joint distance transform [8] of the depth map and its color image. The weighting function depends on the local analysis window size, the distance transform strength, and the kernel refinement strength. Physically, the weighting function makes the kernel more spread out in flat areas and more shrunken toward local edges in textured areas. Consequently, the refined kernel is more effective at minimizing noise. The main contributions of this paper are: (a) developing an efficient adaptive steering kernel regression function based on joint distance transform for depth map denoising, and (b) presenting a relationship between local region structures and joint distance transform values in different image modalities. In our experiments, the proposed method obtained better results with both synthetic and real test data than competing methods, such as the bilateral filter [4], joint bilateral filter [5], and steering kernel filter [7]. This paper is organized as follows. Section 2 introduces the related work of adaptive steering kernel regression. Then, Section 3 explains the proposed method in detail. After providing experimental results in Section 4, we conclude in Section 5.
2 Steering Kernel Regression

In this section, we introduce the steering kernel regression function based on Takeda's work [7]. The 2-D image model can be represented by
y_i = z(x_i) + ε_i,   i = 1, …, W×W,   x_i = [x_{1i}, x_{2i}]^T    (1)
where x_{1i} and x_{2i} of x_i are the sampling positions of the vertical and horizontal coordinates in an image, respectively, y_i means a noisy sample, z(x_i) is the regression function at a sampling position x_i, ε_i is a noise factor, and W×W is the size of the local analysis window around a pixel position x of interest. When we represent the regression function z(x_i) using a Taylor series up to the 2nd order, the optimization problem to find the best value at a pixel position x of interest is given by

min Σ_{i=1}^{W×W} [ε_i]² = min Σ_{i=1}^{W×W} [y_i − z(x_i)]²
  ≈ min Σ_{i=1}^{W×W} [y_i − β_0 − β_1^T (x_i − x) − β_2^T vech{(x_i − x)(x_i − x)^T}]² K_{H_i^steer}(x_i − x)    (2)
where vech(·) is defined as the lower triangular portion of a symmetric matrix and β_0, β_1, and β_2 are
β_0 = z(x),   β_1 = ∇z(x) = [∂z(x)/∂x_1, ∂z(x)/∂x_2]^T,   β_2 = (1/2) [∂²z(x)/∂x_1², ∂²z(x)/(∂x_1∂x_2), ∂²z(x)/∂x_2²]^T    (3)
In Takeda’s work, the steering matrix Histeer is defined by 1
H isteer = hμ i Ci2
(4)
where h is the global smoothing parameter, μ_i is the local density parameter, and C_i is a 2×2 covariance matrix based on the local gradient values. Here, it is worth noting that the steering kernel K_{H_i^steer}(x_i − x) depends on the photometric property, because the covariance matrix C_i is determined by the intensities in the image. Finally, when we project the steering matrix onto a Gaussian kernel function, the steering kernel regression function is given by
K_{H_i^steer}(x_i − x) = (√det(C_i) / (2π h² μ_i²)) · exp(−(x_i − x)^T C_i (x_i − x) / (2 h² μ_i²)).    (5)

The 2×2 covariance matrix C_i is given by

C_i = J_i^T J_i,   J_i = [z_{x1}(x_j)  z_{x2}(x_j)],   x_j ∈ ω_i    (6)
where J_i is a gradient matrix, z_{x1} and z_{x2} in J_i are the first partial derivatives along the vertical and horizontal directions, respectively, and ω_i is a local analysis window around the position of a given sample. In Feng's work [9], the dominant orientation and shape of the local gradient field can be derived from the truncated singular value decomposition of J_i as in Eq. 7.
J_i = U_i S_i V_i^T = U_i [s_1  0; 0  s_2] [v_{11}  v_{21}; v_{12}  v_{22}],    (7)
where U_i S_i V_i^T indicates the truncated singular value decomposition of J_i, S_i is a diagonal 2×2 matrix representing the energy in the dominant directions, and V_i is a 2×2 orthogonal matrix. In Eq. 7, the dominant orientation θ_i, the elongation parameter δ_i, and the scaling parameter γ_i of the covariance matrix C_i can be calculated by

θ_i = arctan(v_{12}/v_{22}),   δ_i = (s_1 + λ′)/(s_2 + λ′),   γ_i = ((s_1 s_2 + λ″)/M)^α    (8)
where λ′ and λ″ are larger than zero for regularization, α is the structure sensitivity parameter, which is smaller than 1, and M is the number of samples in the local analysis window (W×W). Starting from a circular kernel based on the spatial property, the elongation parameter δ_i determines the amount of elongation of the kernel and transforms the circular kernel into an elliptical kernel with semi-major axis δ_i and semi-minor axis 1/δ_i. The dominant orientation θ_i and the scaling parameter γ_i determine the amount of rotation and expansion of the kernel, respectively. From Eq. 8, we can notice some important properties of steering kernels related to local region structures in flat areas and textured areas. In flat areas, s_1 and s_2 are usually close to 0 (s_1 ≈ s_2 ≈ 0). Consequently, the elongation parameter δ_i is close to the
value of 1, creating a circular kernel, and the scaling parameter γ_i becomes a large number, expanding the kernel size. In contrast, s_1 is much larger than s_2 (s_1 >> s_2) in textured areas. This makes the elongation parameter δ_i close to the value of 0 and the scaling parameter γ_i a small number. Consequently, the kernel is elongated along the local edge.

2.1 Inspection of Steering Kernels of "Noisy" and "Noiseless" Images

Figure 1(a) and Figure 1(b) show a shape comparison of the steering kernel regression functions of the original Lena image and a noisy image generated by adding Gaussian noise with a standard deviation of 25. We can notice that the shape and orientation of the kernels are very close to those of the original case. In addition, the steering kernels of the noisy image are relatively more spread out in flat areas than in the original case, and also relatively more spread out along the local edge in textured areas.
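For reference, the kernel evaluation described by Eqs. (5)-(8) can be sketched as follows; this is not the authors' code, `mu`, `lam1`, and `lam2` are placeholder values, and h = 2.4 and α = 0.5 follow the settings reported in Section 4.

```python
# Sketch of the steering kernel at one pixel of interest (Eqs. 5-8).
import numpy as np

def steering_kernel(zx1, zx2, offsets, h=2.4, mu=1.0, lam1=1.0, lam2=0.01, alpha=0.5):
    # zx1, zx2: vertical/horizontal first derivatives over the local analysis window;
    # offsets: (N, 2) array of x_i - x for the samples in the window.
    J = np.column_stack([np.ravel(zx1), np.ravel(zx2)])   # gradient matrix J_i (Eq. 6)
    C = J.T @ J                                           # covariance C_i = J_i^T J_i
    # SVD-based descriptors of the local gradient field (Eqs. 7-8)
    _, (s1, s2), Vt = np.linalg.svd(J, full_matrices=False)
    V = Vt.T                                              # V_i = [[v11, v21], [v12, v22]]
    theta = np.arctan2(V[1, 0], V[1, 1])                  # dominant orientation theta_i
    delta = (s1 + lam1) / (s2 + lam1)                     # elongation delta_i
    gamma = ((s1 * s2 + lam2) / J.shape[0]) ** alpha      # scaling gamma_i
    # Gaussian steering kernel weights K_{H_i^steer}(x_i - x) of Eq. 5
    quad = np.einsum('ni,ij,nj->n', offsets, C, offsets)
    weights = np.sqrt(np.linalg.det(C)) / (2.0 * np.pi * h**2 * mu**2) \
              * np.exp(-quad / (2.0 * h**2 * mu**2))
    return weights, (theta, delta, gamma)
```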
Fig. 1. Testing with a variety of kernels from the noisy and noiseless images for image denoising: (a) noiseless Lena image; (b) noisy Lena image (flat and textured areas marked); (c) Case A: kernels from the noisy image; (d) Case B: kernels from the noiseless image; (e) Case C: kernels from both images.
As a proof of concept, we performed a simple test to check what would happen if we use kernels from a "noiseless" image to enhance its noisy version. Note that, for the sake of simplicity, we here call the original image "noiseless" although every real image is to some degree affected by noise. Figure 1(c), Case A, shows the result of using kernels from the noisy image to denoise the noisy Lena image in Fig. 1(b). Figure 1(d), Case B, shows the result of using kernels from the noiseless image to denoise the
noisy image. Figure 1(e), Case C, shows the result when we use kernels from the noisy image in flat areas and kernels from the noiseless image in textured areas. Case C shows the best result in the comparison of root-mean-square error (RMSE): Case C (RMSE: 6.50), Case B (RMSE: 6.59), and Case A (RMSE: 6.64). In addition, when we compare Case A with Case B more carefully, Case B gives better results than Case A in the textured areas marked with a rectangle, but worse results in the flat areas marked with a circle. From this test, we can infer important properties of the steering kernel for more effective image denoising: (a) the more spread out the kernel is in a flat area, the more efficient the image denoising will be in that flat area, and (b) the more shrunken the kernel is in a textured area, the more efficient the image denoising will be in that textured area.
3 Adaptive Steering Kernel Based on Joint Distance Transform

In this paper, we propose a method to refine the steering kernel based on distance transform to enhance a noisy depth map. The primary idea of this work is that we modify the steering kernel more efficiently with color image information according to local region structures. The proposed method uses joint distance transform values of the depth map and its color image to weight the covariance matrix C_i in Eq. 6 adaptively.

3.1 Joint Distance Transform

A distance transform represents the distance between edges extracted from an image and its pixel positions. We carry out distance transforms for both the depth map and its color image. In order to perform distance transforms, we first apply a bilateral filter onto both the noisy depth map and its color image to remove isolated edges. Then, we obtain an edge map from the color image, and another edge map from the depth map using an edge detection algorithm. Thereafter, every pixel on an edge of each image is set to 0 and the other pixels are set to infinity (= 255) to initialize the distance transform. The distance transform is performed iteratively to get a distance transform map. Formally, for a pixel value p_k(i, j) in position (i, j) on a distance transform map at iteration k using the a-b distance transform, the pixel value p_k(i, j) is represented by

p_k(i, j) = min[ p_{k−1}(i−1, j−1) + b, p_{k−1}(i−1, j) + a, p_{k−1}(i−1, j+1) + b,
                 p_{k−1}(i, j−1) + a, p_{k−1}(i, j), p_{k−1}(i, j+1) + a,
                 p_{k−1}(i+1, j−1) + b, p_{k−1}(i+1, j) + a, p_{k−1}(i+1, j+1) + b ]    (9)
where a and b define the strength of the distance transform. We usually set b to a+1. When the distance transform maps D_1 and D_2 are obtained from the depth map and its color image, respectively, the joint distance transform map D_J is calculated by

D_J(i, j) = D_2(i, j)   if |D_1(i, j) − D_2(i, j)| ≤ 1,
D_J(i, j) = 255         otherwise.    (10)
Figure 2 shows the generation of joint distance transform from a depth map and its color image using 5-6 (a = 5, b = 6) distance transform. As shown in Fig. 2, we notice
that the smaller the intensity values of the pixels in the distance transform map are, the closer their pixel positions are to the edges of each image. In addition, it is worth noting that the values of the joint distance transform map come from some parts of the distance transform map D_2 of the color image. The other parts are set to infinity (= 255).
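A minimal sketch of the a-b distance transform iteration (Eq. 9) and the joint map (Eq. 10) is given below; it is not the authors' implementation, the fixed iteration count is an assumption (in practice one would iterate until the map no longer changes), and a = 1, b = 2 match the 1-2 distance transform used in the experiments.

```python
# Sketch of the iterative a-b distance transform (Eq. 9) and the joint map (Eq. 10).
import numpy as np

def distance_transform(edge_mask, a=1.0, b=2.0, iterations=30):
    # edge_mask: boolean array, True on edge pixels. Edges start at 0, all others at 255.
    d = np.where(edge_mask, 0.0, 255.0)
    h, w = d.shape
    neighbors = [(-1, -1, b), (-1, 0, a), (-1, 1, b), (0, -1, a),
                 (0, 1, a), (1, -1, b), (1, 0, a), (1, 1, b)]
    for _ in range(iterations):
        padded = np.pad(d, 1, constant_values=255.0)
        candidates = [d]                       # keeping p_{k-1}(i, j) itself in the minimum
        for dy, dx, cost in neighbors:
            candidates.append(padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w] + cost)
        d = np.minimum.reduce(candidates)
    return d

def joint_distance_transform(d_depth, d_color):
    # D_J takes the color-image distance where both maps roughly agree, 255 elsewhere.
    return np.where(np.abs(d_depth - d_color) <= 1, d_color, 255.0)
```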
Fig. 2. Generation of the joint distance transform map
3.2 Adaptively-Weighted Steering Kernel

From the test in Section 2.1, we can infer that the more spread out the kernel is in a flat area, the more efficient the image denoising is in the flat area, and the more shrunken the kernel is in a textured area, the more efficient the image denoising is in the textured area. From this inference, we change the shapes of the steering kernels by modifying the elements s_1 and s_2 of the diagonal matrix S_i derived from Eq. 7. The weighted steering kernel is represented by

s_1(i, j) := s_1(i, j) · (1 + w(i, j)),   s_2(i, j) := s_2(i, j) · (1 + w(i, j))    (11)
where w(i, j) is a weighting function. In this paper, we design the weighting function w(i, j) related to the joint distance transform D_J(i, j) using an exponential equation. The newly-designed weighting function is represented by

w(i, j) = β                           if 0 ≤ D_J(i, j) ≤ a,
w(i, j) = m · e^(−λ·D_J(i, j)) − β    otherwise,    (12)

where a is the strength of the distance transform, β is the strength of the weighting function with 0 < β < 1, and m and λ are dependent variables related to β and the local analysis window size W, which are given by

m = 2 · β · e = 5.4366 β,   λ = −log 2 / (1 − [W/2]),    (13)

where [·] is a rounding operator.
The weighting function based on the joint distance transform is depicted in Fig. 3. The weighting function has a positive value when the joint distance transform value is smaller than half the window size. In this case, the local analysis window is moving over definite textured areas of the depth map, since there are small common distance transform values between the two distance transform maps derived from the depth map and its color image. Therefore, the elements s_1 and s_2 of the diagonal matrix S_i become larger while the scaling parameter γ_i in Eq. 8 gets smaller. Consequently, the steering kernel will be more shrunken in the textured areas than the original one.
Fig. 3. Weighting function based on the joint distance transform: w(i, j) equals β for D_J(i, j) up to a (textured areas) and decreases toward −β for large D_J(i, j) (flat areas).
In contrast, the weighting function has a negative value when the joint distance transform value is larger than half the window size. In this case, the local analysis window is moving over definite flat areas of the depth map, since there are large common distance transform values between the two distance transform maps derived from the depth map and its color image. Therefore, the elements s_1 and s_2 of the diagonal matrix S_i become smaller while the scaling parameter γ_i in Eq. 8 gets larger. Consequently, the steering kernel will be more spread out in the flat areas than the original one.
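The refinement can be summarized by the sketch below; it is not the authors' code, β = 0.5, W = 11, and a = 1 follow the settings in Section 4, and the expression for λ reflects Eq. 13 as reproduced above. The paper does not spell out how C_i is rebuilt after the modification; one plausible choice, suggested by Eqs. 6-7, is C_i = V_i diag(s_1², s_2²) V_i^T.

```python
# Sketch of the joint-distance-transform weighting (Eqs. 12-13) and the update of Eq. 11.
import numpy as np

def weighting_function(dj, a=1.0, beta=0.5, W=11):
    half = round(W / 2)                        # [W/2] with [.] a rounding operator
    m = 2.0 * beta * np.e                      # m = 2*beta*e (approx. 5.4366*beta)
    lam = -np.log(2.0) / (1.0 - half)          # lambda as in Eq. 13
    dj = np.asarray(dj, dtype=float)
    return np.where(dj <= a, beta, m * np.exp(-lam * dj) - beta)

def adapt_singular_values(s1, s2, dj, **kwargs):
    w = weighting_function(dj, **kwargs)       # Eq. 11: scale s1 and s2 by (1 + w)
    return s1 * (1.0 + w), s2 * (1.0 + w)
```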
4 Experimental Results

We tested our proposed method with both synthetic and real data. For these experiments, we set the global smoothing parameter h in Eq. 4 to 2.4, the structure sensitivity parameter α in Eq. 8 to 0.5, the strength of the weighting function β in Eq. 12 to 0.5, the local analysis window size W to 11, and the kernel size to 21. We used the 1-2 distance transform to get the joint distance transform map. In order to verify the performance of the proposed method, we have compared our method with the bilateral filter [4], joint bilateral filter [5], and steering kernel filter [7].

4.1 Synthetic Data

In order to assess the improvement of the depth accuracy obtained with the proposed method, we tested the method on synthetic data with known ground truth from the Middlebury stereo dataset [10]. The synthetic data that we picked are the following 6 datasets: Bowling, Midd1, Monopoly, Flowerpots, Rocks1, and Wood1. These tested
images have an image resolution of 437×370 pixels. In order to artificially generate a noisy depth map from the datasets, we added Gaussian noise with a standard deviation of 20 to the ground truth data. We used RMSE measurements based on the established ground truth data for an objective evaluation of the depth quality improvements. We summarize the RMSE comparison between the proposed method and the competing methods in Table 1.
Table 1. RMSE comparison

Test data    Add Noise (STD=20)  Bilateral  Joint Bilateral  Steering Kernel  Proposed
Bowling      19.42               4.14       5.53             3.86             3.71
Midd1        19.67               4.83       6.32             4.63             4.44
Monopoly     18.92               4.26       5.24             4.01             3.87
Flowerpots   19.34               5.72       8.29             5.88             5.68
Rocks1       19.82               4.67       6.54             4.64             4.47
Wood1        20.01               3.80       5.14             3.21             3.06
Average      19.53               4.57       6.17             4.37             4.20
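For reference, the evaluation protocol amounts to the short sketch below; clipping the noisy depth map to the 8-bit range is an assumption not stated in the text.

```python
# Sketch of the synthetic evaluation: add Gaussian noise (sigma = 20) to the ground-truth
# depth map and compute the RMSE of a filtered result against the ground truth.
import numpy as np

def add_gaussian_noise(depth, sigma=20.0, seed=0):
    rng = np.random.default_rng(seed)
    noisy = depth.astype(float) + rng.normal(0.0, sigma, size=depth.shape)
    return np.clip(noisy, 0, 255)           # assumption: keep values in the 8-bit range

def rmse(estimate, ground_truth):
    diff = estimate.astype(float) - ground_truth.astype(float)
    return float(np.sqrt(np.mean(diff ** 2)))
```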
Fig. 4. Results obtained with the proposed method for the tested synthetic data (rows: Midd1, Monopoly, Flowerpots): (a) ground truth; (b) noisy image; (c) joint bilateral; (d) steering kernel; (e) proposed.
As shown in Table 1, the proposed method gives the best results in the test with synthetic data. On average, our method reduced the RMSE by about 0.37 compared with bilateral filtering and by about 0.17 compared with steering kernel filtering. Joint bilateral filtering has the worst result among them, because homogeneous depth information was often misrepresented as non-homogeneous data by unexpected edges in the color image. Figure 4 shows some refined depth maps from the noisy ones generated by joint
bilateral filter, steering kernel filter, and the proposed method. When we look at the region marked by a rectangle, our method makes the flat area smoother and the textured area sharper than the joint bilateral filter and the steering kernel filter, thanks to the adaptive weighting function based on the joint distance transform.

4.2 Real Data

In the next experiments, we tested the performance of the steering kernel filter [7] and our method with three real data sets: Bear doll, Actor1, and Actor2, obtained by a Kinect-style range sensor [11]. These tested images have an image resolution of 640×480 pixels. Figure 5(a) and Figure 5(b) show the tested color images and depth maps, respectively. Figure 5(c) and Figure 5(d) show the depth maps refined by the steering kernel filter and by the proposed method, respectively. When we look at the region marked by a circle, we can notice that the proposed method preserves important depth discontinuities better than the steering kernel filter.
Fig. 5. Results obtained with the proposed method for the tested real data (rows: Bear doll, Actor1, Actor2): (a) color image; (b) depth map; (c) steering kernel; (d) proposed.
Fig. 6. Results of 3-D scene reconstruction: (a) original scene; (b) steering kernel; (c) proposed.
Figure 6(a), Figure 6(b), and Figure 6(c) show 3D scenes reconstructed from the original depth map, the depth map refined by the steering kernel filter, and the depth map refined by our method for the Actor2 test data, respectively. We used a 3D scene reconstruction method [12] to generate the 3D scenes. As we can see in the calf area of the man, denoted by circles, the proposed method minimizes the noise in the depth map more efficiently than the steering kernel filter while preserving the depth-discontinuous regions more sharply.
5 Conclusion

We presented a new method using adaptive steering kernel regression based on the joint distance transform to enhance noisy depth maps. In this work, we refined the steering kernels adaptively based on the joint distance transform. When we applied our method to the tested synthetic data, we reduced the RMSE by about 0.37 and 0.17 compared with bilateral filtering and steering kernel filtering, respectively, on average. In the future, we will generalize the idea of the proposed method for use in various applications related to image enhancement.

Acknowledgements. This work was supported by DOE-URPR (Grant-DOEDEFG0286NE37968) and US Air Force Grant (FA8650-10-1-5902) in USA.
References
1. Henry, P., Krainin, M., Herbst, E., Ren, X., Fox, D.: RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments. In: Proc. of International Symposium on Experimental Robotics (2010)
2. Fehn, C., Barré, R., Pastoor, S.: Interactive 3-D TV - concepts and key technologies. Proceedings of the IEEE 94(3), 524–538 (2006)
3. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: IEEE Conf. on Computer Vision and Pattern Recognition (2011)
4. Elad, M.: On the origin of the bilateral filter and ways to improve it. IEEE Trans. on Image Processing 11(10), 1141–1150 (2002)
5. Petschnigg, G., Agrawala, M., Hoppe, H., Szeliski, R., Cohen, M., Toyama, K.: Digital photography with flash and no-flash image pairs. ACM Trans. on Computer Graphics 23(3), 664–672 (2004)
6. Riemens, A.K., Gangwal, O.P., Barenbrug, B., Berretty, R.M.: Multi-step joint bilateral depth upsampling. In: Proc. of Electronic Imaging, Visual Communications and Image Processing, pp. 1–12 (2009)
7. Takeda, H., Farsiu, S., Milanfar, P.: Deblurring using regularized locally-adaptive kernel regression. IEEE Trans. on Image Processing 17(4), 550–563 (2008)
8. Borgefors, G.: Hierarchical chamfer matching: a parametric edge matching algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence 10(6), 849–865 (1988)
9. Feng, X., Milanfar, P.: Multiscale principal components analysis for image local orientation estimation. In: Proc. of Asilomar Conference on Signals, Systems and Computers, pp. 478–482 (2002)
10. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Jour. of Computer Vision 47(1-3), 7–42 (2002)
11. Lee, E.K., Ho, Y.S.: Generation of high-quality depth maps using hybrid camera system for 3D video. Jour. of Visual Communication and Image Representation 22, 73–84 (2011)
12. Kim, S.Y., Lee, S.B., Ho, Y.S.: Three-dimensional natural video system based on layered representation of depth maps. IEEE Trans. on Consumer Electronics 52(3), 1035–1042 (2006)
Indented Pixel Tree Browser for Exploring Huge Hierarchies Michael Burch, Hansjörg Schmauder, and Daniel Weiskopf VISUS, University of Stuttgart
Abstract. In this paper we introduce the Indented Pixel Tree Browser, an interactive tool for exploring, annotating, and comparing huge hierarchical structures on different levels of granularity. We exploit the indented visual metaphor to map tree structures to one-dimensional zigzag curves to primarily achieve an overview representation for the entire hierarchy. We focus on space-efficiency and simultaneous uncovering of tree-specific phenomena. Each displayed plot can be filtered for substructures that are mapped to a larger space and hence unhide more fine-granular substructures that are hidden in the compressed overview. By representing tree structures side-by-side, the viewer can easily compare them visually and detect similar patterns as well as anomalies. In our approach, we follow the information-seeking mantra: overview first, zoom and filter, then details-on-demand. More interactive features such as expanding and collapsing of nodes, applying different color codings, or distorting the tree horizontally as well as vertically support a viewer when exploring huge hierarchical data sets. The usefulness of our interactive browsing tool is demonstrated in a case study for the NCBI taxonomy that contains 324,276 species and organisms that are hierarchically organized.
1 Introduction

Data sets containing hierarchically organized elements exist in a variety of forms. Today's software systems are huge and may consist of many thousands of files in a hierarchically structured file system. Such a hierarchy may contain millions of elements if we also take into account all the implemented lines on the source code level or the abstract syntax tree deduced thereof. Another interesting application where hierarchical data occurs is given by phylogenetic analyses that are applied to compute an evolutionary tree of life. The produced hierarchical data sets are oftentimes very large and an exploration of the textual data needs to be supported by interactive visualizations to uncover interesting insights in the data. The most convenient visual metaphor for depicting parent-child relationships is by using node-link diagrams as used by Reingold and Tilford [1]. The objects of a hierarchy are mapped to circular shapes and direct links between two circles express some kind of subordination, see Figure 1 (a). By exploiting the law of good continuation of Gestalt theorists [2] an effective and efficient interpretability of a tree structure via the link information can be achieved. Several node-link tree layouts have been developed that focus on making the tree structure clearer and on showing symmetries. The major drawback of node-link diagrams is the fact that there are many empty spaces between the graphical primitives which leads to scalability problems for large
trees. Icicle plots as used by Kruskal and Landwehr [3] have been developed to provide a more space-efficient representation that gives a clear impression of the tree structure, but too much space is used to visually map inner nodes, see Figure 1 (b). Treemaps developed by Shneiderman [4] are the most space-efficient diagram type that do not have any information gaps on screen and do not waste visual space for mapping hierarchical elements to the display, see Figure 1 (c). The major drawback of treemap representations is an error-prone interpretability of the tree structure and hence of parent-child relationships for huge hierarchies. We build the tree browser proposed in this work on the indented tree plot visual metaphor as used by Burch et al. [5], which has the decisive advantage that huge trees can be displayed in a single static diagram and the tree structure remains visible even for deep and huge trees, see Figure 1 (d). Furthermore, the indented plots can be represented in a compressed view with the benefit that the tree structure still remains visible. By using one-dimensional zigzag curves for displaying tree diagrams as in the indented plots, we can easily provide interactive features to manipulate and navigate in the original entire plot that serves as an overview representation. The user of the Indented Pixel
Fig. 1. An example hierarchy with eleven vertices shown in four possible visual tree metaphors: (a) Node-link diagram. (b) Layered icicle plot. (c) Treemap. (d) Indented Plot.
Tree Browser can easily explore subtrees by selecting subregions in an indented plot. The selected part is mapped to a larger display space below the plot where the selection was applied and hence, more fine-granular substructures of the hierarchy can be made visible. The selection process of subregions can be applied until the deepest level of a tree is reached. The detail view can be used to explore the labelling information and to make annotations based on it. Furthermore, a certain number of selected subregions of the hierarchical data set can be displayed in a side-by-side view and can easily be compared to each other with the goal of detecting similar patterns or anomalies. To illustrate the usefulness of our Indented Pixel Tree Browser we applied it to the NCBI taxonomy [6] containing 324,276 species and organisms. In Section 4 we demonstrate how to use the tool to explore this very huge hierarchical data set and provide some insights that we gained by using our browsing tool. The tool's interactive features are explained in Section 3.3 and illustrated in more detail in the case study on the real-world biological data set.
2 Related Work

The visualization and exploration of hierarchical structures has long been a focus of research [7,8] and still is today [9,10]. The most convenient type of diagram for depicting parent-child relationships is the node-link visual metaphor as used by Reingold and Tilford [1]. Showing connectedness for parent-child relationships as explicit links makes use of the law of good continuation of the Gestalt principles [2]. To obtain even more efficient diagrams, many variants of the node-link visual metaphor exist apart from the classical approach that places the root node on top of the display space and the child nodes on horizontal layers depending on their depth in the tree. Radial diagrams [11,12,13] and bubble or balloon layouts [12,14] have been developed to use the available space more efficiently. As a drawback, a comparison of elements on the same depth in the hierarchy is difficult because of spatial distortions and differently oriented subtrees. Bubble trees suffer from the fact that representative circles on which subtrees are visually encoded become very small even for subtrees with a low depth. Another strategy for displaying node-link diagrams is to draw the links in an orthogonal way, which means allowing only ninety-degree bends, leading to many parallel lines for huge hierarchies. Node-link tree visualization techniques and more general node-link graph visualizations are typically designed to follow some aesthetic drawing rules as presented by Purchase [15], such as avoiding link crossings or preserving symmetries if those exist. A more space-filling approach to depict hierarchies is given by layered icicles as proposed by Kruskal and Landwehr [3], which use stacked rectangles where representatives for inner nodes use as much horizontal space as the sum of the spaces used by all their child elements. The hierarchical structure is clearly discernible, but the drawback is the wasted space for the inner nodes and the fact that borders between inner nodes cannot be uncovered when trees have a high branching factor. Radial variants have been developed to achieve more aesthetically pleasing diagrams and to better exploit the given display space for deeper subtrees. Existing tools that focus on radial layered icicles include the Information Slices by Andrews and Heidegger [16], Sunburst by Stasko and Zhang [17], and InterRing by Yang et al. [18].
A compact and space-filling visualization technique is obtained by using nested rectangular shapes such as treemap representations introduced by Shneiderman [4]. Here too, many variants exist that focus on a better interpretability of the tree structure, such as squarified treemaps by Bruls et al. [19], cushion treemaps by van Wijk and van de Wetering [20], and Voronoi treemaps by Balzer et al. [21]. The problem with treemap representations is the fact that they need the full screen to scale to huge trees and hence a side-by-side representation of a variety of tree structures with different granularities is difficult. Furthermore, compared to Indented Pixel Tree Plots, the run time complexity of the layout and rendering algorithm is much higher in most cases, especially for Voronoi treemaps. In this work we are focusing on a space-efficient representation of hierarchies where the structures and substructures are still interpretable and a side-by-side representation is possible. Hence, we base our hierarchy browsing tool on Indented Pixel Tree Plots proposed and evaluated by Burch et al. [5]. These can be drilled down to very small regions on a display with the benefit that the structure of the hierarchy still remains visible and as little 'ink' as possible is used to draw such a plot. Furthermore, little space is used in the vertical direction since the hierarchical data is mapped to zigzag curves. The remaining space can be used to display additional information such as detail views of a certain number of subregions of more coarse-granular structures. Our browsing tool is based on the visual information-seeking mantra: overview first, zoom and filter, then details-on-demand [22]. The Information Slices by Andrews and Heidegger [16] exploit the radial layered icicle metaphor and use a concept similar to the Indented Pixel Tree Browser for representing more fine-granular hierarchical substructures. Regions on semi-circular discs can be expanded to obtain a more detailed view of a subhierarchy. However, their approach does not allow a side-by-side comparison of many subhierarchies on several granularity layers. The hyperbolic tree browser [23] is based on radial node-link diagrams and supports several interactive features to explore hierarchical data. As a consequence of using node-link diagrams, the technique only scales to a few thousand nodes. TreeJuxtaposer [24] allows good structural comparisons of large trees, but with orthogonal node-link representations a good overview of huge and deep hierarchies in a single static view is difficult. Annotations and highlighting are used in their approach to provide contextual information. The InterRing browser by Yang et al. [18] exploits the radial layered icicle visual metaphor, but it is difficult to browse into very deep and huge trees with the additional goal of having a side-by-side view of many subhierarchies at the same time. To the best of our knowledge, there is no hierarchy browsing tool that can show huge trees completely in a single static view and simultaneously allow side-by-side comparisons of substructures supported by interactive features.
3 Indented Pixel Tree Browser

We introduce the Indented Pixel Tree Browser for exploring huge hierarchical structures. The visualization technique is based on the indented visual metaphor omnipresent in graphical file browsers and pretty printing of source code. The approach benefits
from space-efficiency that leaves enough space for side-by-side comparisons and more fine-granular representations of subregions of the original entire data set. Apart from the entire overview plot, many interactive features are supported for manipulating, exploring, and comparing tree structures and substructures on different levels of granularity.
Fig. 2. A hierarchy with 54 vertices represented as: (a) Node-link diagram. (b) Indented tree plot without color coding. (c) Indented tree plot with black to red color coding depending on the depth in the tree. Alternating horizontal gray and white bars indicate hierarchy levels.
3.1 Indented Pixel Tree Plot

We model a hierarchy in the graph-theoretic sense as a tree T = (V, E), where V denotes the set of vertices and E ⊆ V × V denotes the set of directed edges that express parent-child relationships directed from the root to the leaves of the tree. One vertex is designated as the root vertex. Edges are only shown implicitly in an indented tree plot, in contrast to node-link diagrams where edges are explicitly drawn by direct links. The root vertex and all inner vertices are mapped to vertically aligned lines, whereas leaf vertices are mapped to single horizontal lines. This asymmetric handling of inner and leaf vertices leads to a better separation of both types of visual hierarchy elements. Parent-child relationships are expressed by indentation of the corresponding geometric shapes with respect to the hierarchy levels of the respective parent and child vertices. Figures 1 (a) and (d) and Figures 2 (a)-(c) illustrate how hierarchical data sets are visually mapped to node-link diagrams and to indented tree plots without (Figure 2 (b)) and with (Figure 2 (c)) color coding depending on the depth in the tree.

3.2 Selection of Subregions

To tap the full potential of the space-efficient indented tree plots, we support interactive features such as selecting subregions and displaying them in a side-by-side representation on different layers, starting with the overview layer at the top of the display and ending at the detail layer closest to the bottom of the display. Figures 3 (a)-(d) show how regions and subregions can be selected in an Indented Pixel Tree Plot and how layering and side-by-side views are achieved. In Figure 3 (a) we can see how one single subregion is selected from an Indented Pixel Tree Plot and how it is displayed on the layer below in a larger space, which allows exploring more fine-granular substructures of the tree. In Figure 3 (b) several subregions are selected on the same tree layer and displayed on the layer below in a side-by-side view. Figure 3 (c) shows how subregions and subsubregions can be selected and displayed on several
Fig. 3. A certain number of regions can be selected in an Indented Pixel Tree Plot and subregions can be selected again from it that are displayed in a side-by-side layered representation: (a) Selection of one region. (b) Selection of several regions on the first layer. (c) Selections on already selected regions are applied and displayed on several layers. (d) Side-by-side representation of selected regions on several layers.
layers until a certain granularity of the tree structure is reached. Figure 3 (d) shows a side-by-side representation of selected subregions and subsubregions on many layers. All views in a side-by-side representation are visually connected by linking and brushing features. Selecting one subregion in one plot highlights all occurrences of common elements in all other displayed plots in all side-by-side views. This mechanism helps preserve a viewer's mental map and sets the selected region in context to others.

3.3 Interactive Features

Our browsing tool supports a variety of interactive features to manipulate the hierarchical data and to explore, compare, and annotate it on different levels of granularity.
– Region selection: A certain region in each indented plot can be selected by the mouse drag-and-drop functionality. The selected subregion is displayed as another indented plot right below on the next layer in a larger display space.
– Marking and annotation: Relevant nodes can be annotated and marked as interesting elements. All of them are highlighted in each of the currently displayed indented plots in case the respective subregion contains that node.
– Hierarchy expanding/collapsing: Clicking on a graphical element that represents a node leads to a collapsed subtree with the selected node being the root node of that subtree. If the subtree is already collapsed, it will be expanded again.
– Vertical and horizontal distortion: To use the indented plots efficiently, vertical and horizontal distortions are allowed for each plot independently.
– Color coding: A variety of predefined color scales can be applied to each Indented Pixel Tree Plot separately, and the user is supported in creating his own color scale.
– Text pattern search: By typing in a text fragment, all hierarchical elements containing this text fragment as a subsequence of their label are highlighted.
– Geometric zoom/lens function: A lens function can be used to geometrically zoom into a region on screen.
– Details-on-demand: By moving the mouse cursor over a graphical primitive on screen, the corresponding detail information of that element is displayed as a tooltip at the current mouse cursor position.
– Data and PNG export: Selected subregions of an indented plot can be stored in the predefined Newick data format, which allows additional edge length information or arbitrary lists of attributes attached to the nodes and edges. Furthermore, selected views can be exported as a PNG image.
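To make the visual mapping of Section 3.1 concrete, the sketch below derives drawable segments of an indented plot from a tree by a depth-first traversal; the dictionary-based tree representation and the unit spacing along the zigzag curve are illustrative assumptions, not the tool's actual implementation.

```python
# Sketch of the indented mapping of Section 3.1: a depth-first walk assigns every vertex a
# position along the one-dimensional zigzag curve; inner vertices yield vertical marks at
# their hierarchy level, leaf vertices yield horizontal marks.
def indented_plot_segments(children, root):
    # children: dict mapping each vertex to the list of its child vertices.
    segments = []
    x = 0

    def visit(vertex, depth):
        nonlocal x
        kids = children.get(vertex, [])
        kind = 'vertical' if kids else 'horizontal'
        segments.append((kind, x, depth))
        x += 1
        for child in kids:
            visit(child, depth + 1)

    visit(root, 0)
    return segments

# Example: a root with one inner child and one leaf child.
example = {'root': ['a', 'b'], 'a': ['a1', 'a2']}
print(indented_plot_segments(example, 'root'))
```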
4 Case Study

To demonstrate the usefulness of our Indented Pixel Tree Browser, we apply it to the NCBI taxonomy, a hierarchical organization of species and organisms [6]. First of all, the entire tree structure is displayed in a single static diagram as a starting point for tree exploration.
Fig. 4. An Indented Pixel Tree Plot serves as an overview representation of the entire provided hierarchical data set. This plot is used as the starting point for the exploration and as a contextual view during the exploration process.
4.1 NCBI Taxonomy

The NCBI taxonomy [6] contains 324,276 hierarchically organized species and organisms. Exploring such a huge hierarchy is a challenging task since an overview of the entire tree in a single static view is difficult to provide. Figure 4 shows an overview representation of the entire data set. We can easily get a hint about how the structure of the hierarchy behaves and where the tree branches into deep substructures by using both vertical indentation and color coding that depends on the depth in the tree. Our first observation from the overview plot in Figure 4 is that the entire tree is structured asymmetrically. This means that it has an unbalanced form and it consists of several tree structures on the first hierarchy level that do not show similar hierarchy patterns. For instance, many subregions close to the center part of the plot branch very deep, down to depth 42. The leftmost and rightmost substructures behave differently. Their maximal depth is 10 to 15.
We already gained many insights from inspecting the static plot, but to get even more we can interactively explore different subregions of the tree in side-by-side and layered representations. This is illustrated in Figure 5, where a snapshot of the graphical user interface of our Indented Pixel Tree Browser is shown. The center view is used to browse the hierarchical structure. By using drag-and-drop operations the user can select subregions in each currently displayed plot, whereas subregions can also be parts of other already selected subregions in the same plot or they can totally enclose them. Each plot can be distorted vertically as well as horizontally, and color codings can be adjusted independently for both the Indented Pixel Tree Plot and the guiding patterns. Tree nodes can be highlighted as interesting elements and are consequently color coded in yellow (or any other user-defined color) in all of the displayed plots in which they occur. Already selected subregions can be shifted by using the drag-and-drop operation on the corresponding guiding pattern. All layers below the shifted one are also shifted and adapted. The leftmost panel of the GUI provides a variety of parameters that can be used to modify the appearance of the plot and to load hierarchical data from file or to store selected regions in the predefined Newick format or as a PNG image. The rightmost panel shows details-on-demand for selected graphical elements. The text fields in the lower part are used to apply textual as well as hierarchy-specific filters. By filtering the tree structure for depth 1 and applying the details-on-demand function we can find out that there are five child nodes named Unclassified sequences, Viroids, Viruses, Other Sequences, and Cellular organisms, see Figure 6. The Cellular organisms branch splits again into three subhierarchies, namely Archaea containing 3,312 nodes, Eukaryote containing 232,984 nodes, and Bacteria containing 79,613 nodes, the three main domains of life. Figure 7 shows how the entire plot can be used to browse the tree and get interesting insights into deeper hierarchical substructures. First of all, we are interested in the deepest substructure of the hierarchy and select this subregion. Further selections lead the viewer to the detail view where he can find the labeling information of some species located in this subhierarchy. On the deepest level in this subhierarchy we find the Cyprinodon species, a genus of small killifish belonging to the family Cyprinodontidae of ray-finned fish. Using the functionality of the browsing tool we can find out that it belongs to the Eukaryote subtree at the first hierarchy level. Fishes are structured in the deepest levels of the phylogenetic tree and their biodiversity is the reason for this visual phenomenon in the left part of the provided Indented Pixel Tree Plot in Figure 7. As a second example, we are interested in the long horizontal band on the right-hand side of the plot that catches one's eye. Using the detail view again, we find out that it belongs to Unclassified Bacteria Miscellaneous. Tracking the path to the root leads to the Bacteria main branch. There are many more insights that one might find in this data set by using the Indented Pixel Tree Browser, but we chose just some of them as illustrative examples.
Fig. 5. The graphical user interface of the Indented Pixel Tree Browser provides a variety of interactive features to manipulate the hierarchical data, to annotate and highlight subregions, to compare subtrees, and to navigate in it
Fig. 6. Collapsing all subhierarchies belonging to levels deeper than 2 in the entire hierarchy and showing the label details uncovers insights about the hierarchy structure right below the root
Fig. 7. Two selected subregions in the entire plot are mapped to the layer below to a larger display space and make more fine-granular substructures visible. Repetitions of this selection process finally lead to side-by-side detail views and additional labeling information.
5 Conclusion and Future Work

In this paper we demonstrated how the indented tree visual metaphor can be used to efficiently explore and compare huge and deep trees. We proposed the Indented Pixel Tree Browser, a tool for interactively navigating through huge hierarchical data sets. Furthermore, subtrees can easily be compared and explored for patterns and anomalies on different levels of granularity. We applied it to the NCBI taxonomy, a hierarchical data set containing 324,276 nodes classifying species and organisms. In this case study, we demonstrated how the interactive features can be applied to navigate, filter, and explore such a huge tree very fast and to get many insights from the data set. In the future, we plan to apply the browsing tool to data sets from software development, where hierarchical structures may consist of many millions of elements when inspecting the source code level or the abstract syntax tree. By applying filtering functions, linking and brushing to the real source code, color coding, and details-on-demand functions, such a browsing tool can be a great help for software developers when maintaining the source code or inspecting several statistics of interest in it. Static indented tree plots have been evaluated in our former work [5] and compared to node-link diagrams. A user study addressing more interactive plots should follow to check the usability of the Indented Pixel Tree Browser.

Acknowledgements. We would like to thank Dr. Kay Nieselt, University of Tübingen, for providing the NCBI taxonomy data set.
References
1. Reingold, E.M., Tilford, J.S.: Tidier drawings of trees. IEEE Transactions on Software Engineering 7, 223–228 (1981)
2. Koffka, K.: Principles of Gestalt Psychology. Harcourt-Brace, New York (1935)
3. Kruskal, J., Landwehr, J.: Icicle plots: Better displays for hierarchical clustering. The American Statistician 37, 162–168 (1983)
4. Shneiderman, B.: Tree visualization with tree-maps: a 2D space-filling approach. ACM Transactions on Graphics 11, 92–99 (1992)
5. Burch, M., Raschke, M., Weiskopf, D.: Indented pixel tree plots. In: Proceedings of International Symposium on Visual Computing, pp. 338–349 (2010)
6. Sayers, E.W., Barrett, T.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Research 37, 5–15 (2009)
7. Bertin, J.: Semiologie graphique: Les diagrammes, Les reseaux, Les cartes (2nd edition 1973, English translation 1983). Editions Gauthier-Villars, Paris (1967)
8. Knuth, D.: The Art of Computer Programming. Fundamental Algorithms, vol. I. Addison-Wesley, Reading (1968)
9. McGuffin, M.J., Robert, J.M.: Quantifying the space-efficiency of 2D graphical representations of trees. Information Visualization 9, 115–140 (2009)
10. Jürgensmann, S., Schulz, H.J.: A visual survey of tree visualization. Poster Compendium of the IEEE Conference on Information Visualization (2010)
11. Battista, G.D., Eades, P., Tamassia, R., Tollis, I.G.: Graph Drawing: Algorithms for the visualization of graphs. Prentice Hall, Upper Saddle River (1999)
12. Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: A survey. IEEE Transactions on Visualization and Computer Graphics 6, 24–43 (2000)
13. Eades, P.: Drawing free trees. Bulletin of the Institute for Combinatorics and its Applications 5, 10–36 (1992)
14. Grivet, S., Auber, D., Domenger, J.P., Melançon, G.: Bubble tree drawing algorithm. In: Wojciechowski, K., Smolka, B., Palus, H., Kozera, R.S., Skarbek, W., Noakes, L. (eds.) Computer Vision and Graphics, Dordrecht, The Netherlands, pp. 633–641. Springer, Heidelberg (2006)
15. Purchase, H.: Metrics for graph drawing aesthetics. Visual Languages and Computing 13, 501–516 (2002)
16. Andrews, K., Heidegger, H.: Information slices: Visualising and exploring large hierarchies using cascading, semi-circular discs. In: Proceedings of the IEEE Information Visualization Symposium, Late Breaking Hot Topics, pp. 9–12 (1998)
17. Stasko, J.T., Zhang, E.: Focus+context display and navigation techniques for enhancing radial, space-filling hierarchy visualizations. In: Proceedings of the IEEE Symposium on Information Visualization, pp. 57–66 (2000)
18. Yang, J., Ward, M.O., Rundensteiner, E.A., Patro, A.: InterRing: a visual interface for navigating and manipulating hierarchies. Information Visualization 2, 16–30 (2003)
19. Bruls, M., Huizing, K., van Wijk, J.: Squarified treemaps. In: Proceedings of Joint Eurographics and IEEE TCVG Symposium on Visualization, pp. 33–42 (2000)
20. van Wijk, J.J., van de Wetering, H.: Cushion treemaps: Visualization of hierarchical information. In: Proceedings of Information Visualization, pp. 73–78 (1999)
21. Balzer, M., Deussen, O., Lewerentz, C.: Voronoi treemaps for the visualization of software metrics. In: Proceedings of Software Visualization, pp. 165–172 (2005)
22. Shneiderman, B.: The eyes have it: A task by data type taxonomy for information visualizations. In: Proceedings of the IEEE Symposium on Visual Languages, pp. 336–343 (1996)
23. Lamping, J., Rao, R., Pirolli, P.: A focus+context technique based on hyperbolic geometry for viewing large hierarchies. In: Proceedings of Human Factors in Computing Systems, pp. 401–408 (1995)
24. Munzner, T., Guimbretière, F., Tasiran, S., Zhang, L., Zhou, Y.: TreeJuxtaposer: scalable tree comparison using focus+context with guaranteed visibility. ACM Transactions on Graphics 22, 453–462 (2003)
Towards Realtime Handheld MonoSLAM in Dynamic Environments
Samunda Perera and Ajith Pasqual
Department of Electronic & Telecommunication Engineering, University of Moratuwa, Moratuwa, Sri Lanka {samunda,pasqual}@ent.mrt.ac.lk http://www.ent.mrt.ac.lk
Abstract. Traditional monoSLAM assumes stationary landmarks, making it unable to cope with dynamic environments where moving objects are present in the scene. This paper presents a parallel implementation of monoSLAM with a set of independent EKF trackers, where stationary features and moving features are tracked separately. The difficult problem of detecting moving points from a moving camera is addressed using the epipolar constraint computed from the measurement information already available in the monoSLAM algorithm. While doing so, SLAM measurement outlier rejection is also performed. Results are presented to verify and highlight the advantages of our approach over traditional SLAM. Keywords: SLAM, monoSLAM.
1 Introduction
Simultaneous Localization And Mapping (SLAM) is the widespread method utilized by a mobile robot to build a consistent map of a previously unknown environment while simultaneously determining its pose within this map. Successful SLAM applications now exist in a variety of domains including indoor, outdoor, aerial and underwater, using different types of sensor devices such as laser scanners, sonars and cameras as appropriate to the domain. However, the majority of those applications use laser scanners due to the accurate and dense measurements returned. On the other hand, using cameras in SLAM provides several unique advantages over the others. They are inexpensive, low power, compact and capture scene information up to virtually infinite depth. Moreover, human-like visual sensing and the potential availability of higher level semantics in an image make them well suited for augmented reality applications [1] and human assistive navigation. In fact, recent advancements in visual SLAM have shown interesting results with both stereo/multi camera SLAM systems [2,3] and monocular camera SLAM (monoSLAM) systems [4,5,6]. However, real-world environments, even an indoor desktop scenario, can contain moving objects; they are therefore dynamic in general and can violate a fundamental assumption made in conventional SLAM solutions, i.e., that map features
(landmarks) are stationary. Therefore, if moving features are mistakenly added to the SLAM map, the solution can become inconsistent. Moreover, stationary features already in the SLAM map can become temporarily occluded by moving objects, resulting in incorrect feature matches or failed (and expensive) feature searches, and in unnecessary feature deletions followed by re-initializations. In the work by Castle et al. [1], recognition of known planar objects was used to augment the scene and these objects' locations were incorporated as additional 3D measurements into a monoSLAM process. In such a situation, movement of a known object can also result in inconsistencies. Therefore, detection and tracking of moving objects is vital to SLAM performance in dynamic environments. Although the SLAM solution was first obtained two decades ago [7], it was in 2003 that the formal derivation of the Bayesian formulation of the SLAM with Detection And Tracking of Moving Objects (DATMO) problem was published [8]. There, the authors observe that moving features are highly unpredictable and that calculating a joint posterior over all features would be costly. Therefore, they maintained separate posteriors for the stationary and moving features and presented results using data from laser scanners and odometry. In addition, the authors acknowledged that using cameras to detect moving objects is harder than using laser scanners, yet cameras provide useful information. The work of Wolf et al. [9] on SLAM in dynamic environments used only SICK laser range finders. Zhou et al. have presented results of SLAM in semi-dynamic environments using laser range scanners [10]. There, different objects were associated with unique RFID tags and these RFID tags were then used to localize dynamic objects. Very recently, two research teams have produced results on visual SLAM with object tracking in indoor environments. Wangsiripitak et al. [11] present results obtained by separately tracking a 3D moving object; this information is used by the SLAM system to avoid deleting temporarily occluded features and to ignore features on moving objects. However, the 3D object identification and pose initialization were done by hand and the authors acknowledge that their method is not widely applicable. Meanwhile, Migliore et al. [12] present a monoSLAM system with detection and tracking of moving point features. Here, a newly observed generic feature is tracked using a separate EKF tracker, and detection of movement is based on continuous evaluation of the intersections of three viewing rays belonging to the feature in three different camera poses. To address the uncertainties in this reasoning, a probabilistic framework called Uncertain Projective Geometry was utilized. However, no details on the processing times were given. In this paper we apply an alternative method for a handheld monoSLAM system to survive in dynamic environments. We propose to cast the moving point detection problem as an outlier detection problem. A complete SLAM system may include outlier detection and rejection in its measurement stage to filter out incorrect feature matches and to extract feature matches corresponding to the dominant ego motion. However, with a large proportion of moving points the dominant motion assumption can be problematic. Further, we argue that
detected outliers should not be discarded (as is common in usual outlier rejection stages) since they could carry information about the moving points. The knowledge of the location of moving points allows reasoning about occlusion of static landmarks in the SLAM map. Therefore, we describe a demonstrative monoSLAM system with a set of independent EKF trackers where stationary features and moving features are tracked separately. The system addresses moving point detection based on the epipolar constraint evaluated from the fundamental matrix estimated directly from SLAM measurements. Further, in doing so, incorrect data associations are also rejected.
2 Our Approach
As depicted in Figure 1, we separate the stationary SLAM process from the moving point tracking process. The moving point tracker process consists of independent EKF trackers. Similar to Wang et al. [8], our solution first performs a SLAM update and passes the SLAM posterior information (specifically the camera pose xp and its error covariance Pxpxp) to the tracker process, which then uses it to obtain the independent moving point posteriors. After the SLAM measurement stage, the fundamental matrix (F) is estimated while rejecting incorrect measurements as outliers. Once the tracker EKF measurement step is complete, moving-stationary reasoning (Section 5) is applied based on the epipolar constraint evaluated from the fundamental matrix. If a moving feature can be classified as stationary, it is immediately converted into a stationary feature in the SLAM map; otherwise it is retained in the tracker process. It should be noted that our approach differs from [12] in that we utilize the fundamental matrix estimated in the SLAM measurement outlier rejection process for moving point detection, whereas [12] uses SLAM camera pose estimates. Further, [12] does not appear to use the occlusion information. Though our method is not able to function in the presence of all kinds of motion, it improves the performance of SLAM in dynamic environments.
3 SLAM
As in [13,14] we assume that the handheld camera undergoes 6 DOF smooth motion and choose a constant linear velocity, constant angular velocity model. The camera pose x_p is represented by the 3D position r^W and the rotation quaternion q^{WR}. Here the indices W and R denote the world frame and the camera frame, respectively. As noted earlier, the uncertainty of the pose is given by the error covariance P_{x_p x_p}.

$$x_p = \left[ r^W ; q^{WR} \right] = (x, y, z, q_0, q_x, q_y, q_z)^T \quad (1)$$

The camera total state x_v consists of the pose x_p, the linear velocity v^W and the angular velocity ω^R.

$$x_v = \left[ r^W ; q^{WR} ; v^W ; \omega^R \right] \quad (2)$$
316
S. Perera and A. Pasqual
Fig. 1. Our Approach (flowchart of the parallel SLAM and tracker processes: time update, measurement, fundamental matrix computation with outlier rejection, classification, measurement update, and addition of generic/stationary features)
Within a dt time interval, an unknown linear acceleration a^W and an angular acceleration α^R can occur and contribute to the process noise n with components V^W and Ω^R, respectively:

$$n = \begin{bmatrix} V^W \\ \Omega^R \end{bmatrix} = \begin{bmatrix} a^W \, dt \\ \alpha^R \, dt \end{bmatrix} \quad (3)$$

The camera state time update is as follows:

$$\begin{bmatrix} r^W_{new} \\ q^{WR}_{new} \\ v^W_{new} \\ \omega^R_{new} \end{bmatrix} = \begin{bmatrix} r^W + (v^W + V^W)\, dt \\ q^{WR} \times q\big((\omega^R + \Omega^R)\, dt\big) \\ v^W + V^W \\ \omega^R + \Omega^R \end{bmatrix} \quad (4)$$
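As an illustration, the constant-velocity time update of Eq. (4) can be written in a few lines of NumPy. This is a sketch only; the quaternion convention and the function names (quat_mult, camera_time_update) are our own and not taken from the authors' implementation.

```python
import numpy as np

def quat_mult(q1, q2):
    # Hamilton product of quaternions given as (q0, qx, qy, qz).
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_from_rotvec(theta):
    # Quaternion corresponding to a rotation vector theta (axis * angle).
    angle = np.linalg.norm(theta)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = theta / angle
    return np.concatenate(([np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis))

def camera_time_update(r, q, v, w, V, Omega, dt):
    # Eq. (4): constant linear/angular velocity model with process noise (V, Omega).
    r_new = r + (v + V) * dt
    q_new = quat_mult(q, quat_from_rotvec((w + Omega) * dt))
    q_new /= np.linalg.norm(q_new)   # keep the quaternion normalized
    return r_new, q_new, v + V, w + Omega
```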
For map features F_i, i = 1 . . . m, we use the inverse scaling parametrization suggested by [6], which improves the linearity of the measurement equation over the previous inverse depth parametrization [4]. The feature state is thus given by x^W_{F_i} = (x_i, y_i, z_i, ω_i)^T. Observations are governed by the pinhole camera model,

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} u_0 - f\, h^R_{Lx} / h^R_{Lz} \\ v_0 - f\, h^R_{Ly} / h^R_{Lz} \end{bmatrix} \quad (5)$$
where (u_0, v_0), f and (h^R_{Lx}, h^R_{Ly}, h^R_{Lz}) are the principal point, the focal length (in pixels) and the 3D coordinates of the feature in the camera frame, respectively. However, in order to account for the camera distortion we apply the classical two-parameter radial distortion model as well.
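A minimal NumPy sketch of this camera model is given below. The pinhole part follows Eq. (5); the radial distortion uses the common two-parameter polynomial form as one example of a classical two-parameter model, so the coefficients k1 and k2 and their exact role are assumptions of this sketch rather than the authors' exact formulation.

```python
import numpy as np

def project_point(h_cam, f, u0, v0, k1=0.0, k2=0.0):
    """Project a 3D point given in the camera frame onto the image plane.

    Pinhole projection as in Eq. (5); radial distortion modeled as a
    two-parameter polynomial about the principal point (illustrative only).
    """
    hx, hy, hz = h_cam
    # Ideal (undistorted) pinhole projection, Eq. (5).
    u = u0 - f * hx / hz
    v = v0 - f * hy / hz
    # Radial distortion about the principal point.
    du, dv = u - u0, v - v0
    r2 = (du * du + dv * dv) / (f * f)
    factor = 1.0 + k1 * r2 + k2 * r2 * r2
    return u0 + factor * du, v0 + factor * dv

# Example: a point one metre in front of the camera, slightly off-axis.
print(project_point(np.array([0.05, -0.02, 1.0]), f=400.0, u0=160.0, v0=120.0, k1=0.1))
```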
4 EKF Tracker
A generic feature first observed by the camera can be either stationary or moving. In general we assume that a generic feature G_j, j = 1 . . . n, can be moving with velocity v^W_{G_j} and use an independent EKF filter to track each moving feature. The state of a generic feature is given as

$$x_j = \begin{bmatrix} x^W_{G_j} \\ v^W_{G_j} \end{bmatrix} \quad (6)$$

where x^W_{G_j} is in the inverse scaling parametrization. It should be noted that we represent the moving feature entirely in the world frame instead of a mixed-frame representation as in [12]. When the feature is first initialized, we assume a prior for the inverse scale ŵ as in [6] and the initial velocity v̂^W_{G_j} is taken as (0, 0, 0)^T.

$$x^W_{G_j} = \begin{bmatrix} R^W_R & r^W \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_0 - u \\ v_0 - v \\ f \\ \hat{w} \end{bmatrix} \quad (7)$$

However, in order to keep track of a moving feature, the uncertainty of the velocity (M) is kept at an acceptable magnitude. When C is the noise covariance of the camera measurement z = (u, v) and σ_ω is the standard deviation of the inverse scale prior, the error covariance P of the newly initialized moving feature is given by

$$P = J \begin{bmatrix} P_{x_p x_p} & 0 & 0 & 0 \\ 0 & C & 0 & 0 \\ 0 & 0 & \sigma_\omega^2 & 0 \\ 0 & 0 & 0 & M \end{bmatrix} J^T, \quad \text{where } J = \begin{bmatrix} \frac{\partial x_j}{\partial x_p} & \frac{\partial x_j}{\partial z} & \frac{\partial x_j}{\partial \hat{w}} & \frac{\partial x_j}{\partial \hat{v}^W_{G_j}} \end{bmatrix}. \quad (8)$$

The time update equation for the moving feature is as follows. Here V^W_n is a noise velocity that accounts for the uncertain motion of a moving feature.
$$x_j = \begin{bmatrix} \begin{bmatrix} I & v^W_{G_j}\, dt + V^W_n \\ 0 & 1 \end{bmatrix} x^W_{G_j} \\ v^W_{G_j} + V^W_n \end{bmatrix} \quad (9)$$

The innovation covariance S_i, which permits an active search for the moving feature, is calculated as

$$S_i = H P H^T + D R D^T \quad (10)$$
Here H is the Jacobian of the camera observation function h with respect to x_j. D and R are as below (v_j stands for the measurement noise).
$$D = \begin{bmatrix} \frac{\partial h}{\partial x_p} & \frac{\partial h}{\partial v_j} \end{bmatrix}, \qquad R = \begin{bmatrix} P_{x_p x_p} & 0 \\ 0 & C \end{bmatrix} \quad (11)$$
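For completeness, the innovation covariance of Eq. (10) and the resulting elliptical search region can be computed as in the following generic sketch. The Jacobians H and D are assumed to be supplied by the tracker, and the chi-square gate value is our own illustrative choice.

```python
import numpy as np

def innovation_covariance(H, P, D, R):
    # Eq. (10): S = H P H^T + D R D^T.
    return H @ P @ H.T + D @ R @ D.T

def inside_search_ellipse(z_pred, z_meas, S, gate=9.21):
    # Chi-square gate with 2 DOF (9.21 corresponds to roughly 99%); it defines
    # the elliptical active-search region around the predicted measurement.
    nu = np.asarray(z_meas, dtype=float) - np.asarray(z_pred, dtype=float)
    return float(nu @ np.linalg.solve(S, nu)) < gate
```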
5 Moving Point Detection
For a moving camera, epipolar geometry describes the intrinsic projective geometry between two sequential views, which is characterized by the fundamental matrix F. For a stationary 3D point X, an image point x in the first view defines a constraint: an epipolar line l' in the second view on which the corresponding point x' lies. The epipolar line and the epipolar constraint are given by l' = Fx and x'^T F x = 0. In case the 3D point X is moving, x' will either not lie on l' or move along l'. The former situation gives rise to a nonzero perpendicular distance d from the epipolar line and acts as an indication of the point's motion. If l' = (a, b, c)^T and x' = (p, q, 1)^T, then

$$d = \frac{|ap + bq + c|}{\sqrt{a^2 + b^2}}.$$

The latter situation, identified as degenerate motion, usually occurs when the camera follows objects moving in the same direction. Here we make the assumption that a handheld camera operating in a SLAM scenario always undergoes at least a slight rotation in addition to translation, and thus the chances of degenerate motion are low. A further relaxation is possible by tilting the camera w.r.t. the ground plane, where moving objects are mostly present. This distance calculation requires computing the fundamental matrix for the two views. For that we use the data association information already computed in the measurement step of SLAM. When 8 or more point correspondences are available, the normalized 8-point algorithm [15] can be used. To reduce epipolar line instability due to noise, we use about 12-25 point matches. In order to have a wider baseline between the camera positions, old measurement lists are kept in a FIFO (First In First Out) queue and the epipolar geometry is computed between views several (e.g., 20) frames apart instead of two consecutive frames. In SLAM, feature image patches are actively searched inside an elliptical search region defined by the innovation covariance, minimizing the chance of incorrect data associations. It is usual to assume a 0.5-1 pixel standard deviation for the measurement noise. Yet in case of large search regions and repetitive texture patterns the search can return false matches. Since the 8-point algorithm is sensitive to such outliers, we estimate the fundamental matrix using RANSAC. Fundamental matrix estimation using RANSAC first involves selecting a random subset (8 correspondences) of the total point correspondences, estimating the fundamental matrix using the normalized 8-point algorithm, and identifying the inliers in the total set that lie within a user-given re-projection threshold distance (in our case 2 pixels) from the estimated epipolar lines. Here outliers can be due to incorrect data associations, which are experienced in cluttered environments, as well as due to the movement of a feature which was earlier stationary.
This process is repeated until a maximum number of inliers is observed, and all inliers are then used with the normalized 8-point algorithm to produce the final fundamental matrix estimate. One may also use the camera intrinsic matrix K, the rotation R and the translation t between the two views in order to compute the fundamental matrix from F = [Kt]_× K R K^{-1} [16]. However, due to the loose smooth-motion assumption on the handheld camera state, we avoid such an approach. Moreover, if F is computed from the SLAM posterior poses, we lose the advantage of outlier rejection, which should happen prior to computing the posterior estimates. Once we have the fundamental matrix estimate, the mean of the distance to the epipolar line is computed recursively over several measurements and used in the moving point classification. If the mean distance is lower than a practical threshold (2 pixels, in accordance with the earlier re-projection threshold), a point is considered static and converted into a SLAM map feature; otherwise the point is kept as a moving feature. It should be noted that, in order to handle the effect of lens distortion, the observed (distorted) measurements are first undistorted prior to the fundamental matrix estimation and the calculation of the deviation from the epipolar lines.
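The following sketch illustrates this step using OpenCV's RANSAC-based fundamental matrix estimation and the point-to-line distance defined above. It is an illustration only, not the authors' code: the input is assumed to be undistorted pixel coordinates from two views several frames apart, and the 2-pixel threshold matches the value quoted in the text.

```python
import numpy as np
import cv2

def classify_by_epipolar_distance(pts_old, pts_new, thresh=2.0):
    """Estimate F with RANSAC and measure each point's distance to its epipolar line.

    pts_old / pts_new: Nx2 arrays of undistorted pixel coordinates.
    Returns (F, distances); points whose mean distance over time stays below
    `thresh` would be treated as static, the rest as moving.
    """
    F, mask = cv2.findFundamentalMat(
        pts_old, pts_new, cv2.FM_RANSAC,
        ransacReprojThreshold=thresh, confidence=0.99)
    if F is None:
        return None, None
    ones = np.ones((len(pts_old), 1))
    lines = (F @ np.hstack([pts_old, ones]).T).T            # epipolar lines l' = F x
    a, b, c = lines[:, 0], lines[:, 1], lines[:, 2]
    p, q = pts_new[:, 0], pts_new[:, 1]
    d = np.abs(a * p + b * q + c) / np.sqrt(a * a + b * b)  # point-to-line distance
    return F, d
```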
6 Experimental Results and Discussion
A real-time monoSLAM system was implemented using C++ on an Intel Core2 Duo 2.4 GHz laptop with an IEEE 1394 FireWire camera (CMU1394 driver [17]) providing 320x240 images at 30 Hz. For 3D visualization, VTK [18] was used. The camera was calibrated using standard software [19], and in order to set the correct metric scale of monoSLAM, a planar chessboard pattern of known grid size was used. When the system starts operation, the chessboard corners are detected and the available camera intrinsic and radial distortion parameters are used to obtain the relative rotation and translation of the chessboard coordinate frame with respect to the camera coordinate frame. Here we consider the initial camera pose as the world coordinate frame, thus the 3D locations of the chessboard corners can be determined in the world frame. A subset of these is then used in monoSLAM as the initial/known landmarks. Figure 2(a) shows our system in operation. Here the camera 3D trajectory, orientation and map are illustrated. Figure 2(b) shows the correct detection of a moving box as a collection of moving features (the moving features' active search regions are visualized as white ellipses). Stationary features are in red and features not selected for measurement are in yellow. A typical breakdown of the processing times of the different stages for a map size of 62 features is given in Table 1. Here the main SLAM and tracker process runs in a high-priority thread and visualization runs at low priority to enable full 30 Hz operation. The run times reported are based on profile-guided optimization with double-precision floating-point EKF calculations.
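A sketch of the chessboard-based initialization using standard OpenCV calls is shown below. The pattern size and square length are example values, not the ones used in the paper, and the function name is hypothetical.

```python
import numpy as np
import cv2

def init_landmarks_from_chessboard(gray, K, dist, pattern=(8, 5), square=0.03):
    """Recover the camera pose w.r.t. a chessboard of known grid size and return
    the corner positions expressed in the initial camera (world) frame."""
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if not found:
        return None
    # 3D corner coordinates in the chessboard frame (z = 0 plane).
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square
    ok, rvec, tvec = cv2.solvePnP(objp, corners, K, dist)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)
    # With the first camera pose taken as the world frame, these corner
    # positions serve as the initial/known landmarks.
    return (R @ objp.T + tvec.reshape(3, 1)).T
```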
Fig. 2. (a) Camera Trajectory & SLAM Map (b) Moving Point Detection

Table 1. Processing Times

Stage                                    Time (ms)
Image acquisition                        1
SLAM prediction                          0.5
Feature selection and measurement        4
Fundamental matrix estimation            2.5
SLAM update                              13
Tracker predict, measurement, update     2
Feature addition                         6
To verify the correct operation of our approach and to demonstrate the improvement gained, we performed experiments on both synthetic and real video sequences. Synthetic scenes were created using a recent version of [20] and consisted of 2 perpendicular planes (black-white checker pattern) and a cube (red-blue checker pattern), which act as the stationary scene and the moving object, respectively. A comparison between SLAM with moving point tracking and traditional SLAM is given in Figure 3. Here the camera translates along the x axis direction while an object moves in the scene (non-degenerate motion). Our method was able to successfully recover the actual linear trajectory while traditional SLAM failed. An interesting situation holds when both the camera and the cube translate in the same direction (a degenerate motion). Here, as expected, the moving points are erroneously identified as static. However, the estimated camera trajectory is still a straight line (Figure 4). This is not a surprise since, despite the points' movement, the innovation direction remains fixed and only the magnitude changes. In a real video sequence we translated the camera parallel to the x axis of the coordinate frame, while a box was selected as the moving object and moved in the scene. As Figure 5 illustrates, traditional monoSLAM fails to recover the
Fig. 3. Comparison between traditional SLAM and our method: (a) traditional SLAM - moving points identified as static; (b) our method - moving points distinguished; (c) traditional SLAM - estimate; (d) our method - estimate
actual motion of the camera and deviates significantly from the actual motion parallel to the x axis. However, our method based on SLAM with outlier rejection and tracking follows the actual motion very closely. In addition, with basic SLAM there is a clear error (Figure 6) in the camera 3D orientation, which should ideally lie in the xy plane. The video related to these experiments will be available on the authors' homepage. It should be noted that fundamental matrix computation using point correspondences can sometimes result in degeneracies without a unique solution. This mostly happens when the two images of the scene are related by a homography H. This homography relation holds true when the camera motion is rotation only or when the scene is planar. In such cases a two-parameter family of solutions is available for the fundamental matrix of the type F = SH, where S is skew-symmetric [16]. Even so, the epipolar constraint is still valid for any image point which satisfies H, but the two-parameter family of solutions results in rotating epipolar lines centered on the corresponding image point. However, since we are using the mean of the distance to the epipolar line over several measurements, this did not adversely affect our moving point classification process. An alternative approach would be to detect the degeneracy and to apply the more specific homography constraint for moving point detection.
Fig. 4. Results for a degenerate motion: (a) frame 159 - epipolar lines travel through moving points; (b) frame 162 - moving points identified as static (red); (c) SLAM estimate
Fig. 5. MonoSLAM camera trajectory in a dynamic environment. The actual trajectory parallel to the x axis is only recovered by SLAM with tracking (right), whereas general SLAM (left) fails.
Fig. 6. MonoSLAM camera orientation in a dynamic environment. The actual orientation perpendicular to the z axis is only recovered by SLAM with tracking (right), whereas with general SLAM (left) there is a clear misorientation.
7 Conclusion and Future Work
This paper presented a system which applies the epipolar constraint to monoSLAM for detecting moving features in the environment and thereby improves monoSLAM performance. Successful 30 Hz realtime operation in a dynamic environment was illustrated. The system integrates SLAM measurement outlier rejection with moving point detection and tracking, making it superior to basic SLAM. Outlier rejection accounts for mismatched features in SLAM and, to a certain level, for features in the SLAM map that were previously stationary but are currently moving. Our method can be readily applied to any basic feature-based SLAM system. One limitation of our work is the requirement of a threshold for correct moving point classification. The threshold needs to be adjusted manually depending upon the expected number of point correspondences, and performance deteriorates with a low number of correspondences. Therefore, a data-driven threshold for the moving point classification logic based on the fundamental matrix uncertainty is preferred. Another issue is the difficulty of tracking individual moving points across time. For example, it often fails to completely reason about the occlusion boundaries of a moving person. Therefore, we anticipate exploring dense approaches in the future. Acknowledgments. The authors wish to thank Dr. Ranga Rodrigo for valuable advice and support. Clarifications provided by Dr. Andrew Davison and Prof. Richard Hartley are greatly acknowledged. Many thanks to Jan Funke for extending the framework [20] to dynamic environments.
References 1. Castle, R.O., Klein, G., Murray, D.W.: Combining monoSLAM with object recognition for scene augmentation using a wearable camera. Journal of Image and Vision Computing 28, 1548–1556 (2010) 2. Paz, L., Pinies, P., Tardos, J., Neira, J.: Large-scale 6-DOF SLAM with stereo-inhand. IEEE Transactions on Robotics 24, 946–957 (2008) 3. Sola, J., Monin, A., Devy, M., Vidal-Calleja, T.: Fusing monocular information in multicamera SLAM. IEEE Transactions on Robotics 24, 958–968 (2008) 4. Civera, J., Davison, A., Montiel, J.: Inverse depth parametrization for monocular SLAM. IEEE Transactions on Robotics 24, 932–945 (2008)
5. Marzorati, D., Matteucci, M., Migliore, D., Sorrenti, D.G.: Monocular SLAM with inverse scaling parametrization. In: BMVC 2008 (2008) 6. Marzorati, D., Matteucci, M., Migliore, D., Sorrenti, D.G.: On the use of inverse scaling in monocular SLAM. In: IEEE International Conference on Robotics and Automation, ICRA 2009, pp. 2030 –2036 (2009) 7. Smith, R., Self, M., Cheeseman, P.: Estimating uncertain spatial relationships in robotics. In: Cox, I.J., Wilfong, G.T. (eds.) Autonomous Robot Vehicles, vol. 8, pp. 167–193 (1990) 8. Wang, C.C., Thorpe, C., Thrun, S.: Online simultaneous localization and mapping with detection and tracking of moving objects: theory and results from a ground vehicle in crowded urban areas. In: Proceedings of IEEE International Conference on Robotics and Automation, ICRA 2003, vol. 1, pp. 842–849 (2003) 9. Wolf, D., Sukhatme, G.: Online simultaneous localization and mapping in dynamic environments. In: Proceedings of IEEE International Conference on Robotics and Automation, ICRA 2004, vol. 2, pp. 1301–1307 (2004) 10. Zhou, H., Sakane, S.: Localizing objects during robot SLAM in semi-dynamic environments. In: IEEE/ASME International Conference on Advanced Intelligent Mechatronics, AIM 2008, pp. 595–601 (2008) 11. Wangsiripitak, S., Murray, D.: Avoiding moving outliers in visual SLAM by tracking moving objects. In: IEEE International Conference on Robotics and Automation, ICRA 2009, pp. 375 –380 (2009) 12. Migliore, D., Rigamonti, R., Marzorati, D., Matteucci, M., Sorrenti, D.G.: Use a single camera for simultaneous localization and mapping with mobile object tracking in dynamic environments. In: ICRA Workshop on Safe Navigation in Open and Dynamic Environments: Application to Autonomous Vehicles (2009) 13. Davison, A.: Real-time simultaneous localisation and mapping with a single camera. In: Proceedings of Ninth IEEE International Conference on Computer Vision, vol. 2, pp. 1403–1410 (2003) 14. Davison, A.J., Reid, I.D., Molton, N.D., Stasse, O.: monoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 1052–1067 (2007) 15. Hartley, R.: In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 580–593 (1997) 16. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) ISBN: 0521540518 17. Baker, C., Ulrich, I.: CMU 1394 digital camera driver (2008), http://www.cs.cmu.edu/~ iwan/1394/ 18. Schroeder, W.J., Avila, L.S., Hoffman, W.: Visualizing with VTK: A tutorial. IEEE Computer Graphics and Applications 20, 20–27 (2000) 19. Bouguet, J.Y.: Camera calibration toolbox for matlab (2008), http://www.vision.caltech.edu/bouguetj/calib_doc/ 20. Funke, J., Pietzsch, T.: A framework for evaluating visual SLAM. In: BMVC 2009 (2009)
Registration of 3D Geometric Model and Color Images Using SIFT and Range Intensity Images
Ryo Inomata1, Kenji Terabayashi1, Kazunori Umeda1, and Guy Godin2
1 Chuo University, 1–13–27 Kasuga, Bunkyo-ku, Tokyo 112–8551, Japan [email protected], {terabayashi,umeda}@mech.chuo-u.ac.jp
2 National Research Council, 1200 Montreal Road, Ottawa, Ontario K1A 0R6, Canada
Abstract. In this paper, we propose a new method for 3D-2D registration based on SIFT and a range intensity image, which is a kind of intensity image simultaneously acquired with a range image using an active range sensor. A linear equation for the registration parameters is formulated, which is combined with displacement estimations for extrinsic and intrinsic parameters and the distortion of a camera’s lens. This equation is solved to match a range intensity image and a color image using SIFT. The range intensity and color images differ, and the pairs of matched feature points usually contain a number of false matches. To reduce false matches, a range intensity image is combined with the background image of a color image. Then, a range intensity image is corrected for extracting good candidates. Moreover, to remove false matches while keeping correct matches, soft matching, in which false matches are weakly removed, is used. First, false matches are removed by using scale information from SIFT. Secondly, matching reliability is defined from the Bhattacharyya distance of the pair of matched feature points. Then RANSAC is applied. In this stage, its threshold is kept high. In our approach, the accuracy of registration is advanced. The effectiveness of the proposed method is illustrated by experiments with real-world objects.
1 Introduction
As information technology is becoming more widely used, research that creates realistic models using computer graphics techniques is increasing in importance [1]. Texture mapping on scanned real-world objects, which maps texture images measured using a color sensor onto a 3D geometric model measured by a range sensor, is one of the methods for creating realistic models. Usually, a 3D geometric model and color image are independently obtained from different viewing positions through range and color sensors. Thus, the registration of a 3D geometric model and color images is necessary. One approach is the method of registration using a silhouette image and a 2D image contour [2], [3], [4], [5]. The concept is that a silhouette image from 3D geometry is compared with a 2D image. Iwakiri et al. [2] proposed a fast
Fig. 1. (a) Range intensity image; (b) color image
method using hierarchical silhouette matching. Neugebauer et al. [5] estimated camera parameters by matching the features of a 3D model and a 2D image by hand. On the other hand, one category of registration techniques relies on the use of range intensity images. An active range sensor measures distance by projecting light and measuring the property of the reflected light. The amount of reflected light is related to the reflectance ratio of the measured points. This is called ”reflectance image” or ”range intensity image.” The range intensity image is similar to a color image as shown in Fig. 1. These images are used in the 3D-2D registration. The similarity of two images is used by the following approaches. Boughorbel et al. [6] used the χ2 -similarity metric to measure the similarity between the range intensity image and the intensity image. Umeda et al. [7] proposed a method for estimating the registration of a range sensor and a color sensor on the basis of the gradient constraint between a range intensity image and a 2D image. Other approaches use the following features: edges (Kurazume et al. [8]), corners (Elstrom et al. [9]), and SIFT (Scale-Invariant Feature Transform) [11]. SIFT has been shown to be largely invariant to scale changes and varying illumination; thus, the feature is a good candidate for registration. Bohm et al. [10] match a range intensity image and a color image using SIFT. Then, the rigid body transform for a pair-wise registration is computed. The problem is that only extrinsic parameters are estimated. In our approach, using SIFT, the intrinsic and extrinsic parameters and the distortion of the camera’s lens are estimated for highly accurate registration. The range intensity and color images differ, and the pair of matched feature points usually contains a number of false matches. For highly accurate registration, we propose a method to reduce false matches, i.e., soft matching. In this paper, soft matching means that false matches are weakly removed while correct matches are kept. To reduce false matches, a range intensity image is made similar to a color image before using SIFT. First, a range intensity image is combined with a background image of a color image. Then, a range intensity image is corrected for extracting good candidates. Moreover, soft matching is used. Soft matching consists of the following three methods: matching reliability from the Bhattacharyya distance of matched feature points, using scale information from SIFT, and RANSAC [12] is applied. In this stage, its threshold is kept high. As a result, the accuracy of registration is advanced. In this paper, false matches are reduced, and
Fig. 2. Flow of the 3D-2D registration and projection of a 3D point on an image plane
two images are matched using SIFT. Then, false matches are removed, and a linear equation for the parameters of the 3D-2D registration is solved. Finally, the camera parameters and the distortion of the camera's lens are updated with the obtained correction of the parameters. This paper is organized as follows. In Section 2, we show the flow of the 3D-2D registration. In Section 3, a linear equation for the 3D-2D registration is formulated. After introducing the method to reduce false matches in Section 4, we show the soft matching in Section 5. We show several experimental results in Section 6, and we conclude the paper in Section 7.
2 Outline of the Registration Method
The inputs of the 3D-2D registration are a color image and a 3D geometric model with range intensity images. For the registration of the 3D geometric model and a color image, it is necessary to obtain the camera parameters, i.e., the intrinsic and the extrinsic ones, in the coordinate system of the 3D geometric model. If the camera parameters are appropriate, the color image and the range intensity image projected on the camera's image plane are matched. In other words, 3D-2D registration entails obtaining the camera parameters to match the two images. In practice, the distortion of the camera's lens should also be considered for high-accuracy registration. Fig. 2 (a) shows the flow of the registration. First, the initial values of the parameters are given. A range intensity image is projected on the camera's image plane using the camera parameters, producing a 2D image. The projection is applied to the 3D coordinates of the range image points. The projected range intensity image is compared with the color image. If they are not sufficiently matched, a range intensity image is made similar to a color image. After using SIFT, soft matching is used. The camera parameters are then updated with the obtained correction of the parameters. We iteratively apply the loop process in three stages. In stage 1, only extrinsic parameters are updated. In stage 2, the translation velocity vector and intrinsic parameters are
updated. In stage 3, the camera parameters and the distortion of the camera's lens are updated. To evaluate the matching of the range intensity image and the color image, we choose a correlation coefficient. In order to obtain a color image which is similar to the range intensity image, the color channel closest to the projected light of the range sensor is used for comparison with the range intensity image.
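The matching score can be illustrated with a simple masked Pearson correlation between the projected range intensity image and one color channel, as sketched below. How invalid (unprojected) pixels are handled is our assumption and is not specified in the paper.

```python
import numpy as np

def matching_score(projected_intensity, color_channel, valid_mask):
    """Correlation coefficient between the projected range intensity image and
    one channel of the color image, restricted to pixels covered by the
    projection (illustrative sketch of the evaluation criterion)."""
    a = projected_intensity[valid_mask].astype(np.float64)
    b = color_channel[valid_mask].astype(np.float64)
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```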
3 Formulation of the 3D-2D Registration
A 3D point (X, Y, Z) is projected at (u, v) on the image plane of the camera (see Fig. 2 (b)):

$$u = \frac{\alpha_u X + sY}{Z} + u_0, \qquad v = \frac{\alpha_v Y}{Z} + v_0, \quad (1)$$

where α_u, α_v, u_0, v_0, s are the intrinsic parameters: α_u, α_v are the aspect ratio, u_0, v_0 are the center of the image, and s is the skew. Firstly, we formulate a linear equation for the extrinsic parameters. Then, we extend the equation to the case of variable intrinsic parameters and distortion of the camera's lens.
3.1 Constraint for Extrinsic Parameters
Suppose that the intrinsic parameters are given and constant. In this case, Eq. (1) is differentiated to

$$\dot{u} = \frac{\alpha_u}{Z}\dot{X} + \frac{s}{Z}\dot{Y} - \frac{\alpha_u X + sY}{Z^2}\dot{Z}, \quad (2)$$

$$\dot{v} = \frac{\alpha_v}{Z}\dot{Y} - \frac{\alpha_v Y}{Z^2}\dot{Z}. \quad (3)$$

In this paper, the time dimension for derivatives refers to the difference between the projected range intensity image and the color image, which has a color component near the projected light of the range sensor; the color image is virtually translated from the range intensity image in unit time. Therefore, by using SIFT, (u̇, v̇) is the translation of the color image from the range intensity image. Eq. (4) gives Ẋ = [Ẋ, Ẏ, Ż]^T, i.e., the velocity of the 3D point resulting from the camera motion:

$$\dot{X} = -v - \omega \times X. \quad (4)$$
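A small NumPy sketch of this estimation step is given below. It assembles the linear constraints implied by Eqs. (2)-(4) directly from the projection Jacobian and the point velocity, rather than using the expanded coefficient form, and solves for the six motion parameters by least squares; all function and variable names are ours.

```python
import numpy as np

def skew(p):
    x, y, z = p
    return np.array([[0, -z, y], [z, 0, -x], [-y, x, 0]], dtype=float)

def solve_motion_parameters(points_3d, flow_uv, alpha_u, alpha_v, s):
    """Least-squares estimate of (v, omega) from Eqs. (2)-(4).

    points_3d: Nx3 model points (X, Y, Z); flow_uv: Nx2 displacements
    (u_dot, v_dot) obtained from the SIFT matches.
    """
    rows, rhs = [], []
    for (X, Y, Z), (du, dv) in zip(points_3d, flow_uv):
        # Jacobian of the projection (u, v) w.r.t. (X, Y, Z), from Eq. (1).
        J = np.array([[alpha_u / Z, s / Z, -(alpha_u * X + s * Y) / Z**2],
                      [0.0, alpha_v / Z, -alpha_v * Y / Z**2]])
        # X_dot = -v - omega x X  =>  d(X_dot)/dv = -I,  d(X_dot)/domega = skew(X)
        B = np.hstack([-np.eye(3), skew((X, Y, Z))])
        rows.append(J @ B)
        rhs.extend([du, dv])
    A = np.vstack(rows)
    p, *_ = np.linalg.lstsq(A, np.array(rhs), rcond=None)
    return p[:3], p[3:]   # translation velocity v, rotation velocity omega
```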
where v = [vx , vy , vz ]T is the translation velocity vector and ω = [ωx , ωy , ωz ]T is the rotation velocity vector. In this paper, these parameters are referred to as motion parameters. By substituting Eq. (4) in Eq. (2) and Eq. (3), u˙ = −avx − bvy − cvz − (cY − bZ)ωx −(aZ − cX)ωy − (bX − aY )ωz , v˙ = −dvy − evz − (eY − dZ)ωx
(5)
Registration of 3D Geometric Model and Color Images
+eXωy − dXωz ,
329
(6)
where a, b, c, d, and e are given by s αu X + sY αv Y αu αv ,b = ,c = − ,e = − 2 . ,d = Z Z Z2 Z Z These are linear equations for the six motion parameters v and ω. Therefore, they can be calculated by obtaining the three or more points and applying the simple linear least squares method. In practice, small displacements are used for velocity components in Eq. (4). Extrinsic parameters are represented by 3 × 3 rotation matrix R and 3D translation vector t = [tx , ty , tz ]T . They are directly calculated from the obtained small displacements. Eq. (4) is for small displacements, the correct values of the extrinsic parameters are usually not reached on the first try. Therefore, the loop process is iteratively applied (see Fig. 2 (a)). a=
3.2
Constraint for Intrinsic Parameters and Distortion of the Camera Lens
The method can be extended for intrinsic parameters and the distortion of the camera lens. Suppose that the distortion of the camera lens is in proportion to the cubed range from the center of the image. In this case, Eq. (1) is given by u = αu +s
Y Z
X Z
X2 + Y 2 Z2 X2 + Y 2 1+k + u0 , Z2 1+k
(7)
Y X2 + Y 2 v = αv (8) + v0 , 1+k Z Z2 where k is the distortion parameter of the camera lens. With the same procedure given above, we can obtain the linear equation camera parameters and the distortion of the camera lens. Therefore, they can be calculated by obtaining six or more points and applying the simple linear least squares method. The parameters for the 3D-2D registration can be obtained by the proposed linear equation in a case in which intrinsic parameters or the distortion of the camera lens is either given or not (see Section 2).
4
Image Modification for Reducing False Matches
A range intensity and color images differ, and the pair of matched feature points usually contains a number of false matches. Incorrect candidates are extracted from a range intensity image for the following reasons. – In a range intensity image, a boundary region of a target object has no texture – The S/N ratio of a range intensity image is low
Therefore, a range intensity image is made to be similar to a color image to reduce false matches before using SIFT.
(a) Color image
(b) Background image (c) Range intensity image (d) Combined image
Fig. 3. Combined range intensity image with a background image of a color image
4.1
Combined Range Intensity Image with a Background Image of a Color Image
When the features of a range intensity image are extracted using SIFT, as a boundary region of a target object has no texture, a number of false matches are extracted. Therefore, it is needed to solve the problem which is a boundary region of a target object has no texture in a range intensity image. For reducing false matches, a range intensity image is combined with the grayscale image which is transformed from a background image of a color image. To combine two images, the 3D information of a projected range intensity image is used. Fig. 3 (a) is an example of a color image. Fig. 3 (b) shows a background image of a color image, Fig. 3 (c) shows a range image, and Fig. 3 (d) illustrates the range intensity image which is combined with a background image. In addition to matching two images, features that are extracted in a background image are eliminated. 4.2
Correction of Range Intensity Image
A range intensity image is affected by the following factors. – Distance to each measured point – Normal vector at the measured point – Sensor-specific characteristics
By the factors above, to match a range intensity image and a color image, incorrect candidates are extracted from a range intensity image. Therefore, instead of a raw range intensity image, we used a corrected one that was obtained by the method proposed by Shinozaki et al. [13].
5
Soft Matching
To remove false matches, if a simple robust estimate, RANSAC [12], is used, correct matches can be removed. Therefore, soft matching is used before using RANSAC. First, a small difference in the scale of the correct matches is used. Then, the appearance of the pair of matched feature points is used. We use Bhattacharyya distance to evaluate the similarity of the appearance.
(a) Range intensity image
(b) Color image
Fig. 4. Area of calculating the Bhattacharyya distance (z is the scale information from SIFT)
5.1
Scale Information from SIFT
First, false matches were removed by using scale information from SIFT. If the pair of matched feature points is correct, the difference in the scale of each feature point will be small. Therefore, false matches are removed by the following threshold processing. if (|x| < μx − 0.8 ∗ σx ) : correct (9) otherwise : f alse where x is the difference of scale in each feature point, μx is the average of x, and σx is the standard deviation of x. The constant 0.8 was determined empirically. 5.2
Matching Reliability from the Bhattacharyya Distance
If the pair of matched feature points is correct, the intensity of the area of each feature point will be similar. We use Bhattacharyya distance for calculating the similarity. The Bhattacharyya distance S is given as S=
m √
pu qu ,
(10)
u=1
where pu and qu are the two normalized color histograms made from hue information, u is the hue number, and m is the number of the elements of the hue. We propose a matching reliability using Bhattacharyya distance of the intensity of area of the pair of matched feature points. The matching reliability is defined as 1 (1 − S)2 1 exp − p =√ . (11) 2 σ2 2πσ This equation represents a normal distribution with the average zero and the standard deviation σ. In this paper, σ is the standard deviation of S, and the greater the Bhattacharyya distance, the higher the matching reliability. It is important to determine the area for calculating the Bhattacharyya distance. To determine the area appropriately, scale information from SIFT is used. In the previous chapter, the difference in scale of each feature point is small. The
Bhattacharyya distance is then calculated (3.0 ∗ z) from SIFT keypoints (see Fig. 4 ), where z is the scale from SIFT and (3.0 ∗ z) is the area of description of SIFT. The method for removing false matches is shown below. First, the Bhattacharyya distance S is calculated by using Eq. (10). Secondly, the matching reliability p of the pair of matched feature points is calculated by using Eq. (11). False matches are then removed by the following threshold processing. if (p > μp − 1.0 ∗ σp ) : correct (12) otherwise : f alse where μp and σp are the average of matching reliability p and the standard deviation of matching reliability p respectively. In addition, the constant 1.0 is determined empirically. Finally, after soft matching is used, RANSAC is applied. In this stage, its threshold is kept high.
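The soft-matching score of Eqs. (10)-(12) can be sketched as follows. The histograms, the guard on the standard deviation and the function names are illustrative assumptions; only the formulas themselves follow the text above.

```python
import numpy as np

def bhattacharyya(hist_p, hist_q):
    # Eq. (10): S = sum_u sqrt(p_u * q_u) for two normalized hue histograms.
    return float(np.sum(np.sqrt(hist_p * hist_q)))

def matching_reliability(S, sigma):
    # Eq. (11): Gaussian-shaped reliability; a larger S gives a higher value.
    return np.exp(-0.5 * ((1.0 - S) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def keep_matches(scores_S):
    """Threshold of Eq. (12): keep a pair if p > mu_p - 1.0 * sigma_p.
    scores_S are the Bhattacharyya coefficients of all candidate pairs."""
    scores_S = np.asarray(scores_S, dtype=float)
    sigma = max(float(np.std(scores_S)), 1e-6)   # sigma is the std. dev. of S
    p = matching_reliability(scores_S, sigma)
    return p > (p.mean() - 1.0 * p.std())
```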
6
Experimental Results
The effectiveness of the proposed method is illustrated by experiments with real world objects. These experiments evaluate three points: accuracy, processing time and degree of viewpoint change. 6.1
Experiment of System and Initial Values
The range images and range intensity images are obtained with the ShapeGrabber range sensor (scan head SG-102 on a PLM300 displacement system) [14]. A Nikon D70 provides the color images in RAW format. As the laser color is red, the R-channel of the color image is used for registration. The size of the color images is 3008 × 2000. The initial value of extrinsic parameters is given as follows. R was set to the unit matrix. tx , ty were set as the center of the gravity of the target range. tz was roughly given manually. The initial value of intrinsic parameters au , av , u0 , v0 was obtained using the camera calibration method proposed by Zhang [15]. au , av were set to 8032 and 8013, respectively. u0 , v0 were set to 1648 and 1069, respectively. Initial values of other parameters were given as follows. Skew s was set to 0. k was set to zero, i.e., without any distortion of the camera lens. 6.2
Making a 3D Geometric Model with Range Intensity Images
The object of registration is Fig. 1 (cat). The dimensions of the object are w59mm × h112mm × d32mm. Fig. 5 (a) illustrates a 3D geometric model that is integrated from a number of range images and corrected range intensity images with PolyWorks [16]. The number of images is 15, and that of points is 183016 in the 3D geometric model.
(a) cat
(b) shiny crane Fig. 5. Geometric model with intensity information
6.3
Registration Result
Fig. 6 (a), (b) illustrates the result of the registration. The bright (green) images are the range intensity image, and the dark (red) images are the color image. The result of camera parameters and the distortion of the camera lens are shown below. ⎡
⎤ ⎡ ⎤ 0.883 0.037 −0.468 151.5 R = ⎣ −0.027 0.999 0.033 ⎦ , t = ⎣ −19.6 ⎦ 0.469 −0.018 0.883 −27.3
αu αv s = 6091.7 6097.6 −12.8
u0 v0 = 1362.9 976.7 , k = 0.1892
The processing time was about 30.4 s when we used a PC (CPU: Core i7 2.93GHz) with GPU (GeForce GTX260). The correlation is used to evaluate the convergence of the registration. If the correlation is decreased after several iterations, the stage is updated. In the next stage, the parameters are used, obtaining the highest correlation in the previous stage. The number of iterations is 17 in three stages and the correlation coefficient is 0.8766. Then, Fig. 7 illustrates the result of registration for a variety of viewpoint changes. This result shows that precise registration can be obtained even when the initial parameters are far from the final estimated values. In addition, by obtaining the extrinsic and intrinsic parameters and distortion of the camera lens, precise registration can be obtained. Fig. 6 (c), (d), (e), (f), (g), (h) illustrate the result of the registration of other angles in the same object (Fig. 6 (a), (b)). The correlation coefficients are 0.8286, 0.8467, and 0.5467, respectively. These results show that precise registration can be obtained when a color image of varied angles is used. Fig. 6 (g), (h) show that our method can be applied when the object contains few features. Fig. 8 (a), (b) show another model (shiny crane). The dimensions of the object are w44mm × h175mm × d64mm. Fig. 5 (b) shows the 3D geometric model. The number of range and corrected range intensity images is 87 and that of points is 409057 in 3D geometric model. Fig. 8 (c), (d) illustrate the result of
(a) Initial state
(b) Final result in 2D
(e) Initial state
(f) Final result in 2D
(c) Initial state
(d) Final result in 2D
(g) Initial state
(h) Final result in 2D
Fig. 6. Registration result: cat
Fig. 7. Effect of viewpoint change
the registration. The result of the camera parameters and the distortion of the camera lens are shown below. ⎡
⎤ ⎡ ⎤ 0.990 0.378 −0.137 −31.8 R = ⎣ −0.025 0.995 0.093 ⎦ , t = ⎣ 13.0 ⎦ 0.140 −0.088 0.986 −409.9
αu αv s = 7226.4 7185.8 −47.0
u0 v0 = 1511.3 1423.2 , k = 0.000
(a) Range intensity image (b) Color image
(c) Initial state
(d) Final result in 2D
Fig. 8. Range intensity image and color image and registration result: shiny crane
(a) cat
(b) shiny crane
Fig. 9. Constructed 3D models with color information
The processing time was about 24.2 s. The number of iterations is 11 in three stages and the correlation is 0.7343. Fig. 9 (a) illustrates the final result of texture mapping by using five color images. The images were taken from front (Fig. 6 (a)), left (Fig. 6 (c)), back (Fig. 6 (e)), top (Fig. 6 (g)) and left. Fig. 9 (b) illustrates another result of texture mapping with five images. Fig. 8 (b) is the one of the five.
7
Conclusions
We have proposed a method for 3D-2D registration using SIFT and the range intensity image. In our approach, the extrinsic and intrinsic parameters and distortion of the camera lens can be obtained simultaneously. Then, we have proposed a method to reduce false matches and soft matching. In soft matching, false matches are weakly removed while correct ones are kept. Before using SIFT, to reduce false matches, a range intensity image is combined with a background image of a color image and corrected. Moreover, we have proposed the soft
matching. We achieve precise automatic registration. In future work, a more detailed quantitative evaluation of the method will be performed.
References 1. Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L., Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., Shade, J., Fulk, D.: The digital Michelangelo project:3D scanning of large statues. In: SIGGRAPH 2000, pp. 131–144 (2000) 2. Iwakiri, Y., Kaneko, T.: PC-based realtime texture painting on real world objects. In: Proc. Eurographics 2001, vol. 20, pp. 105–113 (2001) 3. Lensch, H.P.A., Heidrich, W., Seidel, H.P.: Automated texture registration and stitching for real world models. In: Proc. Pacific Graphics 2000, pp. 317–326 (2000) 4. Lavallee, S., Szeliski, R.: Recovering the position and orientation of free -form objects from image contours using 3D distance maps. IEEE Trans. Pattern Anal. Mach. Intell. 17(4), 378–390 (1995) 5. Neugebauer, P.J., Klein, K.: Texturing 3D models of real world objects from multiple unregistered photographic views. In: Proc. Eurographics 1999, pp. 245–256 (1999) 6. Boughorbel, F., Page, D., Dumont, C., Abidi, M.A.: Registration and integration of multi-sensor data for photo-realistic scene reconstruction. In: Proc. Applied Imagery Pattern Recognition, pp. 74–84 (1999) 7. Umeda, K., Godin, G., Rioux, M.: Registration of range and color images using gradient constrains and range intensity images. In: Proc. of 17th Int. Conf. on Pattern Recognition, vol. 3, pp. 12–15 (2004) 8. Kurazume, R., Nishino, K., Zhang, Z., Ikeuchi, K.: Simultaneous 2D images and 3D geometric model registration for texture mapping utilizing reflectance attribute. In: Proc. Fifth ACCV, pp. 99–106 (2002) 9. Elstrom, M.D., Smith, P.W.: Stereo-based registration of multi-sensor imagery for enhanced visualization of remote environments. In: Proc. of the 1999 Int. Conf. on Robotics Automation, pp. 1948–1953 (1999) 10. Boehm, J., Becker, S.: Automatic Marker-Free Registration of Terrestrial Laser Scans using Reflectance Features. In: 8th Conf. on Optical 3D Measurement Techniques (2007) 11. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004) 12. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 16(24), 381–395 (1981) 13. Shinozaki, M., Kusanagi, M., Umeda, K., Godin, G., Rioux, M.: Correction of color information of a 3D model using a range intensity image. Comput. Vis. and Image Understanding 113(11), 1170–1179 (2009) 14. ShapeGrabber, http://www.shapegrabber.com 15. Zhang, Z.: A flexible new techniques for camera calibration. IEEE Trans. Pattern Anal. Martch. Intell. 22(11), 1330–1334 (2000) 16. PolyWorks, http://www.innovmetric.com
Denoising Time-Of-Flight Data with Adaptive Total Variation
Frank Lenzen1,2, Henrik Schäfer1,2, and Christoph Garbe1,2
1 Heidelberg Collaboratory for Image Processing, Heidelberg University
2 Intel Visual Computing Institute, Saarland University
Abstract. For denoising depth maps from time-of-flight (ToF) cameras we propose an adaptive total variation based approach of first and second order. This approach allows us to take into account the geometric properties of the depth data, such as edges and slopes. To steer adaptivity we utilize a special kind of structure tensor based on both the amplitude and phase of the recorded ToF signal. A comparison to state-of-the-art denoising methods shows the advantages of our approach. Keywords: Time-of-flight, Denoising, Adaptive Total Variation, Higher Order Regularization.
1 Introduction
In the last few years time-of-flight (ToF) cameras have become popular in order to retrieve depth information from 3D scenes. The basic idea of ToF cameras is to actively illuminate the scene by a modulated infrared (IR) signal and calculate depth information from the phase shift between the emitted and recorded signal [17,20]. As an example, Fig. 1 shows a depth map (right) taken with a PMD Cam Cube 3, together with the IR amplitude of the recorded signal (middle) and a standard RGB image providing an overview over the scene (left). ToF recordings suffer from some drawbacks.
Fig. 1. Test scene and data set. Left: RGB image of test scene. Middle: amplitude of signal. Right: depth data with colorbar.
One issue is the presence of significant noise in the depth data, as can be seen in Fig. 1 right. Second, the output of ToF cameras often is of low resolution. Typical resolutions are in the range up to 200 × 200 pixels (PMD Cam Cube 3). Finally, if dynamic scenes are recorded, the output suffers from motion blur. In this paper, we tackle the problem of noise contained in ToF data. The task is to reconstruct a noise-free and accurate depth map. This problem has already been addressed in literature. Denoising approaches for ToF data are e.g. presented in [15,10,7,6,11,1]. These approaches are mainly based on bilateral filtering or wavelet techniques, combined with appropriate noise models. In order to regularize the denoising problem, i.e. prescribing smoothness of the result, these models (except for the clustering approach in [15]) do not assume explicit geometric structures in the depth map such as regular edges or piecewise planar surfaces. In contrast, we present here a variational denoising approach based on adaptive total variation (TV), which allows to take into account such geometric properties and thus to reconstruct edges and slopes with sufficient regularity. Denoising with TV methods has been intensively studied in literature. For the seminal work we refer to [18]; adaptive TV variants are described e.g. in [22,9,14]; higher order TV approaches are considered in [2,21,19,4]. As an alternative, non– local techniques have been proposed, e.g. the nonlocal-means approach in [3] and non-local TV regularization [12]. Contribution: We present a total variation (TV) based denoising approach especially tailored for smoothing depth maps. This variational approach uses penalization terms, which locally adapt to the image content and which are well suited for preserving edges and linear slopes in depth maps. Such slopes for example can be expected in depth maps of piecewise planar objects. Thus, our approach is motivated by geometrical considerations. Organization of the paper: We start with a description of the data acquisition and transformation, cf. Sect. 2. Our denoising approach is presented in Sect. 3. A comparison to state-of-the-art methods in Sect. 4 shows the advantages of our approach. We conclude the paper with Sect. 5.
2 Data Acquisition and Transformation
The data used for the experiments is acquired with a PMD CamCube 3 with 200 × 200 pixels, a continuously modulated light source and suppression of background illumination. The camera records 8 images per shot, correlated with the modulation signal at 4 different phase shifts. From these 8 images, the phase shift ϕ and the amplitude A of the reflected light can be calculated for each pixel, as described in [17]:

A_{i,j} = \frac{2}{N} \left| \sum_{n=0}^{N-1} I_{i,j}^{(n)} e^{-2\pi i \frac{n}{N}} \right|, \qquad \varphi_{i,j} = \arg\left( \sum_{n=0}^{N-1} I_{i,j}^{(n)} e^{-2\pi i \frac{n}{N}} \right).
While the phase shift ϕ of a pixel corresponds to the distance from the scene to the camera, the desired depth map d should contain the distance between the scene and the camera plane, measured parallel to the optical axis. Therefore, the phase shift has to be transformed according to d_{i,j} := \varphi_{i,j} \cos \alpha_{i,j}, where \alpha_{i,j} is the angle between the optical axis and the viewing ray through pixel (i, j). Remark 1. In this paper, we decided to retain the original xy-grid of the camera layout and to apply only a transformation to the depth data (z-coordinate), with the disadvantage that the geometry of the scene is not optimally represented. Future work will focus on an exact handling of the 3D geometry of the data.
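To make the preceding computations concrete, the following Python sketch (an illustration only, not the authors' code) computes amplitude, phase, and the transformed depth map from the N phase-shifted correlation images; the per-pixel angle map alpha is assumed to be precomputed from the camera intrinsics.

```python
import numpy as np

def tof_amplitude_phase(I):
    """I: array of shape (N, H, W) with the N phase-shifted correlation images.
    Returns amplitude A and phase shift phi per pixel (cf. [17])."""
    N = I.shape[0]
    n = np.arange(N).reshape(N, 1, 1)
    # Discrete Fourier coefficient at the modulation frequency.
    c = np.sum(I * np.exp(-2j * np.pi * n / N), axis=0)
    A = 2.0 / N * np.abs(c)
    phi = np.angle(c)
    return A, phi

def depth_from_phase(phi, alpha):
    """Transform the phase shift into the depth measured parallel to the
    optical axis, d_ij = phi_ij * cos(alpha_ij); alpha is the per-pixel angle
    between the optical axis and the viewing ray (assumed given)."""
    return phi * np.cos(alpha)
```

Here phi is kept in the camera's raw phase units, as in Sect. 2; converting it to a metric distance would additionally require the modulation frequency, which is not needed for the denoising itself.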
3 Denoising Method
In the following, we describe our approach to denoise the depth map d obtained as described in Sect. 2. Our ansatz is based on the variational problem

\min_{u \in \mathbb{R}^{n,m}} F(u) := \min_{u} \sum_{i,j} w_{i,j} (u_{i,j} - d_{i,j})^2 + \phi(u).   (1)
Here w_{i,j} are local weights on the data term in order to incorporate data feasibility. These weights are determined by considering an appropriate noise model, see Sect. 3.1. The regularization term \phi(u) is assumed to be of the form

\phi(u) = \sum_{i,j} \sup\{\, v_{i,j}^T L_{i,j}(u) \mid v_{i,j} \in C_{i,j} \,\},   (2)
where L_{i,j} : \mathbb{R}^{n,m} \to \mathbb{R}^s are local finite difference operators and C_{i,j} \subset \mathbb{R}^s are closed convex constraint sets. The general concept in (2) allows for a locally adaptive L^1-penalization of derivatives of u, provided by L_{i,j} u, where the adaptivity is determined by the size and shape of the constraint sets C_{i,j}. This concept also covers standard TV regularization approaches.

3.1 Weighting of the Data Term
We follow the noise model presented in [8], according to which the noise \varepsilon_{i,j} at each pixel (i, j) is independently Gaussian distributed. The variance \sigma_{i,j}^2 depends on the amplitude A_{i,j} of the recorded IR signal, i.e., \sigma_{i,j}^2 = \sigma_0^2 / (2 A_{i,j}^2) for some constant factor \sigma_0 > 0. The recorded depth map is then given as d_{i,j} = u_{i,j} + \varepsilon_{i,j} with noise-free data u. We apply a maximum-a-posteriori (MAP) estimator:

\max_{u \in \mathbb{R}^{n \times m}} p(u\,|\,d) = \max_{u \in \mathbb{R}^{n \times m}} p(d\,|\,u)\, p(u),   (3)
where p(u|d) is the conditional probability of u given d, p(d|u) is the conditional probability of d given u, and p(u) is the (unconditioned) probability of u. p(u) is commonly assumed to be known a priori; thus p(u) is also referred to as the prior on u. From the Gaussian noise model we find that

p(d\,|\,u) = c_1 \prod_{i,j} e^{-\frac{A_{i,j}^2}{\sigma_0^2} |u_{i,j} - d_{i,j}|^2},   (4)
with some constant c_1 > 0. For the prior on u, we consider at this point the general form p(u) = \frac{1}{c_2} e^{-\phi(u)} for some suitable \phi and a constant c_2 > 0. With these settings and using the fact that problem (3) is equivalent to solving \min_u -\log p(u\,|\,d), we end up with

\min_{u} \sum_{i,j} \frac{A_{i,j}^2}{\sigma_0^2} |u_{i,j} - d_{i,j}|^2 + \phi(u),   (5)
where additive constant terms have been omitted. Comparing (5) with (1), we find that the weights w_{i,j} in (1) have to be chosen as w_{i,j} := A_{i,j}^2 / \sigma_0^2.
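As a small illustration of this derivation (not the authors' code), the MAP-derived weights and the resulting weighted data term can be computed directly from the amplitude image; sigma0 is the constant noise factor of the model in [8].

```python
import numpy as np

def data_term_weights(A, sigma0):
    """Per-pixel weights w_ij = A_ij^2 / sigma0^2 for the weighted data term in (1)."""
    return (A ** 2) / (sigma0 ** 2)

def weighted_data_term(u, d, w):
    """Evaluates sum_ij w_ij (u_ij - d_ij)^2 for a candidate depth map u."""
    return np.sum(w * (u - d) ** 2)
```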
3.2 Edge Detection
To supply edge information for the proposed method, we use the enhanced structure tensor proposed in [13]. While calculating the derivatives for the structure tensor, the data is upsampled by a factor of 2 to prevent information loss. After obtaining the structure tensor, smoothed with an ordinary Gaussian, it is recalculated with an hourglass-shaped Gaussian filter aligned to the previously detected edges. This prevents too much smoothing in the cross-edge direction and makes it possible to distinguish close-by parallel edges. We use all available data, i.e., structure tensors for both A and d. Then the sum of both is evaluated to acquire the eigenvectors. This way, the normal orientation of the edges (eigenvectors v_{i,j}) and a value for their distinctness (differences of eigenvalues s_{i,j}) are obtained. To reduce the noise in the edge image, a smoothing function is applied: weighted with the ℓ1 distance, the surrounding 24 pixels are checked for aligned edges by utilizing the inner product of the two relevant vectors. A Gaussian-like function is used to better distinguish between (almost) aligned and unaligned edges:

s^*_{i,j}(A, d) = \frac{1}{\mathrm{norm}} \sum_{(k,l) \in N_{24}} \frac{1}{\|(i,j)-(k,l)\|_1} \exp\!\left( -\bigl(1 - v_{i,j}^T v_{k,l}\bigr) / (2\sigma^2) \right),

where norm is a normalizing constant. Finally, to have matching edge and depth data, the edge images are downsampled to the original size.
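The following Python sketch illustrates the basic idea of combining the structure tensors of amplitude and depth to obtain edge normals and a distinctness measure. It is a simplified, plain-Gaussian version only: the upsampling, the hourglass-shaped re-smoothing of [13], and the neighborhood smoothing of s* are omitted, and the smoothing scales are assumed parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor(img, sigma_grad=1.0, sigma_smooth=2.0):
    """Smoothed structure tensor entries of a single channel."""
    gy, gx = np.gradient(gaussian_filter(img, sigma_grad))
    Jxx = gaussian_filter(gx * gx, sigma_smooth)
    Jxy = gaussian_filter(gx * gy, sigma_smooth)
    Jyy = gaussian_filter(gy * gy, sigma_smooth)
    return Jxx, Jxy, Jyy

def edge_orientation_and_strength(A, d):
    """Sum the structure tensors of amplitude A and depth d, then take the
    per-pixel eigen-decomposition: the dominant eigenvector gives the edge
    normal v, the eigenvalue difference a measure of edge distinctness s."""
    Jxx = Jxy = Jyy = 0.0
    for img in (A, d):
        txx, txy, tyy = structure_tensor(img)
        Jxx, Jxy, Jyy = Jxx + txx, Jxy + txy, Jyy + tyy
    # Closed-form eigenvalues of the symmetric 2x2 tensor per pixel.
    trace = Jxx + Jyy
    root = np.sqrt((Jxx - Jyy) ** 2 + 4.0 * Jxy ** 2)
    lam1, lam2 = 0.5 * (trace + root), 0.5 * (trace - root)
    s = lam1 - lam2                                   # edge distinctness
    theta = 0.5 * np.arctan2(2.0 * Jxy, Jxx - Jyy)    # orientation of dominant eigenvector
    v = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
    return v, s
```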
3.3 Regularizer φ(u)
Our regularization is based on discrete derivatives of first and second order, using finite differences on the pixel grid. Let Dx u, Dy u denote right-sided finite differences for the first, and Dxx u, Dyy u, Dxy u central finite differences for the second order, respectively. At the boundary of the pixel grid, u is constantly extended, i.e. we assume (discrete) homogeneous Neumann boundary conditions. We are aiming at an anisotropic total variation (TV) penalization of u (cf. [9]), coupled with an isotropic L1 -penalization of the Hessian of u (cf. [19]),
\phi(u) := \sum_{i,j} c_{i,j}\, \sqrt{ \begin{pmatrix} D_x u_{i,j} \\ D_y u_{i,j} \end{pmatrix}^{T} G_{i,j} \begin{pmatrix} D_x u_{i,j} \\ D_y u_{i,j} \end{pmatrix} } \; + \; (1 - c_{i,j})\, \gamma_{i,j} \left\| \begin{pmatrix} D_{xx} u_{i,j} \\ D_{yy} u_{i,j} \\ D_{xy} u_{i,j} \end{pmatrix} \right\|_F,   (6)
where
– c_{i,j} ∈ [0, 1] provides a local weighting of the first and second order terms,
– the matrix G_{i,j} = \alpha_{i,j}^2 v_{i,j} v_{i,j}^T + \beta_{i,j}^2 (\mathrm{Id} - v_{i,j} v_{i,j}^T) defines the anisotropy for the first order, for a given unit vector v_{i,j} \in \mathbb{R}^2, with \alpha_{i,j} > 0, \beta_{i,j} > 0 being the local regularization parameters parallel and orthogonal to v_{i,j}, respectively,
– the term

\left\| \begin{pmatrix} D_{xx} u \\ D_{yy} u \\ D_{xy} u \end{pmatrix} \right\|_F := \sqrt{(D_{xx} u)^2 + (D_{yy} u)^2 + 2 (D_{xy} u)^2}

is the Frobenius norm of the discrete Hessian of u, and
– \gamma_{i,j} > 0 defines the local regularization parameter for the second order term.
Fig. 2. Top left: standard TV. Top right: standard TV with weights. Bottom left: anisotropic TV. Bottom right: anisotropic TV with second order terms. By considering a weighted data term, we are able to cope with the locally varying noise variance. Anisotropic TV ensures a better preservation of edges, while the higher order penalization term regularizes the slopes.
Remark 2. We note that the regularization term \phi(u) in (6) is of the form (2), as can be seen by defining L_{i,j}(u) := (D_x u_{i,j}, D_y u_{i,j}, D_{xx} u_{i,j}, D_{yy} u_{i,j}, D_{xy} u_{i,j}) and the constraint set C_{i,j} = C(\alpha_{i,j}, \beta_{i,j}, \gamma_{i,j}, c_{i,j}, v_{i,j}) \subset \mathbb{R}^5 by

C(\alpha, \beta, \gamma, c, v) := \left\{ p \in \mathbb{R}^5 : \frac{1}{\alpha^2} \left( v^T \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} \right)^2 + \frac{1}{\beta^2} \left\| (\mathrm{Id} - v v^T) \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} \right\|^2 \le c^2, \;\; \frac{1}{\gamma^2} \left\| \begin{pmatrix} p_3 \\ p_4 \\ p_5 \end{pmatrix} \right\|_F^2 \le (1 - c)^2 \right\}.
Fig. 3. Left: investigated close-ups in the first data set. Middle: second test scene. Right: cross section (black line) through second data set.
Remark 3. For denoising ToF data, we propose to use v_{i,j} = v_{i,j}(u) and s^*_{i,j}(A, u) defined as in Sect. 3.2. Moreover, we use c_{i,j} := g(s^*_{i,j}(u)) for some continuous mapping g : [0, 1] → [0, 1] and fixed \alpha_{i,j}, \beta_{i,j}, \gamma_{i,j} > 0. In the case that s^*_{i,j} = 0, i.e., no edge is present at pixel (i, j) and thus v_{i,j} does not contain useful information, we additionally assume \alpha_{i,j} = \beta_{i,j}, leading to an isotropic TV penalization. Note that we choose v_{i,j}(u) and s^*_{i,j}(A, u) depending on u instead of the noisy data d. In particular, the adaptivity of the regularizer is determined by the unknown solution u. In this case the existence theory becomes more involved. Existence of a minimizer of F(u) can be shown for w_{i,j} > 0, see [14, Prop. 1].
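For illustration only: the paper only requires g to be a continuous map from [0, 1] to [0, 1], so the concrete choice below (a clipped linear ramp with assumed thresholds s_lo and s_hi, and the assumed direction that stronger edges yield larger c) is purely hypothetical.

```python
import numpy as np

def g(s, s_lo=0.1, s_hi=0.6):
    """Map the normalized edge measure s* in [0, 1] to the weight c in [0, 1].
    Small edge measures give small c (the second-order term dominates in
    smooth regions); large measures give c close to 1."""
    return np.clip((s - s_lo) / (s_hi - s_lo), 0.0, 1.0)
```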
4 Experiments
Results of Proposed Method. We demonstrate that the combination of noise model, anisotropic TV and higher order term is necessary in order to obtain an adequate reconstruction of the noise-free depth map. To this end, we consider variants of TV denoising which do not address all of these issues simultaneously. First, we utilize the standard ROF model in [18], see Fig. 2, top left. Since ROF assumes a constant noise variance, this method cannot cope with the
Fig. 4. Columns, left to right: input, std. ROF, ROF + weights, anisotropic, anisotropic + higher order. Top: edge region in the ToF image (contrast enhanced). Bottom: slope region in the ToF image. Anisotropic TV combined with higher order terms leads to a better preservation of edges and contrast, while reconstructing slopes more regularly.
Fig. 5. Depth maps filtered with cross-bilateral filter (top left), IC (top right), NLmeans (bottom left) and proposed method (bottom right). The proposed method removes noise even in regions of high noise variance, while preserving edges and slopes.
spatially dependent noise variance. Accounting for the right noise model (see Fig. 2, top right), the method is able to remove the noise completely. However, both isotropic TV variants suffer from a loss of contrast, see for example Fig. 4, top row, first and second image, where the right part of the polystyrene structure is reconstructed with the wrong depth. The effect of losing contrast is well known in the literature. Anisotropic TV, however, is able to preserve the contrast (Fig. 4, top row) and, with the additional higher order term, is able to regularly reconstruct the slopes of the surfaces, see Fig. 4, bottom row. Comparison with State-of-the-Art Methods. We compare the proposed approach with several state-of-the-art methods. First, we consider a cross-bilateral filter comparable to [5], using both IR phase and amplitude. The cross-bilateral filter extends the standard method by taking the intensity image into account and uses both images to calculate the local filter kernels w_{i,j}:
w_{i,j}(k, l) = \frac{1}{\mathrm{norm}} \frac{1}{\sigma_s} e^{-\frac{\|(i,j)-(k,l)\|^2}{2\sigma_s^2}} \cdot \left( \frac{1}{\sigma_d} e^{-\frac{|d_{i,j}-d_{k,l}|^2}{2\sigma_d^2}} + \frac{1}{\sigma_A} e^{-\frac{|A_{i,j}-A_{k,l}|^2}{2\sigma_A^2}} \right).
Since a pixel is always similar to itself, we set wi,j (i, j) = 0 before normalization, to smooth single outliers (cf. [12]). As suggested in [24], we use three iterations and as in [16] decrease σd and σA by the square root of the number of iterations.
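A minimal Python sketch of one iteration of such a cross-bilateral filter is given below; it is only an illustration based on the kernel above, not the implementation used in the comparison, and the window radius and the three σ parameters are assumed inputs.

```python
import numpy as np

def cross_bilateral_step(d, A, sigma_s, sigma_d, sigma_A, radius=3):
    """One iteration of a cross-bilateral filter on the depth map d, guided by
    both depth and amplitude A; the center weight is set to zero before
    normalization to smooth single outliers."""
    out = np.zeros_like(d, dtype=float)
    acc_w = np.zeros_like(d, dtype=float)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # w_ij(i, j) = 0
            # np.roll wraps at the image border; a real implementation would pad instead.
            ds = np.roll(np.roll(d, dy, axis=0), dx, axis=1)
            As = np.roll(np.roll(A, dy, axis=0), dx, axis=1)
            w_spatial = np.exp(-(dy * dy + dx * dx) / (2.0 * sigma_s ** 2)) / sigma_s
            w_range = (np.exp(-(d - ds) ** 2 / (2.0 * sigma_d ** 2)) / sigma_d
                       + np.exp(-(A - As) ** 2 / (2.0 * sigma_A ** 2)) / sigma_A)
            w = w_spatial * w_range
            out += w * ds
            acc_w += w
    return out / np.maximum(acc_w, 1e-12)
```

As stated in the text, σ_d and σ_A would be decreased by the square root of the iteration count when the filter is applied over three iterations.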
Fig. 6. Cross section through sample piece and comparison of cross-bilateral (top left), IC (top right), NL-means (bottom left) and proposed method (bottom right). Each plot shows the input data (thin gray), high precision data (thick gray) obtained with long exposure time and the method under consideration (black). The proposed method is able to reconstruct edges and slopes with high quality. In addition, a close-up of the left part of the cross section is provided.
In addition to bilateral filtering, we apply infimal convolution (IC) [4,21] (in combination with a weighted data term) and the non-local (NL) means algorithm [3] (using the publicly available implementation by Peyre¹). In Fig. 5 we present the results of these methods applied to the depth map presented in Fig. 1. Because of the intensity image, the cross-bilateral filter has quite sharp edges, but these edges often have a halo. In areas with very low intensity and high noise, the cross-bilateral filter smooths better than its standard version (defined as in [1,23]) would, due to the homogeneous dark areas in the intensity image. However, cross-bilateral filtering does not completely compensate for the false depth data. Infimal convolution is able to reduce the noise almost completely, even in regions with strong variance. The NL-means algorithm faces problems in regions with strong noise. The proposed method is able to remove the noise completely, while keeping the edges sharp.
¹ http://www.mathworks.com/matlabcentral/fileexchange/13619
Table 1. ℓ2-difference to high precision data obtained with long exposure time. The proposed method shows the smallest deviation from this data.

cross-bilateral   NL-means   infimal convolution   proposed
0.036866          0.036627   0.039305              0.033600
We investigate the reconstruction of edges and slopes on a second data set, see Fig. 3, where we focus on the cross section indicated by the black line. The results of the above methods along this cross section are depicted in Fig. 6. We compare each result with a depth map taken with a long exposure time (thick gray line), which is thus more accurate than the original input data (thin gray line). The proposed approach is able to reconstruct the slopes better than NL-means. Moreover, the edges are reconstructed more sharply than by the infimal convolution or cross-bilateral approach, see e.g. the left part of the cross section. The quantitative comparison of these reconstructions, see Table 1, shows that the reconstruction by the proposed method is the most accurate.
5 Conclusion and Outlook
We have proposed an adaptive TV approach to denoise ToF data, where adaptivity is determined based on an extended structure tensor using the full ToF signal (amplitude and phase). Our ansatz allows us to regularize the depth map in view of its geometric properties, e.g., edges and slopes. A comparison to state-of-the-art methods shows that our approach better reconstructs the depth. In future work we will further improve the regularization in view of the true 3D geometry of the data and we will perform a detailed evaluation using ground truth. Acknowledgements. The work presented in this article has been co-financed by the Intel Visual Computing Institute. The content is under the sole responsibility of the authors.
References
1. Aurich, V., Weule, J.: Non-linear Gaussian filters performing edge preserving diffusion. In: Proceed. 17. DAGM-Symposium (1995)
2. Bredies, K., Kunisch, K., Pock, T.: Total Generalized Variation. SIAM J. Imaging Sciences 3(3), 492–526 (2010)
3. Buades, A., Coll, B., Morel, J.M.: A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 4(2), 490–530 (2005)
4. Chambolle, A., Lions, P.-L.: Image recovery via total variation minimization and related problems. Numerische Mathematik 76, 167–188 (1997)
5. Chan, D., Buisman, H., Theobalt, C., Thrun, S.: A noise-aware filter for real-time depth upsampling. In: Proc. of ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications, pp. 1–12 (2008)
6. Edeler, T., Ohliger, K., Hussmann, S., Mertins, A.: Time-of-flight depth image denoising using prior noise information. In: Proceedings ICSP, pp. 119–122 (2010)
7. Frank, M., Plaue, M., Hamprecht, F.A.: Denoising of continuous-wave time-of-flight depth images using confidence measures. Optical Engineering 48 (2009)
8. Frank, M., Plaue, M., Rapp, K., Köthe, U., Jähne, B., Hamprecht, F.A.: Theoretical and experimental error analysis of continuous-wave time-of-flight range cameras. Optical Engineering 48(1), 13602 (2009)
9. Grasmair, M., Lenzen, F.: Anisotropic Total Variation Filtering. Appl. Math. Optim. 62(3), 323–339 (2010)
10. Schöner, H., Moser, B., Dorrington, A.A., Payne, A.D., Cree, M.J., Heise, B., Bauer, F.: A clustering based denoising technique for range images of time of flight cameras. In: CIMCA/IAWTIC/ISE 2008, pp. 999–1004 (2008)
11. Jovanov, L., Pizurica, A., Philips, W.: Fuzzy logic-based approach to wavelet denoising of 3D images produced by time-of-flight cameras. Opt. Express 18, 22651–22676 (2010)
12. Kindermann, S., Osher, S., Jones, P.W.: Deblurring and denoising of images by nonlocal functionals. Multiscale Model. Simul. 4(4), 1091–1115 (2005) (electronic)
13. Köthe, U.: Edge and junction detection with an improved structure tensor. In: Michaelis, B., Krell, G. (eds.) DAGM 2003. LNCS, vol. 2781, pp. 25–32. Springer, Heidelberg (2003)
14. Lenzen, F., Becker, F., Lellmann, J., Petra, S., Schnörr, C.: Variational image denoising with constraint sets. In: Proceedings SSVM (in press, 2011)
15. Moser, B., Bauer, F., Elbau, P., Heise, B., Schöner, H.: Denoising techniques for raw 3D data of ToF cameras based on clustering and wavelets. In: Proc. SPIE, vol. 6805 (2008)
16. Paris, S., Durand, F.: A fast approximation of the bilateral filter using a signal processing approach. Technical report, MIT - CSAIL (2006)
17. Plaue, M.: Analysis of the PMD imaging system. Technical report, Interdisciplinary Center for Scientific Computing, University of Heidelberg (2006)
18. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D 60(1-4), 259–268 (1992)
19. Scherzer, O.: Denoising with higher order derivatives of bounded variation and an application to parameter estimation. Computing 60, 1–27 (1998)
20. Schmidt, M., Jähne, B.: A physical model of time-of-flight 3D imaging systems, including suppression of ambient light. In: Kolb, A., Koch, R. (eds.) Dyn3D 2009. LNCS, vol. 5742, pp. 1–15. Springer, Heidelberg (2009)
21. Setzer, S., Steidl, G., Teuber, T.: Infimal convolution regularizations with discrete l1-type functionals. Comm. Math. Sci. 9, 797–872 (2011)
22. Steidl, G., Teuber, T.: Anisotropic smoothing using double orientations. In: Proceedings SSVM 2009, pp. 477–489 (2009)
23. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Proceedings of the Sixth International Conference on Computer Vision (ICCV 1998), p. 839 (1998)
24. Weiss, B.: Fast median and bilateral filtering. ACM Trans. Graph. 25(3), 519–526 (2006); Proceedings of ACM SIGGRAPH 2006
Efficient City-Sized 3D Reconstruction from Ultra-High Resolution Aerial and Ground Video Imagery

Alexandru N. Vasile¹, Luke J. Skelly¹, Karl Ni¹, Richard Heinrichs¹, and Octavia Camps²

¹ Massachusetts Institute of Technology - Lincoln Laboratory, Lexington, MA, USA
² Dept. of Electrical and Computer Engineering, Northeastern University, Boston, MA, USA
{alexv,skelly,karl.ni}@ll.mit.edu, [email protected]
Abstract. This paper introduces an approach for geo-registered, dense 3D reconstruction of city-sized scenes using a combination of ultra-high resolution aerial and ground video imagery. While 3D reconstructions from ground imagery provide high-detail street-level views of a city, they do not completely cover the entire city scene and might have distortions due to GPS drift. Such a reconstruction can be complemented by aerial imagery to capture missing scene surfaces as well as improve geo-registration. We present a computationally efficient method for 3D reconstruction of city-sized scenes using both aerial and ground video imagery to obtain a more complete and self-consistent geo-registered 3D city model. The reconstruction results of a 1x1km city area, covered with a 66 Mega-pixel airborne system along with a 60 Mega-pixel ground camera system, are presented and validated to geo-register to within 3 m of prior airborne-collected LiDAR data.
1 Introduction

Automatic 3D reconstruction and geo-location of buildings and landscapes from images is an active research area. Recent work by [1][2][3][4] has shown the feasibility of 3D reconstruction using tens of thousands of ground-level images from both unstructured photo collections, such as Flickr, as well as more structured video collections captured from a moving vehicle [5][6][7][8], with some algorithms incorporating GPS data for geo-location when available [6][9][10][11]. While 3D reconstructions from ground imagery provide high-detail street-level views of a city, the resulting reconstructions tend to be limited to vertical structures, such as building facades, missing a lot of the horizontal structures, such as roofs, or flat landscape areas, thus leading to an incomplete model of the city scene [1][6]. Furthermore, when using GPS data for geo-location, the 3D model's geo-registration accuracy and precision might suffer since street-level GPS solutions are poor, particularly amidst tall buildings on narrow streets due to multipath reflection errors [6][12]. For video collects with GPS captured from moving vehicles, the resulting 3D model is typically composed of a single connected component that might have internal distortions due to GPS drift or discontinuities [6][10]. For unstructured ground photo collections, the issue of geo-registration is further exacerbated as only a subset of images might have
GPS metadata, with typical city-sized reconstructions composed of multiple unconnected 3D models representing popular touristic sites or landmarks [1][2][4], where each connected component might have few or no GPS tie points. Recent work by [4] attempts to resolve this problem; however, they require additional GIS building data as a unifying world model to connect the various disconnected 3D scene clusters. While ground level 3D reconstructions do not capture a complete model of a city's surface scene, they can be complemented by adding aerial imagery, which has wider area coverage along with inherently more accurate aerial GPS data. A reconstruction using a combination of aerial and ground imagery might lead to a 3D city model that has both a high level of detail and wide area coverage. Furthermore, by using aerial imagery to create a reference, geo-registered and self-consistent 3D world model, we might be able to improve both absolute geo-registration accuracy as well as precision of the previously unconnected 3D ground reconstructions. In this paper, we develop an efficient method that utilizes both ground video imagery as well as ultra-high resolution aerial video imagery to reconstruct a more complete 3D model of a large (1x1km) city-sized scene. The method starts out with two similar structure-from-motion (SFM) algorithms to process the aerial and ground imagery separately. We developed an SFM processing chain similar to [1][2], with several improvements to take advantage of inherent video constraints as well as GPS information to reduce computational complexity. The two separate 3D reconstructions are then merged using the aerial-derived 3D model as the unifying reference frame to correct for any remaining GPS errors in the ground-derived 3D scene. To quantify the improvements in geo-registration accuracy and precision, we compare the aerial-derived 3D model, the ground-derived 3D model, as well as the merged 3D reconstruction to a previously collected high-resolution 3D LiDAR map, which is considered to be truth data. To the best of our knowledge, no one has published results of city-sized reconstruction using both aerial and ground imagery to obtain a more complete 3D model, nor has the geo-location accuracy and precision of 3D reconstruction been quantified in a systematic manner over large scale areas using 3D LiDAR data as truth. We utilize an airborne 66Mpixel multiple-camera sensor operating at 2Hz to capture videos of large scale city-sized areas (2x2km), with an example image shown in Figure 1-A/B. In addition, we captured ground-based video data at 1Hz with five 12-MPix Nikon D5000 cameras using a moving vehicle, as shown in Figure 1-C/D. The algorithm was tested using 250 66-MPix aerial video frames along with 34400 ground images to create a dense 3D reconstruction of a 1x1km area of Lubbock, Texas. A LiDAR map of the city at 50cm grid sampling with 0.5 meter geo-registration accuracy is used to determine the geo-registration accuracy and precision of the various 3D reconstructed data sets. The main contributions of our paper, therefore, are:
1. An efficient SFM method that takes into account video constraints as well as GPS information, in combination with a method to merge the 3D aerial and ground reconstructions for a more complete 3D city model, with improvements in geo-registration accuracy and precision for the ground collected data.
2. The first 3D reconstruction using both aerial and ground imagery on a large city-sized scale (1x1km).
3. A detailed study showing geo-location improvements of the merged reconstruction, validated in a systematic manner over a large scale area using 3D LiDAR data as truth.
Fig. 1. Data used by the 3D reconstruction system. A) Example of a 66 MPix image captured at 2Hz by a multi-camera airborne sensor, covering a ground area of about 2x2km. B) Zoomed-in view of the same aerial image. C) A ground system of five 12-MPix cameras, covering a 180 degree field of view, collected at 1Hz. D) Example of resulting ground imagery.
The rest of the paper is organized as follows: Section 2 discusses in detail the developed algorithm along with implementation of the system. Section 3 reports the 3D reconstruction results on a 1x1 km area using aerial data, followed by the 3D reconstruction results using only the ground imagery. Qualitative as well as quantitative geo-registration results of the 3D reconstructions are reported for the aerial imagery, ground collected imagery, merged ground imagery, as well as for the combined aerial-ground reconstruction. Section 4 concludes with a discussion of the lessons learned and directions for future work.
2 Our Approach

The developed algorithm can be divided into two stages: (1) two separate 3D reconstruction pipelines for ground and aerial imagery, described in Section 2.1, and (2) a 3D merge method to fuse the two 3D reconstructions into a complete city model, described in Section 2.2.

2.1 3D Reconstruction from Video Imagery

The 3D reconstruction pipeline is similar to [1][2], with several improvements that take into account temporal video constraints as well as availability of GPS information. The processing pipeline, shown in Figure 2, can be broken up into the following stages: preprocessing, feature extraction, feature matching, geometric estimation, sparse 3D reconstruction, followed by dense 3D reconstruction and geo-registration.
Fig. 2. Structure from motion 3D reconstruction pipeline
For the pre-processing step, we first record estimates of the camera intrinsic parameters, such as focal length, and any information related to camera extrinsic parameters, such as GPS information. For the aerial imagery, the camera intrinsics are determined using prior calibration, while for the ground imagery we use JPEG-header metadata to determine an initial estimate of the focal length, as well as record the GPS information on a per video-frame basis. In the feature extraction stage, we find points of interest for each image using Lowe's SIFT descriptors [13] and store those SIFT features for each image. Next, in the feature matching stage, the SIFT descriptors of the points of interest are first matched using Lowe's ratio test [13], with an additional uniqueness test to filter out many-to-one correspondences. The matches are verified using epipolar constraints in the form of RANSAC-based estimation of the essential matrix [14][15]. Typically, the image matching stage is the most computationally expensive stage of the process. For unstructured collections of photos, where any image might match any other of the remaining images, the process typically takes O(n²) computational time, where n is the number of images in the photo collection. In our case, where we have a video sequence, we can reduce the computational complexity of the matching step by taking into account that time-neighboring video frames capture similar perspectives of the 3D scene; thus there is a high likelihood that consecutive video frames will have many feature matches to the current video frame, while video frames further separated in time might have fewer matches. We employ a simple data-driven model that assumes that for each frame i, the neighboring frames i±1 will have a maximal peak amount of matching features. We continue to match neighboring frames further out in time as long as the number of matches does not fall below 25% of the maximal peak number, or until a predetermined hard threshold of Tf frames is reached. Based on offline tests of maximal correspondence track lengths, Tf is set to 40 consecutive frames for the aerial imagery, while for the ground imagery Tf is set to 10. To account for situations where the same scene area was revisited at a later time, the above data-driven matching scheme is also performed between the current frame i and key frames i+K*m, where K is the skip frame interval (set to Tf/2) and m ∈ Z. This image matching method allows us to reduce the computational complexity of image matching from O(n²) for unstructured photo collections closer to order O(n), which leads to significant computational savings. Once pair-wise matching is completed, the final step of the matching process is to combine all the pair-wise matching information to
generate consistent tracks across multiple images [1][2][16], where each track represents a single 3D location. Once tracks are generated, the next step is to recover the pose for every camera and the 3D position for every track. Similar to [1][2], our SFM method is incremental, starting with a two-view reconstruction, adding another view and triangulating more points, while doing several rounds of non-linear least squares optimization, known as bundle adjustment, in order to minimize the re-projection error. After each bundle adjustment stage, some 3D points that have re-projection error above a certain pixel threshold are removed and a final bundle adjustment stage is run. The above process repeats again with each additional image view. The final result of this step is a set of adjusted camera intrinsic and extrinsic matrices along with a sparse 3D reconstruction of the 3D scene. Using GPS data available for each image frame, the extrinsic matrices along with the 3D reconstruction are scale corrected by comparing the variance of the bundle-adjusted camera positions to the variance of the GPS-based camera positions transformed into a Cartesian metric coordinate system. Geo-registration in Earth-Centered, Earth-Fixed (ECEF) world coordinates is performed by automatically finding a six degrees-of-freedom (rotation/translation) rigid transformation that minimizes the least squares errors between the metric-scaled camera positions and the GPS-based ECEF camera positions. The geo-registered sparse 3D reconstruction is upgraded to a dense 3D reconstruction using Furukawa et al.'s Patch-based Multi-View Stereo (PMVS) algorithm [17][18]. The results of the above process are two dense, geo-registered 3D models, one derived from aerial imagery and the other from ground imagery. The geo-registration accuracy and precision of each model is quantified by automatically aligning each 3D model to previously collected, geo-located 3D LiDAR data. Due to the possibility of poor GPS solutions obtained from ground imagery, some parts of the ground 3D reconstruction might include drift errors. In the next section, we discuss a method to correct distortions in the ground-based 3D reconstruction by using the aerial data as a reference frame.

2.2 3D Reconstruction Merge

The aerial-based reconstruction was used as a reference frame to merge the ground-derived reconstruction in order to create a more complete and self-consistent 3D model of the city. The first step was to roughly align the two 3D reconstructions to remove the overall bias between the two models. This process was done automatically using a modified version of the Iterative Closest Point (ICP) algorithm [19] with six degrees of freedom (rotation + translation) as detailed in [20]. The next step was to correct for intra-model distortions. To achieve this objective, a method was implemented to determine localized distortion by separating the 3D ground model into individual range maps derived from consecutive video frame pairs. The range maps are derived from the resulting PMVS dense correspondence metadata output alongside the dense 3D ground reconstruction. Each range map is automatically aligned using ICP to the aerial 3D model to obtain corrected camera pose locations. As each camera is used in defining two range maps, two separate pose corrections are obtained, which are averaged to obtain a single pose estimate for each camera (an improvement can be made in future implementations by using the post-ICP residual
error to determine a weighted average of the two pose estimates). After applying the pose correction to all the ground collected images in the region of interest, PMVS is re-run to obtain an improved dense reconstruction. For efficiency, initial sparse correspondence data from the original PMVS run is re-used in this second PMVS run.
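The merge step can be summarized by the following Python skeleton. It is a sketch only: icp_align stands for any 6-DOF ICP routine and is not a real library call, the per-camera bookkeeping is simplified, and the unweighted averaging of the two corrections mirrors the description above rather than the residual-weighted variant mentioned as future work.

```python
import numpy as np

def merge_pose_corrections(range_maps, aerial_points, icp_align):
    """Sketch of the merge step: each range map (one per consecutive ground
    frame pair) is rigidly aligned to the aerial model, and the resulting
    corrections per camera are averaged.
    range_maps:    list of (points, (cam_a, cam_b)) tuples
    aerial_points: (M, 3) reference aerial point cloud
    icp_align:     assumed callable returning a 4x4 rigid transform that
                   aligns 'points' to 'aerial_points' (not a real library call)
    """
    per_camera = {}
    for points, cams in range_maps:
        T = icp_align(points, aerial_points)   # 6-DOF ICP correction
        for cam in cams:                        # each camera defines two range maps
            per_camera.setdefault(cam, []).append(T)

    corrections = {}
    for cam, Ts in per_camera.items():
        # Unweighted average of the corrections; the post-ICP residuals could be
        # used for a weighted average instead.
        R_mean = sum(T[:3, :3] for T in Ts) / len(Ts)
        U, _, Vt = np.linalg.svd(R_mean)        # project the averaged matrix back onto SO(3)
        R = U @ Vt
        if np.linalg.det(R) < 0:                # keep a proper rotation
            U[:, -1] *= -1
            R = U @ Vt
        T_avg = np.eye(4)
        T_avg[:3, :3] = R
        T_avg[:3, 3] = sum(T[:3, 3] for T in Ts) / len(Ts)
        corrections[cam] = T_avg
    return corrections
```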
Fig. 3. Merge Method. The ground reconstruction is split into range maps derived from image pairs. Each range map is aligned to the aerial reconstruction using 6-DOF ICP to obtain a corrected pose estimate. The pose estimates are then used to correct the dense ground reconstruction by re-running the PMVS algorithm to obtain a refined dense ground reconstruction.
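Both the geo-registration of Sect. 2.1 and the initial rough alignment of the merge require estimating a (scaled) rigid transform between two point sets. The sketch below gives the standard closed-form least-squares (Procrustes/Kabsch) construction applied to bundle-adjusted versus GPS-derived ECEF camera centers; it is an illustration of the idea under one natural reading of the variance comparison, not the authors' implementation.

```python
import numpy as np

def georegister_cameras(cams_ba, cams_gps_ecef):
    """cams_ba: (N, 3) bundle-adjusted camera centers (arbitrary scale);
    cams_gps_ecef: (N, 3) GPS camera positions in a Cartesian (ECEF) frame.
    Returns scale s, rotation R and translation t such that
    cams_gps_ecef ~= s * (R @ cams_ba.T).T + t."""
    mu_a, mu_b = cams_ba.mean(axis=0), cams_gps_ecef.mean(axis=0)
    Xa, Xb = cams_ba - mu_a, cams_gps_ecef - mu_b
    # Scale correction: ratio of the spreads (variances) of the two point sets.
    s = np.sqrt((Xb ** 2).sum() / (Xa ** 2).sum())
    # Least-squares rotation (Kabsch) between the scaled model and the GPS positions.
    U, _, Vt = np.linalg.svd(Xb.T @ (s * Xa))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid a reflection
    R = U @ D @ Vt
    t = mu_b - s * (R @ mu_a)
    return s, R, t
```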
3 Results

Aerial imagery was collected over Lubbock, Texas using a 66Mpix airborne multi-camera sensor, flying at an altitude of 800 meters in a circular race-track pattern, with a collection line-of-sight 30 degrees off nadir. Using a single 360 degree race-track, consisting of 250 images, a 3D reconstruction was computed using the algorithm detailed in Section 2.1. It is worth noting that the 3D reconstruction algorithm does not require a 360 degree view of an area to perform a good reconstruction; the algorithm has been successfully tested for other more general flight paths such as straight fly-bys over an area. Ground video imagery was collected using a pickup truck mounted with a 60 MPix multi-camera sensor on top of the cabin roof. Of the 125000 images collected, 34400 images overlapped in coverage with the aerial platform and were used to perform a 3D reconstruction in the region of interest. As the aerial data is collected from above at 30 degrees off nadir, one would expect that the aerial data captures well the sides of many but not all the buildings. On the other hand, the ground photos capture primarily the building facades. This might lead to some concern as there might be no overlap in certain areas when merging the ground and the aerial model. In our experience, the situation did not arise in our data set as most of the buildings are not very tall and are fairly well spaced apart. Such a situation might be of concern for a Manhattan-like city collect and could be resolved by imposing minimum thresholds on scene overlap and residual error.

3.1 Reconstruction from Aerial Video Imagery

The results of the 3D aerial reconstruction are qualitatively shown in Figure 4. The figure shows the Texas Tech campus, along with its distinctive football stadium. Multiple zoomed-in views of the stadium and other campus buildings are rendered using MeshLab [21] to capture the quality of the 3D reconstruction results. The dense reconstruction has approximately 23 million points, with a 20cm pixel ground
Fig. 4. Aerial 3D reconstruction of 1x1km area of Lubbock, Texas using a 250 frame 66MPix video sequence. 3D rendering of the data was achieved using MeshLab.
sampling distance, and range resolution of approximately 1 meter (AC units on rooftops can be resolved in height). One can visually determine that we were able to find a good 3D metric reconstruction (90° angles are preserved), to within a similarity transformation.

3.2 Geo-location of Aerial Reconstruction

The 3D model was geo-registered as described in Section 2.1, using the GPS metadata available for each video frame. The quality of the geo-registration was also verified using 3D LiDAR truth data. Figure 5-A shows a geo-located 3D LiDAR truth data set collected at an earlier date from a separate airborne sensor. The LiDAR data is geo-located to within 0.5 meters and sampled on a rectangular grid with 50 cm grid spacing. The range/height resolution of the data is about 30 cm. Figure 5-B shows the 3D aerial reconstruction in the same coordinate space as the 3D LiDAR data, in order for the viewer to get a rough comparison of the coverage area and notice that the two
Fig. 5. Qualitative geo-registration results of aerial reconstruction. A) 3D LiDAR map of Lubbock, Texas displayed using height color-coding, where blue represents low height, yellow/red represent increasing height. B) Geo-registered 3D aerial reconstruction. C) 3D LiDAR truth data superimposed onto the 3D aerial reconstruction. Notice that there is no doubling of buildings or sharp discontinuities between the two data sets, indicating a good geo-registration.
Fig. 6. Quantitative geo-registration results of aerial reconstruction. A) Histogram of initial geo-registration error with bias of 2.52m indicating good geo-registration accuracy. The σ=0.54 m indicates low internal geometric distortion (high precision). B) Histogram of geo-registration error after applying ICP. The bias has been reduced by 2.5 times to 0.84m.
data sets appear well aligned. Figure 5-C shows the two data sets superimposed to qualitatively demonstrate that we obtained a good geo-registration. A quantitative study is performed to determine geo-location accuracy and precision of the 3D aerial reconstruction by automatically aligning the aerial reconstruction to the 3D LiDAR truth dataset using an ICP algorithm with six degrees of freedom (rotation/translation) [21]. The results are shown in Figure 6-A: the bias of the geo-registration is 2.52 meters with σ = 0.54 meters. The results validate that we have good geo-location accuracy, to within about 3 meters, while the low value of σ = 0.54m indicates low geometric distortion, thus high geo-location precision. It is worth highlighting that the above results are the geo-registration errors prior to ICP alignment; in the above study, ICP is only used to find 3D correspondences and verify the initial goodness of geo-registration. Thus, using just GPS metadata collected with the aerial video imagery, we obtained good geo-location accuracy to within 3 meters and high geo-location precision to within 0.5 meters. Figure 6-B shows the geo-registration statistics after ICP alignment, with the bias reduced by about 2.5 times to 0.84m. The results
of Figure 6-B indicate the potential for LiDAR data, when available, to improve geo-location.

3.3 Reconstruction and Geo-location of Ground Video Imagery

Ground video data was collected in Lubbock, Texas covering the same area as the airborne sensor. The data was collected using five GPS-enabled 12-Mpix Nikon D5000 cameras with a 1Hz frame update. Figure 7-A shows the captured GPS locations superimposed on a satellite image visualization using Google Earth. Figure 7-B/C shows the resulting 3D reconstruction. To appreciate the 3D reconstruction quality, Figure 7-D/E captures zoomed-in views of the reconstruction near a stadium with colored texture derived from the underlying RGB video frames.
Fig. 7. Qualitative results of ground reconstruction. A) Ground recorded GPS points overlaid onto a satellite image. B,C) Ground reconstruction captured from two different views in height-above-ground color-coding (lowest height corresponds to purple, blue/green/red correspond to increasingly higher altitudes). D,E) Zoomed-in views of 3D reconstruction near a football stadium with RGB color-map information obtained from the reconstructed images.
Using the same procedure as for the aerial reconstruction, we geo-registered the 3D ground model by comparing the GPS data captured for each frame to the bundle-adjusted camera locations. Figure 8-A/B qualitatively captures the initial geo-location error (prior to ICP alignment): comparison of the 3D LiDAR data in Figure 8-A to the superimposed 3D ground reconstruction onto the 3D LiDAR data in Figure 8-B reveals large geo-location errors, with doubling of building surfaces. The statistics of the geo-registration error prior to ICP alignment are shown in Figure 8-C. From Figure 8-C, we can determine that we have poor geo-location accuracy, with a geo-registration bias of 9.63 meters, as well as poor geo-location precision, with a σ=2.01
Fig. 8. Geo-location of ground reconstruction. A) Qualitative view of the 3D LiDAR truth data, B) Same 3D LiDAR data superimposed with ground reconstruction showing doubling of buildings due to large geo-location bias. C) Histogram of geo-registration errors: the bias is 9.63m with a σ=2.01m. D) Example of GPS errors encountered amidst taller buildings, which lead to poor geo-location of ground data.
meters, indicating that significant distortions exist within the model. Thus, due to poor GPS ground solutions, the geo-registration of the ground reconstruction is significantly worse compared to the 3D aerial reconstruction. The reason for these higher geo-registration errors is the presence of GPS outlier data and bias that is correlated in time, especially on streets with tall buildings, as shown in Figure 8-D.

3.4 Geo-Registration of Aerial-Ground Reconstruction after 3D Merge

In order to obtain a better geo-location of the ground 3D data, we apply the 3D Merge algorithm described in Section 2.2. The ground 3D data, along with the PMVS dense point correspondence information, is used to derive range maps from consecutive video frame pairs. New pose estimates are found for each camera and used to perform a structure-only bundle adjustment operation on the dense ground reconstruction. The result is a merged 3D aerial-ground reconstruction that is now self-consistent. Figure 9-A/B quantitatively captures the intra-registration errors between the ground and aerial 3D data before and after the merge. The overall bias is reduced by an order of magnitude from 8.86m to 0.83m, while the intra-data distortion is reduced from σ=2.5m to σ=0.59m. Thus, the merge method was able to successfully remove the bias term and also reduce the intra-registration distortion by 4x in order to produce a more self-consistent, complete 3D city model.
Fig. 9. Improvement of geo-registration after applying the 3-D Merge algorithm. A) Intra-registration errors between the 3D ground and aerial data before 3-D Merge, with bias of 8.86m and σ=2.52m. B) Remaining intra-registration errors after the 3-D Merge procedure, with remaining bias of 0.83m and σ=0.59m. The merge method removed most of the bias term, and reduced distortions within the ground data set by 4x from σ=2.52m to σ=0.59m.
The combined aerial-ground 3D city model was verified against 3D LiDAR data to determine the final geo-registration error. Results indicate a geo-location accuracy (bias) of 2.82m, with a geo-location precision of σ=0.74m. As expected, the final geo-location accuracy of 2.82m is limited by how well the aerial data was initially geo-located, which in our case was with an accuracy/bias of 2.52m. The overall geo-location precision, standing at σ=0.74m, is lower-bounded by both the geo-location precision of the aerial reconstruction (σ=0.54m) as well as the precision of the ground reconstruction after the 3D merge (σ=0.59m). Thus, the combined aerial-ground data set has geo-registration accuracy to within approximately 3 meters with geo-location precision on the order of 1m.
4 Conclusion and Future Work

In this paper, we developed an algorithm to create a well geo-located, self-consistent 3D city model using high-resolution ground and aerial imagery. We implemented a video 3D reconstruction algorithm that is applicable to both aerial and ground collected video data, obtained two separate 3D reconstructions and verified each reconstruction against truth 3D LiDAR data. The aerial reconstruction, due to more accurate GPS data, had good geo-location accuracy, with a bias of 2.52m compared to the ground 3D reconstruction, which had a bias of 9.63m. Furthermore, the geo-location precision of the aerial imagery is significantly better, with a σ=0.54m indicating low internal geometric distortion, while the ground imagery has a σ=2.01m, indicating significant geometric distortion. By using the aerial 3D data as a reference frame and merging the 3D ground data, we obtained an improved, self-consistent 3D city model, with the intra-data distortion lowered from σ=2.5m to σ=0.59m, while removing most of the initial bias of 8.86 m between the two data sets. The combined city model has geo-registration accuracy to within approximately 3 meters with geo-location precision on the order of 1m. Future work involves parallelizing the 3D reconstruction technique to apply it to a 1 Giga-pixel aerial sensor for a more high-detailed and wider area 3D aerial reconstruction. We are also exploring statistical methods to 3D geo-locate single images or video data using pose and scale invariant features.
References
1. Agarwal, S., Snavely, N., Simon, I., Seitz, S.M., Szeliski, R.: Building Rome in a Day. In: ICCV (2009)
2. Frahm, J.-M., Georgel, P., Gallup, D., Johnson, T., Raguram, R., Wu, C., Jen, Y.-H., Dunn, E., Clipp, B., Lazebnik, S., Pollefeys, M.: Building Rome on a Cloudless Day. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 368–381. Springer, Heidelberg (2010)
3. Bujnak, M., Kukelova, Z., Pajdla, T.: 3D reconstruction from image collections with a single known focal length. In: ICCV (2009)
4. Strecha, C., Pylvanainen, T., Fua, P.: Dynamic and Scalable Large Scale Image Reconstruction. In: CVPR (2010)
5. Frahm, J.-M., Pollefeys, M., Lazebnik, S., Zach, C., Gallup, D., Clipp, B., Raguram, R., Wu, C., Johnson, T.: Fast Robust Large-scale Mapping from Video and Internet Photo Collections. In: ISPRS (2010)
6. Pollefeys, M., Nister, D., Frahm, J., Akbarzadeh, A., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Kim, S., Merrell, P.: Detailed Real-Time Urban 3D Reconstruction from Video. IJCV 78(2-3), 143–167 (2008)
7. Micusik, B., Kosecka, J.: Piecewise Planar City 3D Modeling from Street View Panoramic Sequences. In: CVPR (2009)
8. Lee, T.: Robust 3D Street-View Reconstruction using Sky Motion Estimation. In: 3DIM 2009 in conjunction with ICCV (2009)
9. Fruh, C., Zakhor, A.: An Automated Method for Large-scale, Ground-based City Model Acquisition. IJCV 60(1) (2004)
10. Agrawal, M., Konolige, K.: Real-time localization in outdoor environments using stereo vision and inexpensive GPS. In: ICPR, vol. 3, pp. 1063–1068 (2006)
11. Yokochi, Y., Ikeda, S., Sato, T., Yokoya, N.: Extrinsic Camera Parameter Based-on Feature Tracking and GPS Data. In: ICPR, pp. 369–378 (2006)
12. Modsching, M., Kramer, R., ten Hagen, K.: Field trial on GPS Accuracy in a medium size city: The influence of built-up. In: WPNC (2006)
13. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
14. Hartley, R.I., Zisserman, A.: Multiple View Geometry. Cambridge University Press, Cambridge (2004)
15. Nistér, D.: An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 26(6), 756–770 (2004)
16. Snavely, N.: Scene Reconstruction and Visualization from Internet Photo Collections. Doctoral thesis, University of Washington (2008)
17. Furukawa, Y., Ponce, J.: Accurate, Dense, and Robust Multi-View Stereopsis. IEEE Trans. on Pattern Analysis and Machine Intelligence (2009)
18. Furukawa, Y., Ponce, J.: Patch-based Multi-View Stereo Software, http://grail.cs.washington.edu/software/pmvs
19. Besl, P., McKay, N.: A method of registration of 3-D shapes. IEEE Trans. Pattern Analysis and Machine Intelligence 12(2), 239–256 (1992)
20. Rusinkiewicz, S., Levoy, M.: Efficient variants of the ICP algorithm. In: Third International Conference on 3D Digital Imaging and Modeling (3DIM), pp. 145–152 (June 2001)
21. MeshLab, http://meshlab.sourceforge.net/
Non-Parametric Sequential Frame Decimation for Scene Reconstruction in Low-Memory Streaming Environments

Daniel Knoblauch¹, Mauricio Hess-Flores², Mark A. Duchaineau³, Kenneth I. Joy², and Falko Kuester¹

¹ University of California, San Diego, USA {dknoblau,fkuester}@ucsd.edu
² University of California, Davis, USA {mhessf,kijoy}@ucdavis.edu
³ Lawrence Livermore National Laboratory, Livermore, USA [email protected]
Abstract. This paper introduces a non-parametric sequential frame decimation algorithm for image sequences in low-memory streaming environments. Frame decimation reduces the number of input frames to increase pose and structure robustness in Structure and Motion (SaM) applications. The main contribution of this paper is the introduction of a sequential low-memory work-flow for frame decimation in embedded systems where memory and memory traffic come at a premium. This approach acts as an online preprocessing filter by removing frames that are ill-posed for reconstruction before streaming. The introduced sequential approach reduces the number of needed frames in memory to three in contrast to global frame decimation approaches that use at least ten frames in memory and is therefore suitable for low-memory streaming environments. This is moreover important in emerging systems with large format cameras which acquire data over several hours and therefore render global approaches impossible. In this paper a new decimation metric is designed which facilitates sequential keyframe extraction fit for reconstruction purposes, based on factors such as a correspondence-to-feature ratio and residual error relationships between epipolar geometry and homography estimation. The specific design of the error metric allows a local sequential decimation metric evaluation and can therefore be used on the fly. The approach has been tested with various types of input sequences and results in reliable low-memory frame decimation robust to different frame sampling frequencies and independent of any thresholds, scene assumptions or global frame analysis.
1 Introduction

There has been a significant amount of research in the area of Structure and Motion (SaM) in recent years. The approaches have matured enough to allow for reliable reconstructions from image or video sequences. Lately, more work has been introduced to automate these approaches. One important step in SaM is the decimation of input frames, in order to reduce the computational load but also to discard frames that could possibly lead to pose and structure degeneracies. Most approaches introduced in previous
work apply a global frame decimation for parts of the captured image sequences. This not only results in high memory consumption, as many frames have to be buffered, but is also not suitable for streaming environments, as high delays are introduced. These kinds of environments include reconstructions based on image streams acquired through embedded systems, where memory and network bandwidths are restricted and therefore require a sequential frame decimation before streaming. This paper introduces a new low-memory, non-parametric, sequential work-flow for frame decimation in streaming environments, without the use of thresholds or scene-dependent knowledge. To enable this sequential work-flow, a new frame decimation metric optimized for non-parametric, sequential frame decimation is introduced. Compared to decimation metrics introduced in previous work, the new metric is designed to have one global maximum, representing a good keyframe pair, at each evaluation step. The goal of this approach is to avoid input frames that result in error-prone reconstructions and reduce the total amount of streamed data by filtering non-suitable frames as soon as possible. Reconstruction errors can be introduced by degenerate camera poses, numerical errors in triangulation with small baselines or bad correspondences caused by large baselines which result in more occlusions. Degenerate camera poses can result from little or no translation, in which case epipolar geometry estimation is respectively ill-posed or undefined. Numerical errors in triangulation on the other hand are introduced by small baselines in relation to the depth of the viewed scene. This is because near-parallel rays yield triangulated points that have a potentially large uncertainty in their computed positions. Lastly, large baselines introduce errors in correspondences as more occlusions appear. The main contribution of the introduced approach is to reduce errors in multi-view reconstructions from the mentioned error sources with a low-memory sequential frame decimation work-flow and a newly introduced error metric, where no thresholds are used nor assumptions about the scene are made. The error metric is based on errors evaluated from different camera motion models and an analysis of the number of obtained correspondences in relation to the number of possible features in the observed scene. The sequential frame decimation is especially suited for pre-streaming filtering of frames in low-memory, low-bandwidth environments.
2 Previous Work

Structure and Motion from image sequences or videos has been a major focus in computer vision. Pollefeys et al. [1] introduced a multi-view reconstruction approach based on uncalibrated hand-held cameras. Nistér [2] reconstructed scenes with a hierarchy of trifocal tensors from an uncalibrated frame sequence. Every algorithm based on uncalibrated images extracts, as a first step, correspondences between frame pairs. These correspondences are then used to extract the epipolar geometry or, in other words, the fundamental matrix and the corresponding camera poses. The most often-used feature matching algorithms are the scale-invariant feature transform (SIFT) by Lowe [3] and the speeded-up robust features (SURF) by Bay et al. [4]. With the automation of SaM from image sequences and videos, the challenge of finding good image pairs for pose estimation and reconstruction becomes apparent. Nistér [5]
introduced a frame decimation algorithm based on global motion estimation between frames and a sharpness measure to remove redundant frames. Ahmed et al. [6] recently introduced a global frame decimation algorithm based on the number of correspondences, the geometric robust information criterion (GRIC) [7] and a point-to-epipolar line cost between frame pairs. Both approaches analyze the given frames in a global manner and are therefore well-suited for global decimation once all frames are available. These approaches also rely on empirically chosen thresholds for the frame decimation decision. Royer et al. [8] introduced a simple sequential frame decimation algorithm for robotic applications. Their frame decimation decision is based on the number of available correspondences between keyframes; their method tries to decimate as many frames in-between keyframes as possible without going below an empirically chosen number of correspondences. Torr et al. [9] use their previously introduced GRIC approach to improve the correspondence track extraction over several frames by analysing whether the epipolar geometry or a homography is a better motion model for the given frames. The active search algorithm proposed by Davison [4] performs a global analysis of frames to decide which one adds the most new information to a multi-view reconstruction. Beder and Steffen [10] introduced an error metric to analyse the goodness of pairs of camera poses based on the uncertainty of reconstructed 3D points. This error metric is only focused on good camera poses but does not estimate the possible goodness of the correspondences.
3 Frame Decimation

This paper introduces a sequential frame decimation approach that reduces the amount of input images for the SaM algorithm by filtering frames that could result in degenerate camera poses and numerically unstable triangulations. At the same time it ensures that large baselines are avoided, as this results in less accurate correspondences based on more occlusions. The sequential frame decimation is well suited for low-memory, streaming environments. Figure 1 shows the basic work-flow of this approach. The primary goal is to find reliable consecutive keyframes to allow a more robust multi-view reconstruction, which has the additional benefit of reducing the amount of data transferred to the reconstruction infrastructure. The first step of the algorithm is to extract correspondences between the last keyframe and the present candidate frame. A frame decimation metric is extracted, which is used for a sequential frame decimation decision. Never more than three frames have to be kept in memory thanks to this sequential approach. It will be shown in the remainder of this paper that this approach results in a reduction of error between consecutive pairwise reconstructions and in the final multi-view reconstruction, and finds a locally optimal keyframe sequence while keeping memory usage low.

3.1 Sequential Frame Decimation

Most previous frame decimation approaches aimed to find the globally optimal keyframe sequence and therefore required large subsets of the input frame sequence. This means that the frames have to be obtained in advance or the frames have to be buffered for a global analysis. It is obvious that these kinds of approaches are not well suited for
Fig. 1. Work flow sketch for the introduced non-parametric sequential frame decimation algorithm
streaming or low-memory environments. The sequential frame decimation approach in this paper overcomes these disadvantages by decimating frames on the fly. As a result, only data from three frames, the last keyframe, the last frame, and the present frame, have to be kept in memory, compared to at least ten frames in other approaches [6]. The introduced sequential work-flow can be used as long as the error metric is designed to have one global maximum, which corresponds to the next keyframe. This paper introduces a new decimation metric that incorporates a GRIC-based comparison of epipolar geometry versus homography residuals and the ratio of good correspondences to possible features for each evaluated frame pair. This makes it well suited for the introduced sequential frame decimation approach. The sequential frame decimation is performed by comparing the frame decimation metric fG of keyframe k with that of frame k + i. The counter i is increased as long as fG(k, k + (i − 1)) ≤ fG(k, k + i). As soon as the first frame pair with a positive fG and a significant decrease of the metric is found, the previous frame pair with a positive fG is chosen as a keyframe pair. To start the sequential frame decimation, the first suitable frame of the input sequence is used as the first keyframe. The first possible keyframe pair can be initialized based on Beder and Steffen [10] to obtain a valid starting point. Every frame pair with fG(k, k + i) at a local maximum is used for subsequent SaM calculations, and frame k + i becomes the next start frame for further decimation. This work-flow is illustrated in Figure 1. The sequential frame decimation removes degenerate frames and reduces the number of frames with baselines so small that simple triangulation algorithms such as linear triangulation could fail, while keeping the baseline small enough to retain good correspondences.
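The following Python sketch illustrates this sequential decision rule. It is only a rough illustration of the work-flow described above, not the authors' implementation: the function frame_goodness stands for the metric fG defined in Section 3.4, the notion of a "significant" decrease is reduced to a simple decrease, and all names are our own.

```python
# Illustrative sketch (not the authors' code) of the sequential keyframe
# selection described above. `frame_goodness(a, b)` stands for the metric fG
# of Sect. 3.4; `frames` is any iterable of incoming images.

def select_keyframes(frames, frame_goodness):
    frames = iter(frames)
    keyframe = next(frames)              # first suitable frame starts the sequence
    keyframes = [keyframe]
    prev_frame, prev_fg = None, None     # last frame pair with positive fG
    for frame in frames:
        fg = frame_goodness(keyframe, frame)
        if fg < 0:                       # degenerate pair (relGRIC < 0): decimate
            continue
        if prev_fg is not None and fg < prev_fg:
            # the metric started to decrease: the previous pair was a local maximum
            keyframe = prev_frame
            keyframes.append(keyframe)
            prev_frame, prev_fg = None, None
            fg = frame_goodness(keyframe, frame)   # re-evaluate against new keyframe
            if fg < 0:
                continue
        prev_frame, prev_fg = frame, fg
    return keyframes
```

Note that only the last keyframe, the previous frame, and the present frame are held at any time, which matches the three-frame memory footprint described above.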
3.2 Camera Pose and Structure Degeneracy Detection

There are two situations in which relative camera pose estimation between two views is degenerate, the epipolar geometry is not defined, and the fundamental matrix extraction therefore fails: motion degeneracy and structure degeneracy. Motion degeneracy appears, for example, if the camera movement consists only of a rotation and no translation, or if scene points lie on certain quadric surfaces. Structure degeneracy appears if all the correspondences used to calculate the epipolar geometry are coplanar in the 3D scene. In these cases the fundamental matrix cannot be reliably computed, but the relative scene change between cameras can be described by a homography. Therefore, comparing the residual errors of these two scene representations gives insight into the frame quality for pose and structure estimation. This comparison is performed with the geometric robust information criterion (GRIC) [7], which is based not only on the goodness of fit but also on the parsimony of the two models. Equation 1 shows the relative comparison of fundamental matrix F and homography H:

relGRIC(F, H) = (GRIC(H) − GRIC(F)) / GRIC(H)    (1)

GRIC(X) is defined by equations 2 and 3:

GRIC(X) = Σ_i ρ(e_i²) + λ1·d·n + λ2·k    (2)

ρ(e_i²) = min(e_i² / σ², λ3·(r − d))    (3)
The goodness of fit is represented by the sum of squared residuals e_i² of F and H with respect to the input correspondences. The parsimony term depends on d, the number of dimensions modeled (d = 3 for F and d = 2 for H), k, the number of degrees of freedom of the model (k = 7 for F and k = 8 for H), and r, the dimension of the input data, which is r = 4 for 2D correspondences. σ² is the variance of the residual errors. Similarly to [6], we set λ1 = log(r) and λ2 = log(rn), while λ3 is a limit on the residual error, which was set to n, the number of correspondences between frames. Because F has high residuals when the baseline is small or there is no translation at all, while H has low residuals in these cases but high residuals for wider baselines, a frame pair is a better input for SaM the higher its relGRIC is; conversely, if relGRIC < 0, the frame pair is a bad input. In the sequential frame decimation work-flow, all frame pairs with a relGRIC smaller than zero are directly decimated and not used for the sequential evaluation.
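As a concrete illustration, a minimal NumPy sketch of equations (1)–(3) under the parameter choices stated above might look as follows; the squared residuals of F and H are assumed to come from a separate estimation step, and the function names are ours, not the paper's.

```python
import numpy as np

def gric(residuals_sq, sigma2, d, k, r, n):
    """GRIC score (Eqs. 2-3) for a model with squared residuals e_i^2."""
    lambda1, lambda2 = np.log(r), np.log(r * n)
    lambda3 = n                           # residual limit, as chosen in the paper
    rho = np.minimum(residuals_sq / sigma2, lambda3 * (r - d))
    return rho.sum() + lambda1 * d * n + lambda2 * k

def rel_gric(res_F_sq, res_H_sq, sigma2_F, sigma2_H, n):
    """relGRIC (Eq. 1): positive when the fundamental matrix F explains the
    correspondences better than a homography H does."""
    gric_F = gric(res_F_sq, sigma2_F, d=3, k=7, r=4, n=n)
    gric_H = gric(res_H_sq, sigma2_H, d=2, k=8, r=4, n=n)
    return (gric_H - gric_F) / gric_H
```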
3.3 Correspondence Goodness

The relGRIC value ensures that frames with a high probability of degeneracy are excluded from further SaM steps. A closer look at relGRIC and its definition shows that its value remains high, or even increases, as soon as H is no longer a good descriptor. In other words, relGRIC stays high along longer baselines between cameras, as long as there are enough good correspondences to calculate the fundamental matrix. For SaM reconstruction, however, it is essential to find as many good correspondences as possible, which usually occurs with smaller baselines, since these result in fewer occlusions and matching errors. For this reason a weighting term cW for relGRIC is introduced, which represents the probability of finding good correspondences. Correspondences are extracted based on SURF [4] features. A first guess for the size of the baseline, and therefore the possible number of occlusions, is obtained from the ratio between the features N_F found in the source keyframe and the resulting correspondences N_C with the present target frame. This initial approximation gives adequate results but does not take the possibility of wrong correspondences into account. To make the weight more reliable, only the inlier correspondences N_I from the RANSAC-based fundamental matrix calculation are used to calculate cW, as shown in equation 4:

cW = N_I / N_F    (4)
It can be seen that cW tends to decrease as the baseline grows. To make sure that the correspondences are a good representation of the given scene and cover as much of it as possible, the ratio of the correspondence area cA to the image area iA is introduced. The correspondence area cA is approximated by the axis-aligned bounding box of all inlier correspondences. To obtain a good reconstruction of the scene, this ratio aR, shown in equation 5, should stay as large as possible:

aR = cA / iA    (5)
3.4 Frame Decimation Metric

The previous sections introduced ways to detect camera pose and structure degeneracies and to evaluate camera baselines based on correspondence goodness. This section introduces a new frame decimation metric fG that combines these results into a single frame goodness measure for SaM estimation. The relGRIC metric is a relative measure of how well the epipolar geometry describes the scene compared to a homography. If relGRIC < 0, the frame pair is not well suited for reconstruction, because there is either a camera pose degeneracy or the baseline is too small. If relGRIC > 0, the pair is a candidate for pose estimation. In this case the main source of error in the reconstruction comes from the correspondences, generally due to occlusions, which tend to increase with larger baselines. The introduced decimation metric takes both terms into consideration and weights the camera goodness by the correspondence goodness. This leads to the expression for fG in equation 6:

fG = (cW · aR) · relGRIC(F, H)    (6)
It can be seen that fG is high when relGRIC is high and the correspondence weight is high. This also means that fG is at most equal to relGRIC and decreases as the baseline grows. This behaviour reflects the search for the sweet spot in baseline size with respect to the two main reconstruction error sources: camera pose and correspondences.
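A hedged sketch of how the terms of equations (4)–(6) could be combined is given below; the SURF keypoints and RANSAC inliers are assumed to be provided by a standard feature-matching pipeline, and the variable names are our own.

```python
import numpy as np

def frame_goodness_terms(keypoints_kf, inlier_pts, image_shape, relgric):
    """Combine Eqs. (4)-(6) into the decimation metric fG.
    `keypoints_kf` are the SURF features detected in the keyframe and
    `inlier_pts` the RANSAC inlier correspondences as an (N, 2) array of
    pixel coordinates in that frame."""
    n_features = len(keypoints_kf)
    n_inliers = len(inlier_pts)
    c_w = n_inliers / float(n_features)              # Eq. (4)
    # axis-aligned bounding box of the inliers approximates the covered area
    x_min, y_min = inlier_pts.min(axis=0)
    x_max, y_max = inlier_pts.max(axis=0)
    c_a = (x_max - x_min) * (y_max - y_min)
    i_a = image_shape[0] * image_shape[1]
    a_r = c_a / float(i_a)                           # Eq. (5)
    return (c_w * a_r) * relgric                     # Eq. (6)
```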
Fig. 2. Comparison of fG and its components (aR·cW, relGRIC) to the reprojection error of the pairwise reconstruction from frame k = 0 of the ‘kique’ sequence
4 Results

The presented frame decimation algorithm has been tested on a large number of publicly available image sequences covering different scene types and camera motions. Examples of the tested sequences can be seen in Figure 6. The ‘kique’ data set shown in Figure 6a consists of aerial imagery taken while flying in a circle around a city neighbourhood. The ‘medusa’ data set [1] in Figure 6b is an image sequence taken by a hand-held camera moving in a half-circle around an archeological artifact, where the movement is very jittery at times. The ‘Leuven castle’ data set in Figure 6c covers the outside of a building with a hand-held camera. Figure 6d shows the ‘castle-P30’ data set [11], which covers a building from the inner courtyard. The ‘house’ data set [12] in Figure 6e is a carefully spaced image sequence taken in a circle around a toy house. These different target scenes and camera movements are all handled well by the introduced frame decimation approach. To evaluate the goodness metric, Figure 2 plots the frame decimation values for the ‘kique’ data set together with the reprojection error of the pairwise reconstruction between the first suitable frame of the sequence and the following frames. The fG value for the keyframe pair search starting at frame k = 0 reaches its best value at frame k + 3, in agreement with the lowest reprojection error and hence the best reconstruction in this sequence. The plot also shows that frame k + 1 is a bad frame for pose estimation, as fG is negative, which is supported by the reprojection error. A bigger baseline, as suggested by relGRIC (at frame k + 5), is avoided thanks to the correspondence-based weighting. The frame decimation is run on the ‘kique’ data set and results in a
Fig. 3. Comparison of fG and its components (aR·cW, relGRIC) to the reprojection error for each pairwise reconstruction with respect to keyframe k = 1, for the ‘medusa’ sequence
sequence of keyframe pairs suitable for sequential multi-view reconstruction. By keeping 38 percent of the frames as keyframes, degenerate camera poses are avoided while the camera baselines stay small enough to reduce possible problems in the correspondence calculation. This is all done on the fly with a sequential frame decimation work-flow designed for low-memory streaming environments. Figure 6a shows the first five extracted keyframes. Another example is based on the ‘medusa’ data set [1]. For this test the input video was decomposed into images at 10 frames per second. Figure 3 shows the resulting values of the frame decimation starting from the first frame k = 1 of the sequence. Based on the introduced algorithm, the next keyframe is frame k + 12, the local maximum of fG. The resulting dense pairwise reconstruction for this keyframe pair can be seen in Figure 4a. The reprojection error for this frame pair is also the local minimum. On the other hand, there are similarly low reprojection errors at frames k + 4 and k + 5. However, since relGRIC and fG are negative there, these frame pairs are degenerate. Figure 4b shows the reconstruction result with frame k + 5. The reconstruction is clearly worse than with the found keyframe. This can be explained by the fact that the baseline is too small, so the noise in the point cloud, caused by numerical errors, is much larger even though the reprojection error is small. This makes clear that pairwise reconstructions between keyframes should have low reprojection errors, but that the reprojection error by itself should not be used as the decimation metric. Between frames k + 8 and k + 11, fG zig-zags around zero and the reprojection error explodes. It is assumed that in these areas the given correspondences and camera poses have high uncertainty. These areas are, however, decimated by the
Fig. 4. Comparison of reconstructions with linear triangulation for the ‘medusa’ sequence: (a) extracted keyframe pair 1–13; (b) low-reprojection-error frame pair 1–5
Fig. 5. (a) Comparison between the number of total frames and the number of extracted keyframes for different ‘medusa’ sequence decodings; (b) reprojection error during sequential reconstruction with decimated frames. The crosses represent keyframes, and the reprojection error after every newly added keyframe is plotted.
given algorithm, since negative fG values are not taken into account, and a good keyframe pair is found. The frame decimation was run on the entire sequence and results in good keyframe extraction. In this example the input sequence is decimated to 23 percent of the original frames. This high decimation rate can be explained by the slow camera movement relative to the size of and distance to the reconstruction target: the original baselines are small and therefore prone to numerical errors in triangulation. Figure 6b shows the first five extracted keyframes. These results show that the sequential work-flow in combination with the frame decimation metric fG yields good decimations. Based on the different movements
Fig. 6. First five keyframes extracted by non-parametric sequential frame decimation for different data sets: (a) ‘kique’, (b) ‘medusa’, (c) ‘leuven castle’, (d) ‘castle-P30’, (e) ‘house’
and reconstruction targets covered by the tested data sets, this approach is suitable for a large variety of sequences. A comparison between the values of relGRIC and fG shows that in cases where every frame covers a large part of the target scene, relGRIC, and therefore the camera movement, dominates the decimation decision. The correspondence-to-feature ratio integrated into fG becomes more influential the less overlap there is in a given frame sequence. Two examples at opposite ends of this spectrum are the ‘medusa’ and ‘castle-P30’ data sets. The ‘medusa’ data set consists of a camera movement around the reconstruction target, so the fG decimation differs from relGRIC only at larger baselines. In the ‘castle-P30’ case, the fG metric results in more keyframes, as the overlap between frames is much smaller. Both cases comply with our expectations and are valuable properties for multi-view reconstruction. So far it has been shown that the introduced algorithm decimates frames that are not suited for good SaM. Several tests have also been performed with image sequences
that were carefully spaced beforehand to allow good reconstructions. Figure 6e shows one of these sequences, introduced by Zisserman et al. [12]. The frame decimation in this sequence decimates no frames, showing that there is no over-decimation. To verify that the introduced approach is independent of the spacing, in other words the frame rate of the input data, tests were conducted on a video sequence decoded at different frame rates. This was done with the ‘medusa’ data set [1], which was decoded at 5, 10, 15 and 20 fps. The frame decimation was run on all these sequences, and the number of resulting keyframes together with the corresponding number of input frames is given in Figure 5a. The number of keyframes stays fairly constant. The small variation can be explained by the fact that the introduced algorithm does not find a unique frame decimation, as the decimation is performed locally and sequentially, and by the fact that at lower frame rates possible keyframes are already excluded by the lack of spatial information. To show that this frame decimation approach is well suited for sequential multi-view reconstruction, the average reprojection error after every addition of a new keyframe to the reconstruction is extracted. Figure 5b shows the average reprojection error in pixels after the addition of every extracted keyframe. The keyframes are represented by the crosses; all other frames are decimated by the introduced approach. Overall the reprojection error is small and constant over time. This clearly shows that all the extracted keyframes are well suited for reconstruction, i.e., degenerate camera poses as well as frames with large baselines and therefore bad correspondences are avoided.
5 Conclusion

This paper introduced a non-parametric sequential frame decimation algorithm for scene reconstruction in low-memory streaming environments. The frame decimation reduces the number of input images by eliminating, on the fly, frames that could lead to erroneous pose and structure estimates. The main contribution is a sequential work-flow for frame decimation based on a newly introduced smooth frame goodness metric, designed to have a single global maximum at each keyframe evaluation. This allows sequential frame decimation without thresholds or scene assumptions. The metric is based on the ratio of good correspondences to possible features and on the geometric robust information criterion (GRIC) relating the residual errors of epipolar geometry and homography estimation. Its definition allows a local error minimization and can therefore be evaluated on the fly. Thanks to the sequential nature of this approach, less memory is used during the decimation evaluation, as only three frames have to be kept in memory at any time; previous work based on global decimation optimization uses at least ten frames at a time. The approach has been tested on multiple publicly available data sets representing different types of target scenes and camera movements. The results show reliable frame decimation that is robust to the frame sampling rate and independent of thresholds, scene assumptions or global frame analysis.
References
1. Pollefeys, M., Van Gool, L., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual modeling with a hand-held camera. International Journal of Computer Vision 59, 207–232 (2004), doi:10.1023/B:VISI.0000025798.50602.3a
2. Nistér, D.: Reconstruction from uncalibrated sequences with a hierarchy of trifocal tensors. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, pp. 649–663. Springer, Heidelberg (2000)
3. Lowe, D.: Object recognition from local scale-invariant features. In: ICCV, pp. 1150–1157. IEEE Computer Society, Los Alamitos (1999)
4. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Computer Vision and Image Understanding 110, 346–359 (2008)
5. Nistér, D.: Frame decimation for structure and motion. In: Pollefeys, M., Van Gool, L., Zisserman, A., Fitzgibbon, A.W. (eds.) SMILE 2000. LNCS, vol. 2018, pp. 17–34. Springer, Heidelberg (2001)
6. Ahmed, M., Dailey, M., Landabaso, J., Herrero, N.: Robust key frame extraction for 3D reconstruction from video streams. In: International Conference on Computer Vision Theory and Applications (VISAPP), pp. 231–236 (2010)
7. Torr, P.H.: Geometric motion segmentation and model selection. Philosophical Transactions: Mathematical, Physical and Engineering Sciences 356, 1321–1340 (1998)
8. Royer, E., Lhuillier, M., Dhome, M., Lavest, J.M.: Monocular vision for mobile robot localization and autonomous navigation. International Journal of Computer Vision 74, 237–260 (2007), doi:10.1007/s11263-006-0023-y
9. Torr, P.H., Fitzgibbon, A.W., Zisserman, A.: The problem of degeneracy in structure and motion recovery from uncalibrated image sequences. International Journal of Computer Vision 32, 27–44 (1999), doi:10.1023/A:1008140928553
10. Beder, C., Steffen, R.: Determining an initial image pair for fixing the scale of a 3D reconstruction from an image sequence. In: Franke, K., Müller, K.-R., Nickolay, B., Schäfer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 657–666. Springer, Heidelberg (2006)
11. Strecha, C., von Hansen, W., Van Gool, L., Fua, P., Thoennessen, U.: On benchmarking camera calibration and multi-view stereo for high resolution imagery. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1–8 (2008)
12. Fitzgibbon, A.W., Cross, G., Zisserman, A.: Automatic 3D model construction for turn-table sequences. In: Koch, R., Van Gool, L. (eds.) SMILE 1998. LNCS, vol. 1506, pp. 155–170. Springer, Heidelberg (1998)
Ground Truth Estimation by Maximizing Topological Agreements in Electron Microscopy Data

Huei-Fang Yang and Yoonsuck Choe
Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843-3112
Abstract. Manual editing can correct segmentation errors produced by automated segmentation algorithms, but it also introduces a practical challenge: combining multiple users’ annotations of an image to obtain an estimate of the true, unknown labeling. Current estimation methods are not suited for electron microscopy (EM) images because they typically do not take into account the topological correctness of a segmentation, which can be critical in EM analysis. This paper presents a ground truth estimation method for EM images. Taking a collection of alternative segmentations, the algorithm seeks an estimated segmentation that is topologically equivalent and geometrically similar to the true, unknown segmentation. To this end, using the warping error, a metric that measures topological disagreements between segmentations, the algorithm iteratively modifies the topology of the estimated segmentation to minimize its topological disagreements with the given segmentations. Our experimental results obtained using EM images with densely packed cells demonstrate that the proposed method is superior to majority voting and STAPLE, which are commonly used for combining multiple segmentation results.
1 Introduction
Electron microscopy (EM) image segmentation is the first step toward the reconstruction of neural circuits. However, because EM images show high variations in neuronal shapes and have ambiguities in boundary localization due to imperfect staining and imaging noise, automated segmentation algorithms sometimes generate incorrect results. Such errors require manual correction, that is, manual correction of erroneous segmentations is an important part of the neural circuit reconstruction pipeline. The user’s editing or correction improves segmentation accuracy, but it introduces a new practical challenge: the combination of multiple users’ annotations of an image to obtain an estimation of the true, unknown labeling. Several combination methods have been proposed in the literature. The simplest combination
This work was supported in part by NIH/NINDS #1R01-NS54252 and NSF CRCNS #0905041.
strategy is majority voting, which treats each individual segmentation equally and assigns to a pixel the label that most segmentations agree on. Another commonly used combination method is to weight each segmentation differently according to the performance of each user; this is referred to as global weighted voting. The simultaneous truth and performance level estimation (STAPLE) algorithm proposed by Warfield et al. [1] belongs to this category. STAPLE uses an iterative expectation-maximization algorithm to measure the performance of the experts and estimates the underlying true segmentation by optimally combining the segmentations according to each expert’s performance level. In contrast to giving the same weight to every pixel in a segmentation, local weighted voting methods [2,3] assign each pixel a different weight according to a local estimate of segmentation performance. However, these combination strategies are not adequate for EM images because they do not take into account the topological or morphological correctness of a segmentation. Ensuring that a segmentation is topologically correct is necessary to obtain an accurate neural circuit reconstruction. As pointed out by Jain et al. [4], in EM segmentation for reconstructing detailed connections between neurons, a topological error (i.e., a merge or split error) occurring at a branch causes severely erroneous neural connectivities in the entire subsequent sub-tree. Therefore, combination strategies for EM images should ensure that the estimate of the actual labeling given a set of segmentations is topologically correct. This paper presents a segmentation ground truth estimation method for EM images of brain tissue. Taking a collection of annotations of an image, the algorithm aims at providing an estimated labeling that is topologically equivalent and geometrically similar to the true, unknown segmentation. To this end, guided by the warping error evaluation metric, the algorithm iteratively modifies the topology of the estimated segmentation to maximize the topological agreements (i.e., minimize the disagreements) between the estimated segmentation and the set of given segmentations. A topological change is made by merging two adjacent regions or splitting a region into two through modifying the labels of a sequence of pixels. By gradually changing its topology, the estimated segmentation becomes topologically equivalent to the true, unknown segmentation.
2 Evaluation Metric: Warping Error
In supervised evaluation, the performance of a segmentation algorithm is quantitatively measured by comparing its results against a manually labeled ground truth (i.e., a reference image) using evaluation metrics. The Jaccard index [6], the Dice similarity coefficient (DSC), and the F-measure are well-known and widely used metrics for segmentation evaluation. They use the amount of overlap between a segmentation and the ground truth as the similarity measure, and therefore measure a segmentation’s boundary accuracy at the pixel level without taking its topological correctness into account. In EM segmentation evaluation, however, measuring the degree of topological correctness of a segmentation is
also important because accurate analysis of the neural circuits relies on topologically correct reconstructions [5]. The warping error metric proposed by Jain et al. [5] measures topological disagreements between segmentations and has been shown to be effective for EM segmentation evaluation. When comparing two segmentations, this error metric strongly penalizes topological disagreements but tolerates minor differences in boundary localization. Conceptually, to calculate the topological disagreements between a segmentation and the ground truth, the ground truth image is first transformed into another image under topological and geometrical constraints, and the disagreements (i.e., topological errors) are then identified as the pixel differences between the transformed image and the segmentation under evaluation. Before giving a formal definition of the warping error, the concept of warping is presented. Formally, given two binary images L∗ and L, if L∗ can be transformed into L by flipping the labels of a sequence of pixels, L is called a warping of L∗, denoted L ⊳ L∗. That is to say, L∗ and L are topologically equivalent and geometrically similar. The pixels whose labels may be flipped are simple points (i.e., certain border pixels), where a point p is defined as a simple point if both the number of foreground connected components adjacent to p and the number of background connected components adjacent to p equal 1.
Algorithm 1. A warping algorithm that warps a binary image L∗ to a segmentation T. Algorithm from [5].
input: a binary image L∗, a segmentation T, and a geometric constraint set G
output: a warped image L
L = L∗;
while true do
    S = simple(L) ∩ G;
    i = argmax_{j ∈ S} |t_j − l_j|;
    if |t_i − l_i| > 0.5 then
        l_i = 1 − l_i;
    else
        return L;
Fig. 1. Measured warping error of a segmentation against the ground truth. (a) A sample EM image. (b) Manually annotated ground truth. (c) A warping of the ground truth image shown in (b) given the segmentation in (d). (d) A segmentation to be evaluated. (e) Measured warping error. The pixel disagreements between (c) and (d) constitute the topological errors. In this case, two topological errors, a merge (blue circle) and a split (red circle), occur due to boundary ambiguity in the original image.
According to the theory of digital topology, flipping the label of a simple point does not alter the object’s topology. Now, letting T be the segmentation to be evaluated and L∗ be the reference annotation, the warping error D(T ⊳ L∗) is given as

D(T ⊳ L∗) = min_{L ⊳ L∗} |E(L, T)|,    (1)
where L is the optimal warping of L∗ and E(L, T) is the difference set, i.e., the set of pixels that have different labels in L and T, defined as E(L, T) = L △ T. In other words, the warping error is the pixel disagreement between the segmentation under evaluation T and the transformed segmentation L. Note that, in order to obtain the minimal warping error, the image L∗ is warped into an L that is as similar to the segmentation T as possible. The procedure for warping a labeling L∗ into another labeling L given T is detailed in Algorithm 1, where simple(L) denotes the simple points of L. Figure 1 illustrates how the warping error is identified. First, the ground truth annotation in Figure 1(b) is warped into another labeling, Figure 1(c), using Algorithm 1. The topological disagreements shown in Figure 1(e) are then calculated from the pixel differences between the labelings in Figures 1(c) and 1(d); they contain two topological errors, a merge (blue circle) and a split (red circle). These errors result from the boundary ambiguity in the original image, as can be seen in Figure 1(a).
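For illustration, a simplified Python sketch of Algorithm 1 and of the resulting warping error is given below. It uses an (8,4)-connectivity simple-point test and a brute-force search for the pixel to flip, so it is meant to convey the idea rather than to be efficient; all function names are assumptions on our part.

```python
import numpy as np
from scipy import ndimage

FG8 = np.ones((3, 3), dtype=int)                      # 8-connectivity (foreground)
BG4 = ndimage.generate_binary_structure(2, 1)         # 4-connectivity (background)

def _adjacent_components(patch, structure, offsets):
    """Connected components of a 3x3 patch (center excluded) that touch the
    center pixel through the given neighbor offsets."""
    labels, _ = ndimage.label(patch, structure=structure)
    touching = {labels[1 + dy, 1 + dx] for dy, dx in offsets if patch[1 + dy, 1 + dx]}
    touching.discard(0)
    return len(touching)

def is_simple(L, y, x):
    """Simplified simple-point test: exactly one adjacent foreground component
    (8-connected) and one adjacent background component (4-connected)."""
    patch = L[y - 1:y + 2, x - 1:x + 2].astype(bool)
    fg = patch.copy(); fg[1, 1] = False
    bg = ~patch;       bg[1, 1] = False
    n8 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    n4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return (_adjacent_components(fg, FG8, n8) == 1 and
            _adjacent_components(bg, BG4, n4) == 1)

def warp(L_star, T, G):
    """Greedy warping of L_star toward T (sketch of Algorithm 1).
    G is a boolean mask of pixels allowed to flip (geometric constraints)."""
    L = L_star.astype(np.uint8).copy()
    h, w = L.shape
    while True:
        best, best_diff = None, 0.0
        for y in range(1, h - 1):                      # brute force, not optimized
            for x in range(1, w - 1):
                if G[y, x] and is_simple(L, y, x):
                    diff = abs(float(T[y, x]) - float(L[y, x]))
                    if diff > best_diff:
                        best, best_diff = (y, x), diff
        if best is None or best_diff <= 0.5:
            return L
        L[best] = 1 - L[best]

def warping_error(T, L_star, G):
    """Pixel disagreements between T and the warped reference (Eq. 1)."""
    return int(np.count_nonzero(warp(L_star, T, G) != T))
```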
3 Ground Truth Estimation by Minimizing Warping Error
This section focuses on the main contribution of the paper: estimating a segmentation that is topologically equivalent and geometrically similar to the true, unknown segmentation when a few segmentations are available.

3.1 Problem Definition
Given a set of N segmentations S_1∗, ..., S_N∗, either produced by automated segmentation algorithms or annotated by different humans, the goal is to find an estimated segmentation Ŝ that is topologically equivalent and geometrically similar to the underlying unknown true segmentation. One segmentation that satisfies the topological and geometrical constraints and is capable of representing the true, unknown segmentation is the one whose topology most of the given segmentations agree on. In other words, the estimated segmentation Ŝ is a segmentation that minimizes the warping error between itself and the given segmentations. Mathematically, Ŝ is obtained by minimizing

Ŝ = argmin_S Σ_{i=1}^{N} D(S ⊳ S_i∗) = argmin_S Σ_{i=1}^{N} min_{S_i ⊳ S_i∗} |E(S, S_i)|,    (2)

where S_i is the optimal warping of the labeling S_i∗.
One possible way to find the estimated segmentation Ŝ is to enumerate all possible labelings and choose the one with minimal warping error. However, enumerating all labelings is computationally expensive. An alternative is to gradually change the topology of a segmentation and make it converge to a topology that most segmentations agree on. The section below details this approach.

3.2 Proposed Topological Correction Algorithm
Changing the labeling of an image involves flipping the labels of pixels, which can result in the merging of two adjacent regions or the splitting of one region into two. The candidate pixels whose label change causes a topological change are those that affect the warping error. To obtain an estimated segmentation with a topology that most segmentations agree on, the algorithm starts from an initial segmentation obtained by majority voting. At each iteration, using the number of topological errors as the evaluation metric, the algorithm corrects one topological disagreement between the estimated segmentation and the given segmentations. When correcting a topological disagreement, the algorithm selects the error with the lowest flipping cost, defined in Equation 4 and detailed in the next section. A new labeling is accepted only if it has a lower warping error. The algorithm repeats this correction process and stops when no topological change can further reduce the overall warping error, that is, when it reaches a segmentation that minimizes the warping error defined in Equation 2. Algorithm 2 details the proposed method.
Algorithm 2. Topological correction by minimizing warping error (proposed algorithm)
input: a set of labeled binary images S_1∗, ..., S_N∗
output: an estimate of the ground truth, Ŝ
initialize Ŝ to the result of majority voting given S_1∗, ..., S_N∗;
foreach S_i∗ do E_i = D(Ŝ ⊳ S_i∗);
E_min = Σ_i E_i;
while not converged do
    assign each topological error a flipping cost based on Equation 4;
    select the topological error with the lowest flipping cost;
    flip the selected pixels in the estimated segmentation Ŝ;
    foreach S_i∗ do E_i = D(Ŝ ⊳ S_i∗);
    E_new = Σ_i E_i;
    if E_new < E_min then
        accept the new estimated ground truth Ŝ (E_min = E_new);
    else
        reject the new estimate and restore Ŝ to the previous estimation;
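A high-level Python sketch of Algorithm 2 is shown below, reusing the warping_error sketch from Section 2. The helper propose_corrections, which enumerates candidate pixel sets whose flipping merges or splits regions, is hypothetical and not specified here; a fuller implementation would also try the remaining candidates before declaring convergence.

```python
import numpy as np

def estimate_ground_truth(segmentations, constraints, propose_corrections, flip_cost):
    """Sketch of Algorithm 2. `propose_corrections(S)` is a hypothetical helper
    returning candidate pixel sets C whose flipping changes the topology of S;
    `flip_cost(C)` evaluates Eq. (4); `warping_error` is the sketch from Sect. 2."""
    # start from the pixelwise majority vote of the input segmentations
    S_hat = (np.mean(segmentations, axis=0) > 0.5).astype(np.uint8)
    E_min = sum(warping_error(S_hat, S_i, constraints) for S_i in segmentations)
    while True:
        candidates = propose_corrections(S_hat)
        if not candidates:
            return S_hat
        C = min(candidates, key=flip_cost)             # cheapest topological change
        S_new = S_hat.copy()
        for (y, x) in C:
            S_new[y, x] = 1 - S_new[y, x]
        E_new = sum(warping_error(S_new, S_i, constraints) for S_i in segmentations)
        if E_new < E_min:                              # accept only if error decreases
            S_hat, E_min = S_new, E_new
        else:
            # simplification: stop at the first rejected correction
            return S_hat
```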
3.3 Topological Change Cost
As mentioned above, flipping the labels of pixels that contain warping error modifies the topology of a segmentation. To better locate the pixels for a topological change, each pixel is associated with a flipping cost, and the selection of which pixels’ labels to change depends on the cost associated with those pixels. More specifically, the flipping cost of each pixel incorporates statistical information of the image, such as the intensity distributions. To define the flipping cost, two notations are first introduced. Let S be the foreground (object) segmentation and S̄ be the background segmentation. The cost of flipping the label of a pixel p from S_p to S̄_p, f(p), is defined as

f(p) = Pr(I_p | S) / (Pr(I_p | S̄) + Pr(I_p | S)),    (3)

where I_p is the intensity value of pixel p, and Pr(I_p | S) and Pr(I_p | S̄) represent how well the intensity of pixel p fits the intensity distributions (histograms) of the foreground and background, respectively. Because a set of segmentations is given, the intensity histograms for the foreground and background are available. Figure 2(a) shows the flipping cost of changing the label of each pixel from foreground to background; brighter color indicates a higher cost. Similarly, the cost of changing the label of a pixel p from S̄_p to S_p is defined by Pr(I_p | S̄) / (Pr(I_p | S̄) + Pr(I_p | S)). The flipping cost of changing the label of each pixel from background to foreground is shown in Figure 2(b). As can be seen in the example given in Figure 1, a topological change of a segmentation requires a sequence of pixel flips. Now, let C denote the set of pixels involved in the merger of two adjacent regions or the splitting of a region. To reduce the computational complexity of calculating the flipping cost, the simple assumption is made that the pixel flips are independent of each other. Therefore, the cost f(C) of flipping all pixels in C is defined as the sum of the flipping costs of the individual pixels, that is,
Fig. 2. Flipping cost of changing the label of each pixel in an image. (a) The cost of changing a label from foreground to background. (b) The cost of changing a label from background to foreground. Brighter color indicates a higher cost.
f(C) = Σ_{p ∈ C} f(p).    (4)
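A minimal sketch of equations (3) and (4) is given below; the foreground and background histograms are assumed to be NumPy arrays normalized to empirical probabilities over a common set of intensity bins, and all names are ours.

```python
import numpy as np

def flip_costs(image, fg_hist, bg_hist, bin_edges):
    """Per-pixel flipping costs (Eq. 3) from foreground/background intensity
    histograms accumulated over the given segmentations."""
    bins = np.clip(np.digitize(image, bin_edges) - 1, 0, len(fg_hist) - 1)
    p_fg = fg_hist[bins]                  # Pr(I_p | S)
    p_bg = bg_hist[bins]                  # Pr(I_p | S_bar)
    denom = p_fg + p_bg + 1e-12
    cost_fg_to_bg = p_fg / denom          # cost of flipping foreground -> background
    cost_bg_to_fg = p_bg / denom          # cost of flipping background -> foreground
    return cost_fg_to_bg, cost_bg_to_fg

def flip_cost_of_set(C, cost_map):
    """Cost of flipping all pixels in C (Eq. 4), assuming independent flips."""
    return float(sum(cost_map[y, x] for (y, x) in C))
```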
4 Experimental Results
The experiments were carried out on a few synthetic images and on an EM data set [7,8]. The purpose of applying the proposed method to simple synthetic images was to show that it can retrieve a segmentation that is topologically equivalent to the true, unknown segmentation.

4.1 Synthetic Images
A set of segmentations of an image is required to evaluate the proposed method. Four alternative segmentations were generated; they are shown in Figures 3(a) to 3(d). The first three segmentations separate the image into two regions (i.e., the same topology) with slightly different boundary localizations, while the last (Figure 3(d)) segments the image as a single region (a different topology from the rest). These four segmentations represent four possible segmentations of an image, and they are to be combined to obtain an estimate of the underlying true segmentation. Figure 3 shows a simple comparison between the estimated segmentation obtained by the proposed method and those by majority voting and STAPLE. The four alternative segmentations are of high quality, approaching expert level: the estimated sensitivities (i.e., the probability of an annotator labeling a pixel as foreground if the true label is foreground) were 0.9991, 0.9991, 0.9991, and 1.0000, respectively, and the estimated specificities (i.e., the probability of an annotator labeling a pixel as background if the true label is background) were 1.0000, 1.0000, 1.0000, and 1.0000, respectively. The estimate of the true, unknown segmentation obtained by majority voting is shown in Figure 3(e); it is a segmentation that most segmentations agree on. However, this estimate is unable to represent the true, unknown segmentation because the two are not topologically equivalent. Unlike the majority voting method, which treats each segmentation equally, STAPLE weights each individual segmentation depending on its estimated performance level. The result estimated by STAPLE is shown in Figure 3(f). Although most of the alternative segmentations are topologically correct, such as those in Figures 3(a) to 3(c), STAPLE is unable to produce a topologically correct estimate of the true, unknown segmentation. This indicates that its estimate is sensitive to the boundary localization of the given labelings. On the other hand, when the same set of initial segmentations was given, our proposed algorithm produced a topologically correct estimate. The resulting segmentation is shown in Figure 3(g), and its close-up in Figure 3(j).

4.2 EM Data Set
A serial section Transmission Electron Microscopy (ssTEM) data set of the Drosophila first instar larva ventral nerve cord (VNC) from Cardona et al. [7,8]
Fig. 3. Comparison between the estimated segmentation obtained by the proposed method and those by majority voting and STAPLE. (a)–(d) Alternative segmentations to be fused; the first three have the same topology while the last has a different topology. (e) The estimate by majority voting merges two separate regions into one, causing a merge error. (f) The estimate by STAPLE also contains a merge error. (g) The estimate by the proposed method is topologically correct. (h)–(j) Close-ups of the results from majority voting, STAPLE, and the proposed algorithm, respectively.
was used for the evaluation of the proposed method. The data set contains 30 sections, each of which has a size of 512 × 512 pixels. The tissue is 2 × 2 × 1.5 microns in volume, with a resolution of 4 × 4 × 50 nm/voxel. The data set was manually delineated by an expert, and the manual segmentations served as the ground truth that the algorithm aims to estimate. To test the developed method, a number of segmentations were first generated by thresholding the image at different values, with additional manual editing, to construct the alternative segmentations. Note that the main focus of our work is not on how these alternative segmentations are generated, so any reasonable manual or automated method is sufficient. These generated segmentations represent the alternative segmentations to be combined. Taking them as input, the proposed method produced an estimated segmentation. Figure 4(a) shows the initial segmentations with which the proposed method starts (majority vote), Figure 4(b) the estimated segmentations obtained by the proposed method, and Figure 4(c) the ground truth annotated by the expert. The topologies of the initial segmentations, obtained by majority voting, do not agree with those of the ground truth, as indicated by the red circles. The final estimated segmentations, on the other hand, are topologically equivalent and geometrically similar to the ground truth, with minor boundary localization differences. Using the same set of input segmentations and warping error
Fig. 4. Comparison of the topologies of the initial segmentations, the estimated segmentations, and the ground truth. (a) The initial segmentations (majority vote) with which the proposed method starts. (b) The estimated segmentations produced by the proposed method. (c) Ground truth. The topologies of the initial segmentations do not agree with those of the ground truth, as indicated by the red circles. The estimated segmentations, on the other hand, are topologically equivalent and geometrically similar to the ground truth.
as the evaluation metric, a quantitative comparison of the results obtained by majority voting, STAPLE, and the proposed method is shown in Table 1. Topological errors exist in the results obtained by majority voting and STAPLE because these two methods fuse segmentation labels at the pixel level. The proposed method, on the contrary, obtains topologically correct segmentations as long as the topologies of the majority of the alternative segmentations are correct. Also note that, in this experiment, the proposed method used the majority voting results as the initial estimated segmentations and gradually modified their topologies until convergence; its final estimated segmentations are topologically equivalent to the true segmentations even though it started from segmentations containing topological errors.
Table 1. Comparison of the number of topological errors committed by majority voting, STAPLE, and the proposed method on 10 different samples from the EM data set. Topological errors exist in the results obtained by majority voting and STAPLE, whereas the proposed method obtains topologically correct segmentations as long as the topologies of most of the alternative segmentations are correct.

Sample #         1  2  3  4  5  6  7  8  9  10
Majority voting  6  4  8  2  2  2  0  2  6  5
STAPLE           2  4  3  5  2  0  3  2  3  2
Proposed method  0  0  0  0  0  0  0  0  0  0
5 Conclusion
We presented a novel pooling method that seeks a segmentation that is topologically equivalent and geometrically similar to the true, unknown segmentation when a set of alternative segmentations is available. This method is effective for noisy EM images because it maximizes the topological agreement among segmentations during the estimation process and ensures that the result is topologically correct, which is important for connection estimation in connectomics research. Experimental results have demonstrated the effectiveness of this method.
References
1. Warfield, S.K., Zou, K.H., Wells, W.M.: Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation. IEEE Trans. Med. Imaging 23, 903–921 (2004)
2. Artaechevarria, X., Muñoz-Barrutia, A., de Solorzano, C.O.: Combination strategies in multi-atlas image segmentation: Application to brain MR data. IEEE Trans. Med. Imaging 28, 1266–1277 (2009)
3. Coupé, P., Manjón, J.V., Fonov, V., Pruessner, J., Robles, M., Collins, D.L.: Patch-based segmentation using expert priors: Application to hippocampus and ventricle segmentation. NeuroImage 54, 940–954 (2011)
4. Jain, V., Seung, H.S., Turaga, S.C.: Machines that learn to segment images: a crucial technology for connectomics. Current Opinion in Neurobiology 20, 653–666 (2010)
5. Jain, V., Bollmann, B., Richardson, M., Berger, D.R., Helmstaedter, M.N., Briggman, K.L., Denk, W., Bowden, J.B., Mendenhall, J.M., Abraham, W.C., et al.: Boundary learning by optimization with topological constraints. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2488–2495 (2010)
6. McGuinness, K., O'Connor, N.E.: A comparative evaluation of interactive segmentation algorithms. Pattern Recognition 43, 434–444 (2010)
7. Cardona, A., Saalfeld, S., Tomancak, P., Hartenstein, V.: TrakEM2: open source software for neuronal reconstruction from large serial section microscopy data. In: Proc. High Resolution Circuits Reconstruction, pp. 20–22 (2009)
8. Cardona, A., Saalfeld, S., Preibisch, S., Schmid, B., Cheng, A., Pulokas, J., Tomancak, P., Hartenstein, V.: An integrated micro- and macroarchitectural analysis of the Drosophila brain by computer-assisted serial section electron microscopy. PLoS Biol. 8, e1000502 (2010)
Segmentation and Cell Tracking of Breast Cancer Cells

Adele P. Peskin¹, Daniel J. Hoeppner², and Christina H. Stuelten²
¹ NIST, Boulder, CO 80305
² NIH, Bethesda, MD 20892
Abstract. We describe a new technique to automatically segment and track the cell images of a breast cancer cell line in order to study cell migration and metastasis. Within each image observable cell characteristics vary widely, ranging from very bright completely bounded cells to barely visible cells with little to no apparent boundaries. A set of different segmentation algorithms are used in series to segment each cell type. Cell segmentation and cell tracking are done simultaneously, and no user selected parameters are needed. A new method for background subtraction is described and a new method of selective dilation is used to segment the barely visible cells. We show results for initial cell growth.
1 Introduction
The ability of cells to migrate varies widely during development and the life of multicellular organisms. For example, cell migration is crucial during the early stages of development when organs are formed. During adulthood, a wide array of immune cells can migrate at speeds that can reach more than one cell length per minute. These cells travel over wide distances to sites of inflammation or injury to precipitate the host's immune response. On the other hand, epithelial cells, which line cavities and surfaces inside the body, are characterized by tight contact with neighboring cells and typically do not migrate over large distances. Cell- and organ-specific regulation of the migratory behavior of cells is crucial under both physiological and pathological conditions. In normal epithelia, such as breast epithelium, cell migration is restricted, but in tumor disease these transformed epithelial cells acquire mutations that allow them to migrate into the surrounding connective tissue (invasive tumor) and then to distant sites to establish metastatic disease [1]. Therefore, investigating the migratory behavior of epithelial and tumor cells may enable us to find new ways to control migration of tumor cells to distant sites and reduce metastasis. Cell migration can be studied by visualizing cell responses through the capture of time-lapse images over long periods of time. Quantitative analysis of these images is performed by tracking the position of individual cells as a function of time, which is labor-intensive and can only be done manually for a few selected
This contribution of NIST, an agency of the U.S. government, is not subject to copyright.
cells. Thus, current ways to analyze cell migration often generate data of limited statistical value. To robustly analyze the migration of cells over a time range of days, an image analysis program would ideally be able (a) to segment images and identify cells, (b) to individually name cells, (c) to assign cells in one image to cells in the preceding and succeeding images, that is, to identify the same cell in sequential images although this cell may change location, size and shape, (d) to identify whether a cell dies, and (e) to notice whether a cell divides and identify the offspring as new cells related to the parental cell. We developed a segmentation method specific for the cell line studied here, MCF10CA1a cells [2]. Many segmentation methods exist in the literature: edge-based methods, methods that minimize certain energy functions [3,4], and methods that combine different techniques, such as the coarse-to-fine iterative segmentation of 3T3 cells [5]. Several methods determine cell boundaries by tracking cell movements over time, either by tracking cell trajectories [6] or by feature-specific shape change detection [7]. Ersoy et al. [8] propose a flux-tensor-based method that detects cell movements. The images from this cell line present segmentation problems that are not addressed by these solutions: cell edges are not clearly defined and cell shapes are not consistent from one image to the next, so methods that depend on these features are not useful for this cell line. Other cell tracking methods fail to accurately track both the bright round cells and the barely visible cells. In this paper, we discuss a method that tracks the combination of diverse, often fast-moving, cells of this breast cancer line. The parameters used in the segmentation method are based on the statistics of image intensities, and no user input is required.
2 Tissue Culture
Breast cancer cells (MCF10CA1a cells) were cultured as described previously [9,10]. Briefly, cells were grown in DMEM/F12 supplemented with 5 % horse serum (both Invitrogen, Carlsbad, CA) in a humidified atmosphere at 5 % CO2. For experiments, cells were trypsinized and resuspended in a 1:1 mixture of DMEM/F12 supplemented with 5 % horse serum and DMEM, low glucose, supplemented with 10 % fetal bovine serum (Invitrogen). 25000 cells were plated into each dish (total volume: 3 ml). Cells were allowed to adhere overnight and then transferred into the incubator chamber of the microscope for monitoring cell migration.¹
3 Microscopy
Cultured cells were imaged continuously in a 5 % CO2 environment (Zeiss CO2 module S). Temperature was regulated to 37 °C ± 0.1 °C by use of an AirTherm
¹ Certain trade names are identified in this report in order to specify the experimental conditions used in obtaining the reported data. Mention of these products does not constitute an endorsement of them.
heater (WPI). Cells were imaged by use of a Zeiss AxioObserver microscope with a 40x, 1.3 NA oil-immersion objective and Cargille 37 immersion oil. DIC images were captured with an Orca EM (Hamamatsu) CCD camera every 2 min and saved as .zvi files at 512x512 pixels and 16 bit depth. Original files were converted to .tif series using ImageJ with no post-hoc file manipulation. Multiple areas of the coverslip were imaged simultaneously using a computer-controlled motorized stage (Ludl Mac5000) with a linear encoder. Axiovision (Zeiss, Version 4.6) software was used to control all hardware components. To minimize photo toxicity, 15 ms exposure times were achieved using a shutter (UniBlitz). Sample images can be found at: http://www.nist.gov/itl/math/hpcvg/cellimgvis.cfm.
Fig. 1. Early image
4 Background Removal
Background values for live cell images have been computed in a number of different ways, from taking the mean value or weighted mean of pixel intensities at each location [11,3], to modeling pixel intensities over time at each location with Gaussian distributions representing cell and background [12], to top-hat filtering [4]. The goal for this segmentation is to perform cell tracking simultaneously with the segmentation, so we chose a method that can be applied to each image without knowledge of previous background information. Background removal needs to be highly accurate in order to track the barely visible cells. To start, a mean value is found for the image as a whole, and all pixels with an intensity higher than three standard deviations above the mean are temporarily replaced by the mean value to smooth out the image. The background value for each individual pixel is then determined as the average value in a small neighborhood of that pixel. We experimented with different-sized neighborhoods, from 3 to 20 pixels in each direction. A neighborhood of ten pixels in each direction was used, which produced a smooth background without removing features of the image. Figure 1 shows an image at the start of growth in new cell medium, a 16-bit 508 × 496 pixel image (intensity units are on a scale of 0–65536).
Fig. 2. Sample image color-coded before and after background removal. Color table shows intensities before removal.
Background removal for this image is displayed in Figure 2. The presence of the barely visible cells in the resulting image indicates the success of this technique.
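A compact sketch of this background-removal step is given below; the uniform-filter boundary handling and the exact clipping rule are assumptions on our part.

```python
import numpy as np
from scipy import ndimage

def remove_background(image, clip_sigmas=3.0, radius=10):
    """Background subtraction as described above: clip very bright pixels to the
    image mean, estimate the local background as the mean over a
    (2*radius+1)^2 neighborhood, and subtract it from the original image."""
    img = image.astype(np.float64)
    mean, std = img.mean(), img.std()
    clipped = np.where(img > mean + clip_sigmas * std, mean, img)
    background = ndimage.uniform_filter(clipped, size=2 * radius + 1, mode='nearest')
    return img - background
```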
5 Initial Segmentation
The initial images in a time sequence contain cells of fairly similar size, although even at this stage several quite different cell morphologies are seen. Figure 3 shows close-up pictures of a few of the cells from Figure 1, showing both cells with clearly defined boundaries and cells that are barely visible in the image, with large gaps in any apparent boundary. The overall approach is first to capture the dark cell boundaries that define the brighter cells, and then to combine selective dilation of the fainter bright areas that mark the barely visible cells with a measure of local pixel intensity standard deviation, which also indicates faint areas of the image that represent cells. Dilation in this paper refers to enlarging a set of selected pixels by including pixels directly above, below, to the left or to the right of a selected pixel. Erosion refers to shrinking a set of selected pixels by eliminating pixels that have an unselected pixel above, below, to the left or to the right. To begin, we define several overall patterns within the image, starting with the areas of high standard deviation. We want to capture cell features on the scale of a dark boundary or a bright shadow inside a dark boundary.
Fig. 3. Sub-sections of the image in Figure 1, showing different cell morphologies
Most of these features are represented by 3–10 pixels in the image. We experimented with neighborhoods varying between 3 and 15 pixels in each direction, looking for a pattern that best resembled the visual pattern of the cells. For each pixel, the intensities in a neighborhood of five pixels in each direction are collected, and a mean and standard deviation are assigned to that pixel. These per-pixel standard deviations are then summarized by their overall mean (om) and their overall standard deviation (osd). Figure 4 shows the pixels whose individual standard deviation values are above the overall mean plus one overall standard deviation (om + osd).
Fig. 4. Pixels with high standard deviation values
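The per-pixel statistics described above can be computed with two box filters, as in the following sketch; the boundary handling is an assumption.

```python
import numpy as np
from scipy import ndimage

def local_std_map(image, radius=5):
    """Per-pixel mean and standard deviation over a (2*radius+1)^2 neighborhood,
    plus the overall mean (om) and spread (osd) of the pixel standard deviations."""
    img = image.astype(np.float64)
    size = 2 * radius + 1
    local_mean = ndimage.uniform_filter(img, size=size, mode='nearest')
    local_sq_mean = ndimage.uniform_filter(img * img, size=size, mode='nearest')
    local_std = np.sqrt(np.maximum(local_sq_mean - local_mean ** 2, 0.0))
    om, osd = local_std.mean(), local_std.std()
    high_std = local_std > om + osd          # the pixels plotted in Fig. 4
    return local_mean, local_std, om, osd, high_std
```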
The dark areas defining cell borders are found in a series of steps shown in Figure 5. We look for the lowest possible intensity cutoff that leaves complete cell boundary sections intact and choose all pixels below -100 intensity units after background removal. We then remove small clusters of dark pixels: most of the dark pixels of the background occur in small clusters of 10–15 pixels or fewer, so all remaining clusters of 15 or fewer pixels are eliminated. The remaining pixels are dilated to fill in any small gaps in the cell boundaries. The remaining dark pixels that are not part of cell boundaries all lie in small clusters, and eliminating clusters of fewer than 50 pixels was found to leave all of the cell boundaries intact. The result is shown in the center picture of Figure 5. The last step is to eliminate all remaining clusters that do not coincide with the high-standard-deviation clusters of Figure 4. This last figure clearly shows that many of the cells are missing at least part of their boundary definition. Many of the cells in these images are dead cells and can be identified by a round appearance, a bright interior, and a dark boundary. These cells are important to identify, as death is a normal terminal fate in cell development and differentiation. Some of the round cells are cells that are not yet attached to other cells. However, it is not important to track the random movement of the dead cells once they come in contact with other cells. Many of the dead cells have brighter-than-average pixels scattered just inside the boundary. To enhance these bright regions, pixels are selected that are at least two standard deviations above the mean value for the whole image.
Fig. 5. Steps to finding the dark cell borders
are seen in very small clusters of 3 or more pixels near cell borders. To enhance these bright regions, we dilate only the bright pixels in these clusters, and not the bright pixels scattered elsewhere. We evaluate the shapes of the resulting clusters of bright pixels, and the round clusters, as well as those completely surrounded by the dark regions, are initially assigned cell numbers. The resulting cells for the above image are shown in Figure 6.
Fig. 6. Bright round cells are assigned first
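The dark-boundary extraction steps of Figure 5 can be sketched as below, assuming a background-removed intensity image (so the -100 cutoff applies) and the high-standard-deviation mask from the earlier sketch. The cluster-size cutoffs (15 and 50 pixels) follow the text; the helper names are ours and this is not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import label, binary_dilation

def remove_small_clusters(mask, min_size):
    """Drop connected components smaller than min_size pixels."""
    labels, n = label(mask)
    sizes = np.bincount(labels.ravel())
    keep = sizes >= min_size
    keep[0] = False                          # never keep the background label
    return keep[labels]

def dark_boundary_mask(img_bg_removed, high_std):
    """Steps of Figure 5: threshold, prune, dilate, prune, keep high-std clusters."""
    dark = img_bg_removed < -100
    dark = remove_small_clusters(dark, 16)   # eliminate clusters of 15 pixels or fewer
    dark = binary_dilation(dark)             # 4-connected dilation to close small gaps
    dark = remove_small_clusters(dark, 50)   # eliminate remaining small clusters
    # keep only clusters that coincide with the high-standard-deviation pattern of Figure 4
    labels, _ = label(dark)
    touched = np.unique(labels[high_std & (labels > 0)])
    return np.isin(labels, touched)
```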
The next step is to identify the cells that do not have clear cell boundaries on the image. Potential cell pixels are selected by looking for a combination of pixels that are at least slightly above the mean pixel intensity, and also pixels whose standard deviation value is at least the mean standard deviation value. Several steps to assemble these pixels are shown for this example image in Figure 7. On the image these cells are recognized by the human eye because they are slightly brighter than the background. So we collect pixels slightly above the mean image intensity value (a value of 100 is used as a cutoff because it is approximately 1/3 of a standard deviation higher). Again to pick up the small clusters and leave out the background noise, we selectively dilate pixel clusters of greater than 3 pixels. Then pixels with standard deviation values less than the mean standard deviation are eliminated. Remaining pixels are dilated again, but cannot grow into the dark regions. Of the remaining pixels, we look at regions that have not been previously assigned to a cell above, and keep clusters bigger than 100 pixels, a cutoff for the size of a cell. We use the pattern of the pixel standard deviations from Figure 4 to understand which of the resulting pieces are part of the same cell. The pixels with the very highest standard deviation values are either part of the
Fig. 7. Pixels with intensity above 100 are identified and clusters greater than 3 are dilated. Resulting pixels are kept if the standard deviation is at least the mean standard deviation of the image (om).
Fig. 8. Clusters larger than 100 pixels (left) are combined according to regions of very high standard deviation (center), and the final initial segmentation is shown (right)
dark border of a cell and its surrounding pixels, or part of the very brightest cell pieces. If two clusters of this set fall along the same high standard deviation region, they are probably both part of the same cell. The very high standard deviation pixels, defined by pixels whose standard deviation is at least the mean (om) plus two overall standard deviations (2*osd) are shown in Figure 8, along with the larger than 100 cluster pieces and the final grouping of these pieces. A few final steps check the resulting cells, fill any small holes, and eliminate cells that are too small.
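The assembly of the faintly visible cells (Figures 7 and 8) can be sketched as below. The intensity cutoff of 100, the selective dilation of clusters larger than 3 pixels, the mean-standard-deviation filter, and the 100-pixel cell-size cutoff come from the text; the final grouping of pieces by very-high-standard-deviation regions is omitted here, and the argument names (sd_map, dark, assigned) are our own assumptions.

```python
import numpy as np
from scipy.ndimage import label, binary_dilation

def faint_cell_clusters(img, sd_map, dark, assigned, min_cell=100):
    """Collect candidate faint-cell pixels as in Figure 7, keeping clusters larger than a cell."""
    bright = img > 100                                   # slightly above the mean image intensity
    labels, _ = label(bright)
    sizes = np.bincount(labels.ravel())
    seed = (sizes > 3)[labels] & bright                  # only clusters of more than 3 pixels
    grown = binary_dilation(seed)                        # selective dilation of those clusters
    grown &= sd_map >= sd_map.mean()                     # drop pixels below the mean local std (om)
    grown = binary_dilation(grown) & ~dark               # dilate again, but not into dark regions
    grown &= ~assigned                                   # ignore pixels already assigned to a cell
    labels, _ = label(grown)
    sizes = np.bincount(labels.ravel())
    return (sizes > min_cell)[labels] & grown            # keep clusters bigger than 100 pixels
```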
6 Simultaneous Segmentation and Cell Tracking
Once an initial segmentation is performed, subsequent images can use information from the previous image to fill in gaps in cell boundaries, where they may not be present in an image. The beginning steps, defining a standard deviation value for each pixel, defining dark regions where they are present around the cells, and finding the brightest, round cells, are the same as those described above. Cell numbers are assigned to the bright round cells according to their location and size. A cell that moves significantly between images is matched by brightness and size across steps. As an example, Figure 9 shows a section of three consecutive images, where one of the cells is seen to be moving rapidly towards the bottom of the image. Once the bright round cells are numbered, the rest of the cells are then defined by pixel intensity standard deviation, areas brighter than the mean, sections of dark boundaries, and boundaries markers from a previous image. The steps are
Fig. 9. Section of three consecutive images showing one of the cells (noted by the arrows) moving rapidly towards the bottom of the image and the corresponding sections of the segmentation masks
shown below. Figure 10 first shows the initial cell numbering for the bright, round cells on the left. Pixels representing the bright high standard deviation areas, or pixels associated with a cell in a previous image are shown on the right. As described above, pixels with intensities greater than 100 are collected, and clusters of greater than three pixels are selectively dilated. The resulting pixels are then kept if they have higher than the mean value of standard deviation, or if they are at locations of pixels from cells of the previous image. At this stage the clusters are split by the available dark boundaries and numbered. Clusters of pixels that are associated with a single cell from the previous image are assigned to that cell number. Clusters of pixels associated with more than one previous cell are eroded so that they split into several sections, each associated with separate cells or at least fewer cells. The eroding process is repeated if necessary.
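The text above does not spell out how a fast-moving bright cell is "matched by brightness and size across steps"; the following is one simple greedy matching consistent with that description. The cost weights and the property tuples (centroid, size, mean brightness) are purely illustrative assumptions, not the authors' rule.

```python
import numpy as np

def match_cells(curr_props, prev_props, w_dist=1.0, w_size=0.5, w_bright=0.5):
    """Greedy assignment of current bright round cells to previous-frame cells.

    Each props entry is (centroid_y, centroid_x, size, mean_brightness).
    Returns a dict mapping current-cell index -> previous-cell index.
    """
    assignments, used = {}, set()
    for i, (cy, cx, sz, br) in enumerate(curr_props):
        candidates = []
        for j, (py, px, psz, pbr) in enumerate(prev_props):
            if j in used:
                continue
            cost = (w_dist * np.hypot(cy - py, cx - px)
                    + w_size * abs(sz - psz)
                    + w_bright * abs(br - pbr))
            candidates.append((cost, j))
        if candidates:
            _, j = min(candidates)
            assignments[i] = j
            used.add(j)
    return assignments
```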
Fig. 10. Numbered round bright cells (left), pixels that are brighter than average, have a high standard deviation, or are in locations of pixels from a previous cell image (right)
Fig. 11. Clusters of pixels associated with more than one cell. Several of these clusters are eroded to define separate cell clusters.
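A simplified sketch of the erosion-based splitting of clusters that overlap more than one cell of the previous image: here each ambiguous cluster is eroded until it overlaps at most one previous cell, which is a coarser rule than the section-by-section reassignment described above. prev_labels is assumed to be an integer label image from the previous frame.

```python
import numpy as np
from scipy.ndimage import label, binary_erosion

def split_ambiguous_clusters(mask, prev_labels, max_iters=10):
    """Erode clusters that overlap more than one previous cell (simplified)."""
    out = np.zeros_like(mask, dtype=bool)
    comps, n = label(mask)
    for i in range(1, n + 1):
        piece = comps == i
        for _ in range(max_iters):
            hits = np.unique(prev_labels[piece & (prev_labels > 0)])
            if hits.size <= 1 or not piece.any():
                break                     # now associated with at most one previous cell
            piece = binary_erosion(piece)  # repeat the eroding process if necessary
        out |= piece
    return out
```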
Fig. 12. Images and resulting masks at 0, 10, 20, and 30 minutes
Fig. 13. Dividing cell at center of section of images taken two minutes apart, and the resulting segmentation masks
Figure 11 shows the clusters that include more than one cell, and then a final eroded set of clusters. There are a few final steps where accuracy is checked. Bright shadows under dark regions that look like potential new cells are eliminated. Gaps in cell boundaries that are overestimated in areas with bright shadows are redefined using the cell boundaries of the previous image. Cells that are less than 30 pixels are eliminated. Figure 12 shows the segmentation as a function of time for the first 30 minutes of cell growth. The images and resulting masks are given at 10-minute intervals. This segmentation method can keep track of mitosis events, with an example shown in Figure 13, where a section of images taken two minutes apart shows a dividing cell, and the corresponding segmentation masks also show that event.
7 Conclusions
For initial growth of this breast cancer cell line in culture, our segmentation/cell tracking program is capable of handling the five required tasks to robustly analyze migration of cells over time: segmenting and identifying cells, individually
naming cells, assigning cells of one image to corresponding cells in the preceding and succeeding images, identifying dead cells, and identifying mitosis events. For short periods we are able to track every cell in our test set of images. The purpose of this paper is to present new techniques for the challenging segmentation issues this cell line presents. Ground-truth data for this cell line is painstaking to collect for all of the cells, but future work will include larger scale tests with this method and comparisons with limited manually collected data. Future work will also include cell shape analysis to maintain high accuracy for longer periods. Because we track more than just a center cell location, we are able to analyze size and shape change information from these data over time as well, which will provide further insight into the biology of these cells. Acknowledgements. This research was supported in part by the Intramural Research Program of the National Cancer Institute, NIH, Bethesda, USA.
Registration for 3D Morphological Comparison of Brain Aneurysm Growth
Carl Lederman1, Luminita Vese1, and Aichi Chien2
1 Department of Mathematics, University of California, Los Angeles, Los Angeles, CA
2 Division of Interventional Neuroradiology, Department of Radiological Sciences, University of California, Los Angeles, Los Angeles, CA
[email protected]
Abstract. Recent advancements in 3D imaging technology have helped the early detection of brain aneurysms before aneurysm rupture. Developing management strategies for aneurysms has been an active research area. Because some unruptured aneurysms are followed up with medical images over years, there is an immediate need for methods to quantitatively compare aneurysm morphology to study the growth. We present a novel registration method which utilized the volumetric elastic model specifically for this medical application. Validations to test the accuracy of the algorithm using phantom models were performed to determine the robustness of the method. Examples of the medical application using aneurysm images are shown and compared with their clinical presentation.
1 Introduction
Recent advancements in 3D imaging technology have helped the early detection of brain aneurysms before aneurysm rupture. Because of the high risks of mortality and morbidity related to aneurysm rupture, developing analysis strategies to evaluate the risk of rupture has been an active research area [1]. Many brain aneurysms are currently followed using medical images over years to monitor the growth without treatment. Recent studies have suggested that shape changes in these aneurysms may be an indication of increasing rupture risk and may need treatment to prevent rupture [2-3]. There is an immediate need for methods to analyze aneurysm growth, both for quantitative comparison of aneurysm geometry and to identify the location of growth to help treatment planning. Image registration and morphometry generally refer to morphing one image until it resembles another image and analyzing the deformation that occurs during the morphing. This process can be used as a new approach to study the disease progression, especially for the type of disease which is related to geometrical changes of an organ or tissue, for example brain aneurysms. Related work for brain aneurysm registration was presented by Byrne et al., who registered a high quality 3D aneurysm image with a lower quality 2D image to assist surgical planning [4]. Craene et al. also presented a method to analyze the surgical outcome by comparing images of an aneurysm treatment coil at different time points using non-rigid registration [5]. Although
medical applications for image registration have been well-developed, suitable algorithms to study aneurysm growth are not available because the 3D registration of the entire aneurysm involves the registration of complex brain vessels around the aneurysm [6]. We present a new algorithm which can be applied to both aneurysms and complex vessel structure in the registration process. Our approach incorporates the elastic energy model in the registration and uses the surrounding blood vessels as the reference to orient the aneurysm and achieve alignment of the aneurysm (the round shape) and the blood vessel (the tubular shape).
2 Methods

2.1 Image Segmentation and Mesh Generation
Standard clinical 3D angiographic images of brain aneurysms were used in this study. Images were acquired by digital subtraction angiography (0.34×0.34×0.34 mm³) and computed tomographic angiography (0.35×0.35×0.5 mm³). The region-based binary segmentation method was applied to the 3D images with equal weights for the contrast region (blood vessel and aneurysm) and the background region [7]. The contrast region is assumed to be equal to 1 and the rest of the area is equal to 0 for the segmentation (Fig. 1 (a), (b)). Based on the segmented 3D image, the tetrahedral mesh was generated for the entire blood vessel [8] (Fig. 1 (c)).

2.2 Volumetric Registration
The registration process involves deforming a tetrahedral mesh by moving the nodes from an initial state X to a deformed state x [9]. We first define v = x − X, so the registration is achieved by finding a v that minimizes an energy functional G:

G(v) = A(v) + E(v)    (1)
G is the sum of two terms. The first is a data fidelity term using level sets, A, that deforms the mesh based on the geometry and the target image. The second term is a volumetric elastic energy E. By introducing an artificial time and morphing the mesh at each time step, we can reduce the energy function G. This formulation creates an interaction between the two terms (A(v) and E(v)) during the morphing which does not occur in the registration method based on a fixed vector field [10]. Since a successful registration yields the morphed mesh which completely covers the target vessel, we define M (Equation (2)) to measure how well the mesh is morphed.

M = \frac{\int H(\varphi_0(x - v(x)))\, I}{\int I} \in [0, 1]    (2)
Fig. 1. The process of the proposed method including the segmentation, generation of triangular mesh, and registration using morphing (bold arrows denote the aneurysm)
In Equation (2), H is the one-dimensional Heaviside function, and I represents the target image. \varphi_0(x) is the distance to the boundary of the initial tetrahedral mesh generated from the initial image. If the two sets of images are perfectly matched, M is 1. Therefore, our proposed matching energy function A can be expressed as

A(v) = \int \delta(I - 1)\,(1 - H(\varphi_0(x - v(x)))) + M \int (1 - \delta(I - 1))\, H(\varphi_0(x - v(x))) + (1 - M) \int (1 - \delta(I - 1))\,(1 - H(\varphi_0(x - v(x))))    (3)
where δ is the Dirac delta function. A related model was proposed by Le Guyader et al. [11] which introduced the nonlinear elastic smoother for a non-smooth 2D surface. In contrast, the energy A in our model includes the exact values of the Heaviside and Delta functions and is preferable for the problem: a smooth 2D boundary of a 3D object, the characteristic of blood vessel geometry. Since minimizing A may result in large distortions, adding a regularization term E becomes important to provide a reasonable elastic response when large deformations occur. We implemented Mooney-Rivlin elasticity and calculated E directly on each tetrahedral element using the initial state, X, and final state x, by Equation (4), where F = dx/dX [12].
\nabla_x v = I - F^{-1}    (4)
The elasticity is defined in terms of rotationally invariant measures of the deformation using the right Cauchy-Green deformation tensor C = F^T F:
J_1^C = \mathrm{tr}(C), \quad J_2^C = \frac{1}{2}\left(\mathrm{tr}(C)^2 - \mathrm{tr}(C^2)\right), \quad J_3^C = \det(C)    (5)
The total elastic energy is obtained by taking an integral over all tetrahedral elements:
W = k_1 J_1^C + k_2 J_2^C + g(J_3^C), \qquad E(v) = \int W(\nabla v)    (6)
where g(x) is a quadratic function. The advantages of introducing the elastic energy include (1) helping to prevent undesirable large deformation due to drastic local shape changes and improve the morphing process. For example, when one part of the morphed mesh is in the correct location, it can pull nearby parts of the mesh into their correct locations based on elasticity. (2) as more of the morphed mesh matches the
target region, the elastic energy can further improve the matching of the morphed mesh without changing A. Although an easy way to reduce the total energy G at each time step is by implementing the standard gradient descent using the L2 inner product (Equation (7)), the standard gradient descent does not allow large time steps; therefore, longer computational time is needed for complicated vessel registration.
\frac{dv}{dt} = -\nabla_{L^2} G(v)    (7)
To speed up the computation for the clinical application, we applied the Sobolev gradient descent to reduce the total energy [13] (Equation (8)).
\frac{dv}{dt} = -\nabla_{H^1} G(v) = -(1 - K\Delta)^{-1} \nabla_{L^2} G(v)    (8)
where K is a positive constant. This update only requires solving a linear system at each time step and allows a much larger time step size than the standard gradient descent.
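To make Equation (8) concrete, the sketch below applies the Sobolev preconditioner (1 − KΔ)⁻¹ to an L2 gradient field, component by component, on a regular grid using the FFT diagonalization of the Laplacian (periodic boundaries). The paper works on a tetrahedral mesh and only states that a linear system is solved, so this is an illustration of the idea rather than the authors' implementation.

```python
import numpy as np

def sobolev_smooth(grad, K=1.0):
    """Apply (1 - K*Laplacian)^(-1) to one scalar component of the L2 gradient."""
    grad = np.asarray(grad, dtype=np.float64)
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in grad.shape], indexing="ij")
    # eigenvalues of the negative discrete Laplacian for each frequency
    lap_eig = sum((2.0 * np.pi * f) ** 2 for f in freqs)
    return np.real(np.fft.ifftn(np.fft.fftn(grad) / (1.0 + K * lap_eig)))

def sobolev_step(v, l2_grad, dt=0.5, K=1.0):
    """One descent step dv/dt = -grad_{H1} G(v) from Equation (8)."""
    return v - dt * sobolev_smooth(l2_grad, K)
```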
3 Validations Using Phantom Models
Synthetic data of phantom vessels were generated to test the robustness of the algorithm for the situation of vessel rotation and distortion. Because it is common to have slightly different head positions in clinical follow up images, we created an initial vessel (Fig 2 (a, left)) and target vessels with different rotation angles (Fig 2 (a, right)) to test how well the algorithm works in rotated images. The follow-up images may be taken a few years after the initial images. Sometimes the vessel anatomy changes due to the disease and causes distortion of the angle between two connecting vessels. Therefore, it is important to know how well our approach can be applied to distorted vessel data. To test different degrees of distortion, we also created an initial vessel (Fig 3 (a), left) and vessels with different distortion angles to define the target vessels (Fig 3 (a), right). Using the proposed registration algorithm, we were able to obtain the morphed vessels. Fig 2 (a, middle) and Fig 3 (a, middle) show the 2D slices of the morphed vessels for the tests of rotation and distortion, respectively. To compute how well the morphed vessel and target vessel match, the Dice coefficient was used to quantify the similarity [14]. The Dice coefficient is defined as
D(I, B) = \frac{2|I \cap B|}{|I| + |B|} \in [0, 1]    (9)
where I is the target vessel and B is the morphed vessel. The Dice coefficients for different degrees of rotation and distortion of the target vessels are shown in Fig 2 (b) and Fig 3 (b).
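Equation (9) for two binary masks is straightforward to compute; a small sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def dice_coefficient(target, morphed):
    """D(I, B) = 2|I ∩ B| / (|I| + |B|) for two binary masks (Equation (9))."""
    target = np.asarray(target, dtype=bool)
    morphed = np.asarray(morphed, dtype=bool)
    denom = target.sum() + morphed.sum()
    return 2.0 * np.logical_and(target, morphed).sum() / denom if denom else 1.0
```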
Fig. 2. Numerical experiments using synthetic vessel data with given rotation on the target image. (a) shows the 2D slice of the initial, morphed, and target synthetic images. (b) shows how the Dice coefficient changes for different rotation angles. (c) is the representative result of the registration for the vessel rotated at 10 degrees. The gray scales are the triangle area of the morphed mesh element divided by the area of the initial mesh element.
Fig. 3. Numerical experiments using synthetic vessel data with given distortion on vessel angles on the target. (a) shows the 2D slice of the initial, morphed, and target images. (b) shows how the Dice coefficient changes for different distortion angles. (c) is the representative result for the vessel with 20-degree distortion (corresponding to the dashed square and arrow in (a) and (b)). The grayscales are the area of the morphed mesh element divided by the area of the initial mesh element.
The matching results show that our method is adequate for the vessel geometry. It should be noted that although a target vessel rotated 20 degrees from the original position marks the limit of the proposed method (Dice coefficient = 0.55, Fig 2 (a) dotted square, (b) dotted arrow), such rotation is beyond what can occur in brain aneurysm image data, because aneurysm images are acquired when the patient is lying down (face up) and the normal head rotation at that position is less than 10 degrees. Therefore, we found that for the application to analyze aneurysm growth, the proposed method can provide reliable results with a high Dice coefficient. We also examined the changes in triangle areas in the synthetic phantom vessels. Changes of triangle area can be an indication of aneurysm growth. Because the phantom tests involve only changes in angles and no changes in shape, we expected no area changes in the phantom vessel. Gray scales are used to represent the ratio of the triangle area of the morphed mesh to the triangle area of the initial mesh. Representative models are shown in Fig 2 (c) and Fig 3 (c). The uniform gray values on the vessels confirm that the algorithm correctly computed that there are no shape changes in the morphed vessels.
4 Registration of Clinical Aneurysm Images and Examinations of the Growth
High resolution 3D images of brain aneurysm geometry acquired during patients' clinical examinations were analyzed using the proposed registration method. Each aneurysm has two sets of 3D images: an initial image and a follow-up image (target image). The morphed mesh is used to generate a visualization that shows changes in an aneurysm over time based on these two sets of images. The ratio of the triangle area of the morphed mesh to the triangle area of the initial mesh is shown on a gray scale ranging from 1 (black) to 2 (white). A higher ratio corresponds to an increase in size. Fig. 4 shows the result of registration for three aneurysms. The locations of aneurysms are denoted by arrows. In case A, there is no overall difference in either the large blood vessel or aneurysm, suggesting there is no growth at the aneurysm. This result is validated by the clinical report. In cases B and C, we observed clear changes in aneurysm size, denoted by the noticeably more white areas at the aneurysms. This finding is in agreement with the clinical record. We also observed a wide range of deformation at areas with smaller blood vessels in all cases. This is due to the error generated during the segmentation process because those small vessels have diameter less than one or two voxels. These distortions, however, appear to have no effect on registration and morphing in the large blood vessel and aneurysm. High resolution 3D images for brain vessels have become broadly used in clinical practice in recent years. There will be an increasing amount of aneurysm growth data to be collected and analyzed. We demonstrated the feasibility of the proposed method using three sets of clinical patient data. In the future, more patient data should be tested to further validate and improve the proposed algorithm.
Fig. 4. The gray scales give the triangle area in the morphed mesh divided by the area in the initial mesh. The aneurysms are indicated by arrows. The aneurysm in case A shows no growth. Aneurysms in cases B and C show growth. The results concur with the clinical reports.
5 Conclusion
We proposed a novel registration method to study aneurysm growth. This method matches 3D aneurysm images which were acquired at two different times from the same patient. We presented the validation study in phantom vessels and showed the medical application of this method. In all three presented patient cases, we were able to obtain and visualize the correct geometrical changes in the aneurysms which concurred with the clinical report. It is a feasible method for aneurysm growth analysis and provides valuable information and visualization for clinical aneurysm management.
References [1] Wiebers, D.O., Whisnant, J.P., Huston, J., Meissner, I., Brown, R.D., Piepgras, D.G., Forbes, G.S., et al.: Unruptured intracranial aneurysms: natural history, clinical outcome, and risks of surgical and endovascular treatment. Lancet 362(9378), 103–110 (2003) [2] Chien, A., Sayre, J., Vinuela, F.: Comparative Morphological Analysis of the Geometry of Rupture and Unruptured Aneurysms. Neurosurgery (March 15, 2011) [3] Boussel, L., Rayz, V., McCulloch, C., Martin, A., Acevedo-Bolton, G., Lawton, M., Higa-shida, R., et al.: Aneurysm growth occurs at region of low wall shear stress: patientspecific correlation of hemodynamics and growth in a longitudinal study. Stroke 39(11), 2997–3002 (2008) [4] Byrne, J.V., Colominas, C., Hipwell, J., Cox, T., Noble, J.A., Penney, G.P., Hawkes, D.J.: Assessment of a technique for 2D–3D registration of cerebral intra-arterial angiography. The British Journal of Radiology 77, 123–128 (2004) [5] Craene, M., Pozo, J., Cruz, M., Vivas, E., Sola, T., Leopoldo, G., Blasco, J., et al.: Coil Com-paction and Aneurysm Growth: Image-based Quantification using Non-rigid Registration. In: Medical Imaging 2008: Computer-Aided Diagnosis. Proc. of SPIE, vol. 6915, 69151R, pp. 1605–7422 (2008) [6] Thompson, P.M., Giedd, J., Woods, R.P., MacDonald, D., Evans, A.C., Toga, A.W.: Growth patterns in the developing brain detected by using continuum mechanical tensor maps. Nature 404(6774), 190–193 (2000) [7] Chan, T., Vese, L.A.: Active Contours Without Edges. IEEE Trans. Image Process. 10(2), 266–277 (2001) [8] Lederman, C., Joshi, A., Dinov, I., Vese, L., Toga, A., Van Horn, J.: The generation of tetra-hedral mesh models for neuroanatomical MRI. NeuroImage 55(1), 153–164 (2011) [9] Hughes, T.: The Finite Element Method: Linear Static and Dynamic Finite Element Analysis, pp. 109–182. Prentice Hall, Englewood Cliffs (1987) [10] Christensen, G.E., Rabbitt, R.D., Miller, M.I.: Deformable templates using large deformation kinematics. IEEE Transactions on Image Processing 5(10), 1435–1447 (1996) [11] Le Guyader C., Vese L.: A Combined Segmentation and Registration Framework with a nonlinear Elasticity Smoother. UCLA C.A.M. Report 08-16 (2008) [12] Pedregal, P.: Variational Methods in Nonlinear Elasticity, pp. 1–6. Society for Industrial and Applied Mathematics, Philadelphia (2000) [13] Neuberger, J.: Sobolev Gradients and Differential Equations, pp. 69–74. Springer, Heidelberg (1997) [14] Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26, 297–302 (1945)
An Interactive Editing Framework for Electron Microscopy Image Segmentation
Huei-Fang Yang and Yoonsuck Choe
Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843-3112
[email protected], [email protected]
Abstract. There are various automated segmentation algorithms for medical images. However, 100% accuracy may be hard to achieve because medical images usually have low contrast and high noise content. These segmentation errors may require manual correction. In this paper, we present an interactive editing framework that allows the user to quickly correct segmentation errors produced by automated segmentation algorithms. The framework includes two editing methods: (1) editing through multiple choice and (2) interactive editing through graph cuts. The first method provides a set of alternative segmentations generated from a confidence map that carries valuable information from multiple cues, such as the probability of a boundary, an intervening contour cue, and a soft segmentation by a random walker. The user can then choose the most acceptable one from those segmentation alternatives. The second method introduces an interactive editing tool where the user can interactively connect or disconnect presegmented regions. The editing task is posed as an energy minimization problem: We incorporate a set of constraints into the energy function and obtain an optimal solution via graph cuts. The results show that the proposed editing framework provides a promising solution for the efficient correction of incorrect segmentation results.
1 Introduction
With the significant progress made in the image acquisition process, electron microscopy (EM) instruments are now able to image tissues at a nanometer resolution and thus produce large-scale data volumes, which require image analysis methods for connectomic reconstruction. Therefore, researchers have put in a great amount of effort in developing segmentation algorithms to extract neuronal morphologies from stacks of serial EM images [1,2,3,4,5,6]. A great majority of these existing segmentation algorithms are designed for the automation of the segmentation pipeline. However, fully automated segmentation algorithms sometimes yield incorrect results even when their parameters are optimally tuned, which is mainly because of their failing to capture all variations posed by the data sets to be processed. Thus, erroneous segmentations would inevitably arise, and they may require manual correction.
Whereas much attempt has been made to design segmentation algorithms, relatively little attention has been given to the development of editing tools [7]. Hence, editing software for segmented EM imaging data remains in its infancy [8], and more effort in this area is needed. In an effort to provide editing tools, in this paper, we present an interactive editing framework allowing the user to refine segmentation results with minimal user interaction. The framework includes two editing methods: (1) editing through multiple choice that provides a set of segmentation alternatives from which a user can select an acceptable one and (2) interactive editing through graph cuts that allows a user to edit segmentation results by interactively placing editing strokes. The first editing method aims to minimize user interaction by providing a set of potential segmentations. The editing process begins with combining multiple cues to generate a confidence map that indicates the degree of each pixel belonging to a specific label. By thresholding the confidence map using different values, a pool of segmentations are generated. The user can choose the most acceptable one, if available, among the automatically generated segmentations before starting any manual editing. Allowing the user to select the desired segmentation prior to manual editing will simplify the editing process. The second editing method utilizes recent advances in interactive segmentation algorithms. These algorithms take a few of the user inputs, either scribbles on foreground/background objects [9,10] or contour points belonging to an object’s boundary [11], and produce a segmentation result accordingly. Our editing tools are based on the interactive graph cuts algorithms [9], where the editing task is cast as an energy minimization problem. We incorporate a set of constraints into the energy function and then obtain an optimal solution via graph cuts. The remainder of this paper is organized as follows. Section 2 details the approach for generating the confidence map and a set of potential segmentations. Section 3 elaborates on the editing tool allowing the user to place editing strokes to refine segmentation results. Section 4 offers several results followed by conclusions in section 5.
2 Editing through Multiple Choice
Image segmentation is an ill-posed problem. A particular feature used or parameter choice for a segmentation algorithm strongly affects the quality of segmentations, thus many researchers have considered multiple segmentations for an image for the segmentation task [12]. Inspired by this, we generate a pool of candidate segmentations by thresholding a confidence map at different values. The user thus can choose the correct segmentation, if available, from these generated segmentations.

2.1 Cues for Confidence Map Generation
It is shown in [13] that the visual system integrates different attributes (e.g., luminance, color, motion, and texture) for contour localization because all
Fig. 1 panels: (a) Image, (b) PB, (c) RW, (d) IC(1), (e) IC(2), (f) Conf. map, (g) Seg. 1, (h) Seg. 2, (i) Seg. 3, (j) Seg. 4
Fig. 1. Generation of the confidence map and segmentation alternatives. (a) is the original image. (b), (c), (d), and (e) show the results produced by different cues. (f) is the confidence map generated by a linear combination of different cues. Note that a brighter color indicates a higher probability of belonging to an object. (g), (h), (i), and (j) are segmentations obtained by thresholding the confidence map at 0.4, 0.5, 0.6, and 0.7, respectively.
attributes play an essential role for a contour localization task. Motivated by this, we use multiple cues, analogous to different attributes in localization of contours, to generate a confidence map that indicates the degree of a pixel belonging to a label. These cues are:
– Probability of boundary (PB). Boundary is an important cue for distinguishing different segments in the image segmentation task. This cue takes the output of a classifier that provides the posterior probability of a boundary at each pixel. The classifier is trained by combining local brightness, color, and texture features in order to accurately detect the boundary [14]. Figure 1(b) shows the probability of a boundary of the image in Figure 1(a).
– Random walker segmentation (RW). In contrast to a hard segmentation, that is, a pixel belonging to either the foreground (1) or not (0) for the binary case, random walker segmentation [15] produces a soft segmentation, as shown in Figure 1(c), obtained by determining the probability that a random walker starting from an unlabeled pixel reaches the user pre-labeled pixels. The segmentation is obtained by assigning each point to the label that is of the largest probability.
– Intervening contour (IC). The intervening contour [16] concept suggests that pixels on the two different sides of a boundary are more likely to belong to different segments. Given two pixels on an image, the affinity between them is measured as the maximum gradient magnitude (or other measurements) of a straight-line path between them, as shown in Figure 1(d) (IC1).
In addition to using gradient magnitude as a measurement, we also compute the affinity between pixels based on the probability of a boundary. Figure 1(e) (IC2) shows the result of considering the probability of a boundary as a measurement in computing an intervening contour cue.

2.2 Cue Combination
Before all of the individual cues are combined, their values are normalized to [0, 1]. Let P(c_p^k | I) be the value at pixel p produced by cue k. The confidence value at pixel p is defined as a linear combination of all cues, which is

P(c_p | I) = \sum_k w_k P(c_p^k | I)    (1)
where w_k is the relative importance of cue k, and the sum of all w_k is equal to 1. The weight for each cue was set empirically. The confidence map shown in Figure 1(f) is a combination of Figure 1(b) through Figure 1(e). Note that the brighter color indicates a higher probability of belonging to an object.

2.3 Generation of Segmentation Alternatives
To generate multiple segmentation alternatives of an image, the confidence map is thresholded at different values. Figure 1(g) through Figure 1(j) show the segmentation alternatives obtained by using the threshold values between 0.4 and 0.7. The user can choose the most acceptable segmentation, if one exists, from these alternatives before applying any editing. This reduces the amount of time for user interaction.
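A compact sketch of Equation (1) and the thresholding of this section, assuming the cues have already been computed as 2-D arrays. The normalization to [0, 1] and the thresholds 0.4-0.7 follow the text, while the function names are our own.

```python
import numpy as np

def confidence_map(cues, weights):
    """Linear combination of normalized cue maps (Equation (1)); weights should sum to 1."""
    conf = np.zeros_like(cues[0], dtype=np.float64)
    for cue, w in zip(cues, weights):
        cue = cue.astype(np.float64)
        rng = cue.max() - cue.min()
        cue = (cue - cue.min()) / rng if rng > 0 else np.zeros_like(cue)
        conf += w * cue
    return conf

def segmentation_alternatives(conf, thresholds=(0.4, 0.5, 0.6, 0.7)):
    """Binary segmentation candidates obtained by thresholding the confidence map."""
    return [conf >= t for t in thresholds]
```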
3 Interactive Editing through Graph Cuts
User interaction to correct erroneous segmentation results is key to providing accurate segmentations. We cast the interactive editing problem as the Maximum A Posteriori (MAP) estimation of a Markov Random Field (MRF). The MAP-MRF formulation minimizes a Gibbs energy function that is defined over the user input, presegmentation (following the definition from [10], we define presegmentation as the prior, incorrect, pre-existing segmentation), and image intensity. Graph cuts are then used to obtain the optimal solution to the energy function.

3.1 Editing Energy Function
The interactive editing aims to obtain a new segmentation that satisfies a set of constraints: the user input, presegmentation, and image data. Analogous to image segmentation, the interactive editing is a labeling problem which involves assigning image pixels a set of labels [17]. Consider an image I containing the set of pixels P = {1, 2, ..., M}, a set of labels L = {l_1, l_2, ..., l_K},
Fig. 2. The difference between using the Euclidean distance and using the intervening contour cue in the unary potential. The blue and green scribbles in (c) indicate the foreground and background marks, respectively. (d) shows the penalty, based on the Euclidean distance, for changing the pixels classified as the background in (b) to the foreground. (e) shows the penalty by using the intervening contour cue. The color clearly shows transitions from black (low penalty) to gray (high penalty) when the intervening contour cue is used, and the transitions coincide with the object boundary.
the user input U = {u_p : p ∈ P, u_p ∈ L}, and a nearly correct presegmentation y = {y_p : p ∈ P, y_p ∈ L}. We seek a new optimal labeling x by means of minimizing an energy function given the user input U, presegmentation y, and image data D. The energy function is given as [18]:

E(x | y, U, D) = \sum_{p \in P} V_p(x_p | y, U, D) + \sum_{p \in P} \sum_{q \in N_p} V_{pq}(x_p, x_q | D)    (2)
where N_p is the set of neighboring pixels of p, V_p(x_p | y, U, D) is the unary clique potential, and V_{pq}(x_p, x_q | D) is the piecewise clique potential. We incorporate the user input and presegmentation into the unary potential and impose the boundary smoothness in the piecewise potential.

3.2 User Input and Presegmentation Constraints
To solve Equation 2 via graph cuts, two terminal nodes, s and t, are added into the graph. The unary potential defines the weights between a node p and terminal nodes. Because a presegmentation is nearly correct, the new segmentation should be similar to the presegmentation after editing. Only the pixels with changed labels are penalized. The penalty for changing a label is defined as:

w_{sp} = \begin{cases} \infty & p \in U_f \\ IC_b(p) & p \notin U_f \end{cases}    (3)

and

w_{pt} = \begin{cases} \infty & p \in U_b \\ IC_f(p) & p \notin U_b \end{cases}    (4)
where Uf and Ub are the foreground and background labels, respectively, and ICf (p) and ICb (p) are the affinities between a pixel p to the nearest of the user labeled pixels Uf and Ub , respectively. The unary potential is similar to that
in [10] that suggests use of the Euclidean distance from a pixel to the nearest of the user labeled pixels; however, our work considers the intervening contour cue that is important for distinguishing different objects. Figure 2 demonstrates the difference in using Euclidean distance and the intervening contour cue as the unary potential. As can be seen from Figure 2(e), the color clearly shows transitions from black (low penalty) to gray (high penalty), whereas no such transitions are shown in Figure 2(d) in which the penalties are given based on the Euclidean distance. The transitions in Figure 2(e) coincide with the object boundary, where a cut is more likely to occur.

3.3 Image Data Constraint
Piecewise potential ensures boundary smoothness by penalizing neighboring pixels assigned different labels. Based on the Potts model [9], it is given as:

V_{pq}(x_p, x_q | D) = w_{pq} \cdot (1 - \delta(x_p, x_q))    (5)
where \delta(x_p, x_q) is the Kronecker delta, and w_{pq} is a penalty for assigning two neighboring pixels, p and q, to different labels, defined using a Gaussian weighting function:

w_{pq} = \exp\left(-\frac{(I_p - I_q)^2}{2\sigma^2}\right) \cdot \frac{1}{\|p - q\|}    (6)

where I_p and I_q are pixel intensities ranging from 0 to 255, \|p - q\| is the Euclidean distance between p and q, and σ is a positive parameter.
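The terms of Equations (3)-(6) translate directly into edge capacities for a max-flow/min-cut solver (the solver itself is not shown here). In this sketch, ic_f and ic_b are the intervening-contour affinities to the nearest foreground and background strokes, fg_seeds and bg_seeds are boolean masks of the user strokes, and LARGE stands in for the infinite capacities; all of these names are our own assumptions rather than the authors' code.

```python
import numpy as np

LARGE = 1e9   # finite stand-in for the infinite capacity on user-labeled pixels

def terminal_weights(ic_f, ic_b, fg_seeds, bg_seeds):
    """Source/sink capacities of Equations (3) and (4)."""
    w_sp = np.where(fg_seeds, LARGE, ic_b)   # source -> pixel
    w_pt = np.where(bg_seeds, LARGE, ic_f)   # pixel -> sink
    return w_sp, w_pt

def pairwise_weight(i_p, i_q, dist, sigma=10.0):
    """Potts penalty of Equations (5)-(6) for a neighboring pixel pair."""
    return np.exp(-((i_p - i_q) ** 2) / (2.0 * sigma ** 2)) / dist
```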
4 Results
We carried out the experiments on the Serial Block Face Scanning Electron Microscopy (SBFSEM) [19] image stack to demonstrate the effectiveness of the proposed segmentation editing methods. The volumetric larval zebrafish tectum SBFSEM data set contains 561 images, each of which has the size of 631×539 pixels. A major challenge in reconstructing neuronal morphologies from SBFSEM image data lies in segmenting densely packed cells that have blurred boundaries mainly resulting from the imperfect image acquisition process. As a result, user editing is key to resolving the boundary ambiguities, in which case a segmentation algorithm fails to produce a satisfactory result. Thirty incorrect segmentations produced by an automated segmentation algorithm were used in the experiments. A few examples of editing through multiple choice are shown in Figure 3. These examples demonstrate that the generated alternatives contain a few acceptable segmentations, showing that the confidence map obtained by integrating different cues is reliable. Figure 4 shows the examples of interactive editing through graph cuts. The first, second, third, and last columns of Figure 4 show the original images, incorrect segmentations, user edits, and editing segmentation results, respectively. The blue strokes are the foreground marks, and the green strokes are background marks. This editing
Fig. 3. Examples of generated segmentation alternatives. (a) Original images. (b)-(e) The generated segmentation alternatives by thresholding the confidence map at 0.4, 0.5, 0.6, and 0.7, respectively. As can be seen, the generated alternatives contain at least a few acceptable segmentations. This shows that the confidence map obtained by integrating different cues is reliable. As a result, generating segmentations based on the confidence map is able to provide reasonable segmentation alternatives from which the user can choose one.
tool gives the user flexibility of placing strokes on a few pixels and produces a segmentation that meets the user's requirements. For quantitative evaluation, we applied the Dice similarity coefficient (DSC) [20] and F-measure to measure the similarity between two segmentations. DSC measures the amount of overlap between the obtained segmentation results and the ground truth. More specifically, letting Z be the set of voxels of the obtained segmentation results and G be the ground truth, DSC is given as DSC = 2|Z ∩ G| / (|Z| + |G|), where |·| is the number of voxels. A DSC value of 0 indicates no overlap between two segmentations, and 1 means two segmentations are identical. F-measure is defined as F = 2PR / (P + R), where P = |Z ∩ G| / |Z| and R = |Z ∩ G| / |G| are the precision and recall of the segmentation results relative to the ground truth.
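For reference, a minimal computation of DSC, precision, recall, and F-measure for two binary volumes (assuming both are non-empty); the helper name is ours.

```python
import numpy as np

def dsc_and_fmeasure(seg, gt):
    """Dice similarity coefficient and F-measure between a result and ground truth."""
    seg, gt = np.asarray(seg, dtype=bool), np.asarray(gt, dtype=bool)
    inter = np.logical_and(seg, gt).sum()
    dsc = 2.0 * inter / (seg.sum() + gt.sum())
    precision = inter / seg.sum()
    recall = inter / gt.sum()
    f = 2.0 * precision * recall / (precision + recall)
    return dsc, f
```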
Fig. 4. Examples of interactive segmentation editing. (a) shows the original SBFSEM images, (b) the incorrect segmentations produced by an automated segmentation algorithm, (c) the user edits, where the blue and green strokes represent the foreground and background marks, respectively, and (d) the editing results. The editing algorithm takes a number of user inputs, together with the incorrect segmentation and image data constraints, and computes a new segmentation accordingly.
When the editing through multiple choice method was applied to refining segmentations, around 15 out of 30 segmentation errors could be corrected (i.e., the user can obtain an acceptable segmentation from the generated alternatives). The DSC and F-measure for this method were 0.9670 and 0.9672, respectively. The interactive editing through graph cuts was used to correct all 30 erroneous

Table 1. Quantitative evaluation results of the proposed methods. The DSC values of both methods are above 0.96, indicating the obtained editing results are highly overlapped with ground truth segmentations. Both methods yield high F-measure values as well.

Method                            DSC     Precision  Recall  F-measure
Editing through multiple choice   0.9670  0.9767     0.9578  0.9672
Editing through graph cuts        0.9763  0.9676     0.9947  0.9810
segmentations. The DSC and F-measure for this method were 0.9763 and 0.9810, respectively. Table 1 summarizes the quantitative evaluation results of the proposed framework. As we can see, the DSC values of both methods are above 0.96, indicating that the editing results are highly overlapped with the manual segmentations. Moreover, the editing through multiple choice method can correct more than 50% of the segmentation errors, minimizing the user intervention prior to applying manual editing.
5 Conclusions and Future Work
In this paper, we presented an interactive editing framework that allows the user to correct segmentation errors produced by automated segmentation algorithms. By thresholding a confidence map using different values, the proposed editing framework first obtains a pool of alternative segmentations from which the user can select the most acceptable result, aiming to minimize user interaction. In addition, the editing framework includes an editing tool with which the user can place editing marks on a few pixels to produce the desired segmentation result. The editing task is formalized as an energy minimization problem and incorporates a set of constraints, ranging from the user input and presegmentation to image data, into the energy function. Experimental results showed that the proposed editing framework provides a promising solution to the segmentation of SBFSEM image data sets. Future work includes using the knowledge of neuronal topologies to analyze the reconstruction results and to detect potential errors, thus pointing out these errors to the user for further correction and minimizing the amount of time required for manual proofreading of the reconstruction results. Acknowledgements. This work was supported in part by NIH/NINDS #1R01NS54252 and NSF CRCNS #0905041. We would like to thank Stephen J. Smith (Stanford) for the SBFSEM data and Timothy Mann for the valuable comments and proofreading.
References 1. Andres, B., Köthe, U., Helmstaedter, M., Denk, W., Hamprecht, F.A.: Segmentation of SBFSEM volume data of neural tissue by hierarchical classification. In: Rigoll, G. (ed.) DAGM 2008. LNCS, vol. 5096, pp. 142–152. Springer, Heidelberg (2008) 2. Jain, V., Murray, J.F., Roth, F., Turaga, S.C., Zhigulin, V.P., Briggman, K.L., Helmstaedter, M., Denk, W., Seung, H.S.: Supervised learning of image restoration with convolutional networks. In: Proc. IEEE Int'l Conf. Computer Vision, pp. 1–8 (2007) 3. Jurrus, E., Hardy, M., Tasdizen, T., Fletcher, P., Koshevoy, P., Chien, C.B., Denk, W., Whitaker, R.: Axon tracking in serial block-face scanning electron microscopy. Medical Image Analysis 13, 180–188 (2009)
4. Kaynig, V., Fuchs, T., Buhmann, J.M.: Neuron geometry extraction by perceptual grouping in ssTEM images. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 2902–2909 (2010) 5. Mishchenko, Y.: Automation of 3D reconstruction of neural tissue from large volume of conventional serial section transmission electron micrographs. J. Neuroscience Methods 176, 276–289 (2009) 6. Yang, H.F., Choe, Y.: Cell tracking and segmentation in electron microscopy images using graph cuts. In: Proc. IEEE Int’l. Symp. Biomedical Imaging, pp. 306–309 (2009) 7. Kang, Y., Engelke, K., Kalender, W.A.: Interactive 3D editing tools for image segmentation. Medical Image Analysis 8, 35–46 (2004) 8. Jain, V., Seung, H.S., Turaga, S.C.: Machines that learn to segment images: a crucial technology for connectomics. Current Opinion in Neurobiology 20, 653–666 (2010) 9. Boykov, Y., Jolly, M.: Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: Proc. IEEE Int’l. Conf. Computer Vision, pp. 105–112 (2001) 10. Grady, L., Funka-Lea, G.: An energy minimization approach to the data driven editing of presegmented images/volumes. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4191, pp. 888–895. Springer, Heidelberg (2006) 11. Mortensen, E.N., Barrett, W.A.: Intelligent scissors for image composition. In: SIGGRAPH 1995: Int’l conf. Computer Graphics and Interactive Techniques, pp. 191–198 (1995) 12. Pantofaru, C., Schmid, C., Hebert, M.: Object recognition by integrating multiple image segmentations. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 481–494. Springer, Heidelberg (2008) 13. Rivest, J., Cabanagh, P.: Localizing contours defined by more than one attribute. Vision Research 36, 53–66 (1996) 14. Martin, D.R., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal. Mach. Intell. 26, 530–549 (2004) 15. Grady, L.: Random walks for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 28, 1768–1783 (2006) 16. Leung, T., Malik, J.: Contour continuity in region based image segmentation. In: Burkhardt, H.-J., Neumann, B. (eds.) ECCV 1998, Part I. LNCS, vol. 1406, pp. 544–559. Springer, Heidelberg (1998) 17. Li, S.Z.: Markov random field modeling in image analysis. Springer-Verlag New York, Inc., Secaucus (2001) 18. Vu, N.: Image Segmentation with Semantic Priors: A Graph Cut Approach. PhD thesis, The University of California at Santa Barbara (2008) 19. Denk, W., Horstmann, H.: Serial block-face scanning electron microscopy to reconstruct three-dimensional tissue nanostructure. PLoS Biology 2, e329 (2004) 20. Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26, 297–302 (1945)
Retinal Vessel Extraction Using First-Order Derivative of Gaussian and Morphological Processing
M.M. Fraz1, P. Remagnino1, A. Hoppe1, B. Uyyanonvara2, Christopher G. Owen3, Alicja R. Rudnicka3, and S.A. Barman1
1 Faculty of Computing, Information Systems and Mathematics, Kingston University, London, United Kingdom
2 Department of Information Technology, Thammasat University, Thailand
3 Division of Population Health Sciences and Education, St. George's, University of London, United Kingdom
{moazam.fraz,s.barman,p.remagnino,a.hoppe}@kingston.ac.uk, [email protected], {cowen,arudnick}@sgul.ac.uk
Abstract. The change in morphology, diameter, branching pattern and/or tortuosity of retinal blood vessels is an important indicator of various clinical disorders of the eye and the body. This paper reports an automated method for segmentation of blood vessels in retinal images by means of a unique combination of differential filtering and morphological processing. The centerlines are extracted by the application of first order derivative of Gaussian in four orientations and then the evaluation of derivative signs and average derivative values is made. The shape and orientation map of the blood vessel is obtained by applying a multidirectional morphological top-hat operator followed by bit plane slicing of a vessel enhanced grayscale image. The centerlines are combined with these maps to obtain the segmented vessel tree. The approach is tested on two publicly available databases and results show that the proposed algorithm can obtain robust and accurate vessel tracings with a performance comparable to other leading systems.
1 Introduction
The morphological features of retinal vessels have been related to cardiovascular risk factors in childhood and adulthood, such as blood pressure and body size, and with cardiovascular outcomes in later life, including both coronary heart disease and stroke [1]. Retinal vessels are composed of arteriolars and venules, which appear as elongated branched features emanating from the optic disc within a retinal image. Retinal vessels often have strong light reflexes along their centerline, which is more apparent on arteriolars than venules, and in younger compared to older participants, especially those with hypertension. The vessel cross-sectional intensity profiles approximate to a Gaussian shape, or a mixture of Gaussians in the case where a central vessel reflex is present. The orientation and grey level of a vessel does not change abruptly; they are locally linear and gradually change in intensity along their lengths. The vessels can be expected to be connected and, in the retina, form a binary treelike structure. However, the shape, size and local grey level of blood vessels can vary hugely and some background features
may have similar attributes to vessels. Arteriolar-venule crossing and branching can further complicate the profile model. As with the processing of most medical images, signal noise, drift in image intensity and lack of image contrast pose significant challenges to the extraction of blood vessels. It is commonly accepted by the medical community that automatic quantification of retinal vessels is the first step in the development of a computer assisted diagnostic system for ophthalmologic disorders. The methodologies for detecting retinal blood vessels reported in the literature can be classified into techniques based on pattern recognition, match filtering, morphological processing, vessel tracking, multiscale analysis and model based algorithms. The pattern recognition methods can be further divided into two categories: supervised methods and unsupervised methods. Supervised methods utilize ground truth data for the classification of vessels based on given features. These methods include neural networks, principal component analysis (PCA) [2], k-nearest neighbor classifiers and ridge based primitives classification [3], line operators and support vector classification [4], Gabor wavelet in combination with PCA and mixture of Gaussian [5]. The unsupervised methods work without any labeled ground truths and prior knowledge such as the Fuzzy C-Means clustering algorithm [6]. The matched filtering methodology exploits the piecewise linear approximation and the Gaussian-like intensity profile of retinal blood vessels and uses a kernel based on a Gaussian or its derivatives to enhance the vessel features in the retinal image. The matched filter (MF) was first proposed by Chaudhuri [7] and later adapted and extended by Hoover [8] and Jiang [9]. An amplitude-modified second order Gaussian filter was proposed by Gang and Chutatape [10] whereas Zhang et al. [11] proposed an extension and generalization of MF with a first-order derivative of Gaussian. A hybrid model of the MF and ant colony algorithm for retinal vessel segmentation was proposed by Cinsdikici et al. [12]. The algorithms based on mathematical morphology for identifying vessel structures have the advantage of speed and noise resistance. Zana and Klein [13] combines morphological filters and cross-curvature evaluation to segment vessel-like patterns. Mendonca et al. [14] detected vessel centerlines in combination with multiscale morphological reconstruction. Fraz at el. [15] combined vessel centerlines with bit planes to extract the retinal vasculature. The tracking based approaches [6] segment a vessel between two points using local information and works at the level of a single vessel rather than the entire vasculature. The multiscale approaches for vessel segmentation are based on scale space analysis [16, 17]. The width of a vessel decreases gradually across its length as it travels radially outward from the optic disc. Therefore the idea behind scale-space representation for vascular extraction is to separate out information related to the blood vessel having varying widths. The model based approaches utilize the vessel profile models [18], active contour models [19] and geometric models based on level sets [20] for vessel segmentation. This paper presents a new approach for the automated localization of the retinal vasculature. The foundation of this work is formed by the combination of two processes: (1) the detection of vessel centerlines and (2) the bit plane slicing of the morphologically enhanced image. 
There are several methods in the literature for the detection of vessel centerlines in retinal images. Lowell [21] uses a tramline filter for labeling of vessel centerlines. The algorithm presented by Staal [3] is based on the extraction of image ridges, which coincide approximately with vessel centerlines. Mendonca [14] extracts the vessel centerlines with the Difference of Offset Gaussian
filter kernel, which is derived by subtracting the binomial kernel from itself at a distance along the cardinal axes and is an approximation of the Gaussian derivative. Sun [22] detects vessel centerlines by employing fuzzy morphological enhancement followed by thresholding and morphological thinning. Sofka [23] presents a likelihood ratio test for vessel centerline extraction that combines matched-filter responses, confidence measures and vessel boundary measures. The normalized gradient vector field for centerline extraction is used by Lam [24]. Espona [25] extracts the vessel centerlines using the Multilevel Set Extrinsic Curvature based on the Structure Tensor (MSEC-ST) [26], which guides the proposed contour model for vessel segmentation.

The uniqueness of this work lies in the detection of vessel centerlines by employing a filter kernel derived by computing the first-order derivative of the original matched filter [7] that was initially proposed for vessel detection. The other key contribution is to demonstrate the application of morphological bit planes and their combination with vessel centerlines for segmentation. The results of this work are evaluated with a number of performance metrics and are compared with those from other recent methods, leading to the conclusion that the proposed algorithm is comparable with other approaches in terms of accuracy, sensitivity and specificity.

The organization of the paper is as follows. Section 2 illustrates the proposed methodology in detail. Some illustrative experimental results of the algorithm on images of the DRIVE [3] and STARE [8] image sets are presented in Section 3, and the paper is concluded in Section 4.
2 Proposed Methodology

The methodology adopts a two-stage approach in which the centerlines are detected using the First-order Derivative of Gaussian (FoDoG) and are combined with the vessel shape and orientation maps to produce the segmented image. The variation in background grey levels in the retinal image is normalized by computing the maximum principal curvature, thus obtaining a vessel-enhanced image. The FoDoG filter is applied in four orientations to this vessel-enhanced image. The candidate pixels for vessel centerlines are detected by evaluating the derivative signs and the average derivative values in the four matched filter response (MFR) images. The vessel centerline pixels are identified in each directional candidate pixel image by using image statistical measures. The shape and orientation map of the retinal vessels is acquired by bit plane slicing of the morphologically vessel-enhanced image. The enhanced vessel image is the result of the sum of images obtained by applying the top-hat transformation on a monochromatic representation of the retinal image using a linear structuring element rotated in eight directions. The detected vessel centerlines are then combined with the shape and orientation map to produce the binary segmented vascular tree.

2.1 Differential Filtering

The matched filter that was first proposed in [7] to detect vessels in retinal images is defined as
f(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right); \quad \text{for } |x| \le t\sigma,\ |y| \le L/2 \qquad (1)

and the first-order derivative g(x, y) of f(x, y) is defined as

g(x, y) = -\frac{x}{\sqrt{2\pi}\,\sigma^3} \exp\left(-\frac{x^2}{2\sigma^2}\right); \quad \text{for } |x| \le t\sigma,\ |y| \le L/2 \qquad (2)
where σ represents the scale of the filter, L is the length of the neighborhood along the y-axis, and t is a constant usually set to 3 because more than 99% of the area under the Gaussian curve lies within the range [-3σ, 3σ]. There is a gradual intensity variation in the background of retinal images, taking place from the periphery of the retina towards the central macular region. The vessels are enhanced from the background by computing the maximum principal curvature from the eigendecomposition of the Hessian matrix [16] for each pixel in the retinal image, as illustrated in Fig. 1(b). When a first-order derivative filter is applied orthogonally to the main orientation of a vessel, there will be positive values on one side of the vessel cross section and negative values on the other. The maximum local intensity across the vessel cross-section profile generally occurs at the vessel center pixels; therefore the connected sets of pixels which correspond to intensity maxima of the vessel cross-section profile are identified as centerline pixels. The FoDoG filter kernel is convolved with the vessel-enhanced image in four orientations of 0°, 45°, 90° and 135°. The filter responses are illustrated in Fig. 1(c)-(f).
Fig. 1. (a) Green channel image; (b) vessel-enhanced image; (c)-(f) FoDoG filter response on the image shown in (b) at 0°, 45°, 90° and 135°, respectively
The candidate pixels for vessel centerlines are marked by looking for a specific combination of derivative signs across the cross section profile of the vessel and orthogonally to the vessel in each one of the resulting four filter response images. We have adopted the technique used by [14] for evaluation of derivative sign and average
derivative values of the filter response images. A set of centerline segments in one particular direction is produced for each of the filter response images, as shown in Fig. 2(a)-(d). The final vessel centerline image is obtained by combining all four images, as illustrated in Fig. 2(e).
Fig. 2. (a)-(d) Vessel centerline segments at 0°, 45°, 90° and 135°, respectively; (e) vessel centerline image; (f) centerlines superimposed on the grayscale retinal image
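To make the sign-combination step concrete, one possible simplified test is sketched below; the zero-crossing criterion and the strength threshold are our assumptions and stand in for the statistical measures actually used by the authors.

import numpy as np

def centerline_candidates(response, axis=0, min_strength=None):
    """Mark pixels where the first-derivative response changes sign from + to -
    along 'axis' (the direction across the vessel) and is locally strong."""
    r = np.asarray(response, dtype=float)
    if min_strength is None:
        min_strength = r.std()                      # crude data-driven threshold (assumption)
    idx = np.arange(r.shape[axis] - 1)
    cur = np.take(r, idx, axis=axis)
    nxt = np.take(r, idx + 1, axis=axis)
    crossing = (cur > 0) & (nxt < 0)                # + to - transition marks an intensity maximum
    strong = (np.abs(cur) + np.abs(nxt)) > min_strength
    cand = np.zeros(r.shape, dtype=bool)
    sl = [slice(None)] * r.ndim
    sl[axis] = slice(0, r.shape[axis] - 1)
    cand[tuple(sl)] = crossing & strong
    return cand

# The four directional candidate images are then merged, e.g. by a logical OR:
# centerlines = cand_0 | cand_45 | cand_90 | cand_135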
2.2 Morphological Processing
The structure of the blood vessels is enhanced in contrast against the background of the retinal image by the application of morphological operators on the green channel of the colored retinal image.

I_{th}^{\theta} = I - (I \circ S_e^{\theta}) \qquad (3)

I_{sth} = \sum_{\theta \in ang} I_{th}^{\theta} \qquad (4)
The morphological top-hat transformation is shown in equation (3), where Ithθ is the top-hat transformed image with the line structuring element oriented at angle θ, I is the image to be processed, Se is the structuring element for the morphological opening "∘", and θ is the angular rotation of the structuring element. A set of line structuring elements, each a matrix representing a line of 21 pixels in length and rotated in steps of 22.5°, is used for the top-hat transformation. The sum of the resulting images is obtained as depicted in equation (4), where Isth is the sum of the top-hat transformations. The set ang can be defined as ang = { x | 0 ≤ x < 180 and x mod 22.5 = 0 }. The enhancement of vessels by the sum of the top-hat transformation is illustrated
in Fig. 3(a). Bit plane slicing highlights the contribution made to the total image appearance by specific bits. Separating a digital image into its bit planes is useful for analyzing the relative importance of each bit of the image, and it aids in determining the adequacy of the number of bits used to quantize each pixel. The grayscale image resulting from the sum of the top-hat operation can be represented in the form of bit planes, ranging from bit plane 1 for the least significant bit to bit plane 8 for the most significant bit. It is observed that the higher-order bits, especially the top two, contain the majority of the visually significant data; the other bit planes contribute to more subtle details in the image and appear as noise. A single binary image is obtained by combining (logical OR of) bit plane 7 and bit plane 8, as shown in Fig. 3(b). This image is an approximation of the skeleton of the retinal blood vessels. The final image of the vascular tree is obtained by a region-growing procedure, taking the vessel centerlines as seed points and the vessel skeleton image as the aggregate threshold, as shown in Fig. 3(c).
Fig. 3. (a) Sum of top hat image; (b) logical OR of bit 7 and bit 8 of sum of top hat image; (c) segmented image; (d) segmented vessels superimposed on grayscale retinal image
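The sum-of-top-hats enhancement (Eqs. 3-4) and the bit-plane step can be prototyped as follows; the line-footprint helper, the 8-bit rescaling and the use of scipy.ndimage.white_tophat are our choices for illustration, not the authors' code.

import numpy as np
from scipy.ndimage import white_tophat

def line_footprint(length=21, angle_deg=0.0):
    """Boolean footprint approximating a 1-pixel-wide line of the given length and angle."""
    half = length // 2
    fp = np.zeros((length, length), dtype=bool)
    theta = np.deg2rad(angle_deg)
    for s in np.linspace(-half, half, 2 * length):
        r = int(round(half + s * np.sin(theta)))
        c = int(round(half + s * np.cos(theta)))
        fp[r, c] = True
    return fp

def sum_of_tophats(green, length=21, step=22.5):
    """Sum of white top-hats over line structuring elements rotated every 22.5 degrees."""
    img = np.asarray(green, dtype=float)
    acc = np.zeros_like(img)
    for angle in np.arange(0.0, 180.0, step):          # eight orientations
        acc += white_tophat(img, footprint=line_footprint(length, angle))
    return acc

def vessel_shape_map(sum_tophat):
    """Rescale to 8 bits and keep the two most significant bit planes (bits 7 and 8)."""
    rng = max(float(np.ptp(sum_tophat)), 1e-9)
    img8 = np.uint8(255.0 * (sum_tophat - sum_tophat.min()) / rng)
    return (img8 & 0b11000000) > 0                     # OR of the top two bit planes

The resulting binary shape map plays the role of the vessel skeleton, and the centerline pixels are then used as seeds for the region-growing step described above.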
3 Results and Discussion
The methodology has been evaluated on two publicly available DRIVE and STARE databases. The manual segmentation of vessels performed by two different observers is available with the databases. The 1st human observer is selected as the ground truth and the comparison of the 2nd observer with the ground truth image is regarded as the target performance level. In the retinal vessel segmentation process, the outcome is a pixel-based classification result. Any pixel is classified either as vessel or surrounding tissue. Consequently, there are four events; two classifications and two misclassifications. The classifications are the True Positive (TP) where a pixel is identified as vessel in both the ground truth and segmented image, and the True Negative (TN) where a pixel is classified as a non-vessel in the ground truth and segmented image. The two misclassifications are the False Negative (FN) where a pixel is classified as non-vessel in the segmented image but as a vessel pixel in the ground truth image, and the False Positive (FP) where a pixel is marked as vessel in the segmented image but non-vessel in the ground truth image. These terminologies are illustrated in Table 1.
Table 1. Vessel Classification

                     Vessel detected         Vessel not detected
Vessel Present       True Positive (TP)      False Negative (FN)
Vessel Absent        False Positive (FP)     True Negative (TN)
The algorithm is evaluated in terms of sensitivity (SN), specificity (SP), accuracy, positive predictive value (PPV), negative predictive value (NPV), false discovery rate (FDR) and the Matthews correlation coefficient (MCC) [27], which is often used in machine learning and is a measure of the quality of binary (two-class) classifications. These metrics are defined in Table 2 based on the terms in Table 1 and are computed for the databases using the green channel of RGB images.

Table 2. Performance metrics for retinal vessel segmentation

Measure     Description
SN          TP / (TP + FN)
SP          TN / (TN + FP)
Accuracy    (TP + TN) / (TP + FP + TN + FN)
PPV         TP / (TP + FP)
NPV         TN / (TN + FN)
FDR         FP / (FP + TP)
MCC         (TP·TN - FP·FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
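For completeness, the metrics of Table 2 can be computed from a pair of binary masks as in the sketch below; restricting the computation to the camera field of view, as is common practice, is omitted here.

import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise metrics of Table 2; pred and truth are boolean vessel masks."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    mcc_den = np.sqrt(float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        'SN':  tp / (tp + fn),
        'SP':  tn / (tn + fp),
        'Acc': (tp + tn) / (tp + fp + tn + fn),
        'PPV': tp / (tp + fp),
        'NPV': tn / (tn + fn),
        'FDR': fp / (fp + tp),
        'MCC': (tp * tn - fp * fn) / mcc_den if mcc_den > 0 else 0.0,
    }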
Table 3 illustrates the performance of the proposed segmentation methodology according to the defined performance metrics for the test set of the DRIVE and STARE databases. The Root Mean Squared Errors (RMSE) for average accuracy for the DRIVE and STARE databases are shown in Table 4.

Table 3. Performance metrics on DRIVE and STARE databases

                    DRIVE Database                        STARE Database
Metric Used         2nd Human Observer   Algorithm        2nd Human Observer   Algorithm
Average Accuracy    0.9470               0.9430           0.9348               0.9437
Average SN          0.7763               0.7152           0.8951               0.7409
Average SP          0.9723               0.9768           0.9384               0.9665
Average PPV         0.8066               0.8205           0.6425               0.7363
Average NPV         0.9672               0.9587           0.98837              0.9709
Average FDR         0.1933               0.1794           0.3574               0.2637
Average MCC         0.7601               0.7333           0.7224               0.7003
Table 4. Root Mean Square Error for Average Accuracy

DRIVE database    STARE database
0.2371            0.3485
Table 5. Performance of vessel segmentation methods (DRIVE Images)

No   Methods               Average Accuracy   Sensitivity   Specificity
1    2nd Human Observer    0.9470             0.7763        0.9725
2    Zana [13]             0.9377             0.6971        N.A
3    Niemeijer [28]        0.9416             0.7145        0.9696
4    Jiang [9]             0.9212             0.6399        N.A
5    Al-Diri [19]          0.9258             0.6716        N.A
6    Martinez P. [17]      0.9181             0.6389        N.A
7    Chaudhuri [7]         0.8773             0.3357        N.A
8    Mendonca [14]         0.9452             0.7344        0.9764
9    Soares [5]            0.9466             0.7285        0.9786
10   Staal [3]             0.9441             0.7345        N.A
11   Fraz [15]             0.9303             0.7114        0.9680
12   Proposed Method       0.9430             0.7152        0.9768

N.A = Not Available
Table 6. Performance of Vessel Segmentation Methods (STARE Images)

No   Methods               Average Accuracy   Sensitivity   Specificity
1    2nd Human Observer    0.9348             0.8951        0.9384
2    Hoover [8]            0.9267             0.6751        0.9567
3    Jiang [9]             0.9009             N.A           N.A
4    Staal [3]             0.9516             0.6970        0.9810
5    Soares [5]            0.9478             0.7197        0.9747
6    Mendonca [14]         0.9440             0.6996        0.9730
7    Fraz [15]             0.9367             0.6849        0.9710
8    Proposed Method       0.9442             0.7311        0.9680

N.A = Not Available
RMSE is the measure of the differences between the values predicted by the algorithm and the values observed by the 2nd human observer. The RMSE of the average accuracy is less than 0.5, illustrating that the results from the proposed algorithm are statistically indistinguishable from those of the 2nd human observer in terms of average accuracy.
The results of the algorithm are also compared with other published methods in terms of average accuracy, specificity and sensitivity. The performance metrics PPV, NPV, FDR and MCC are not available for most of the published methods. The comparison with other segmentation methods for DRIVE and STARE is illustrated in Table 5 and Table 6, respectively.
4 Conclusion

A new method for retinal vessel segmentation based on the first-order derivative of Gaussian and morphological image processing is proposed in this paper and offers a valuable tool for vascular tree segmentation in ocular fundus images. The obtained results illustrate that the proposed method is comparable with contemporary methods in terms of average accuracy, true positives and false positives. The skeleton detection algorithm is based on the fact that the maximum local intensity across a blood vessel cross-section profile in the retina generally occurs at the vessel center pixels, and the vessel centerlines are measured as the connected sets of pixels which correspond to the intensity maxima computed from the intensity profiles of the vessel cross sections. This assumption works well for vessels featuring a Gaussian-shaped profile. However, the method is less suited to retinal images where arterioles show prominent light reflexes (e.g. in images of younger patients). The application of a dual-Gaussian profile or a Hermite-polynomial-shaped model to images of children with prominent vessel reflexes is a topic of future work. The vessel segmentation results of this algorithm are available online at http://studentnet.kingston.ac.uk/~k0961040/fodog.html.
References 1. Leung, H., Wang, J.J., Rochtchina, E., Tan, A.G., Wong, T.Y., Klein, R., Hubbard, L.D., Mitchell, P.: Relationships between Age, Blood Pressure, and Retinal Vessel Diameters in an Older Population. Investigative Ophthalmology & Vis. (2003) 2. Sinthanayothin, C., Boyce, J.F., Cook, H.L., Williamson, T.H.: Automated localisation of the optic disc, fovea, and retinal blood vessels from digital colour fundus images. British Journal of Ophthalmology 83, 902–910 (1999) 3. Staal, J., Abramoff, M.D., Niemeijer, M., Viergever, M.A., van Ginneken, B.: Ridge-based vessel segmentation in color images of the retina. IEEE Transactions on Medical Imaging 23, 501–509 (2004) 4. Ricci, E., Perfetti, R.: Retinal Blood Vessel Segmentation Using Line Operators and Support Vector Classification. IEEE Transactions on Medical Imaging 26, 1357–1365 (2007) 5. Soares, J.V.B., Leandro, J.J.G., Cesar, R.M., Jelinek, H.F., Cree, M.J.: Retinal vessel segmentation using the 2-D Gabor wavelet and supervised classification. IEEE Transactions on Medical Imaging 25, 1214–1222 (2006) 6. Tolias, Y.A., Panas, S.M.: A fuzzy vessel tracking algorithm for retinal images based on fuzzy clustering. IEEE Transactions on Medical Imaging 17, 263–273 (1998) 7. Chaudhuri, S., Chatterjee, S., Katz, N., Nelson, M., Goldbaum, M.: Detection of blood vessels in retinal images using two-dimensional matched filters. IEEE Transactions on Medical Imaging 8, 263–269 (1989)
8. Hoover, A.D., Kouznetsova, V., Goldbaum, M.: Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical Imaging 19, 203–210 (2000) 9. Xiaoyi, J., Mojon, D.: Adaptive local thresholding by verification-based multithreshold probing with application to vessel detection in retinal images. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 131–137 (2003) 10. Gang, L., Chutatape, O., Krishnan, S.M.: Detection and measurement of retinal vessels in fundus images using amplitude modified second-order Gaussian filter. IEEE Transactions on Biomedical Engineering 49, 168–172 (2002) 11. Zhang, B., Zhang, L., Zhang, L., Karray, F.: Retinal vessel extraction by matched filter with first-order derivative of Gaussian. Computers in Biology and Medicine 40, 438–445 (2010) 12. Cinsdikici, M.G., Aydin, D.: Detection of blood vessels in ophthalmoscope images using MF/ant (matched filter/ant colony) algorithm. Computer Methods and Programs in Biomedicine 96, 85–95 (2009) 13. Zana, F., Klein, J.C.: Segmentation of vessel-like patterns using mathematical morphology and curvature evaluation. IEEE Transactions on Image Processing 10, 1010–1019 (2001) 14. Mendonca, A.M., Campilho, A.: Segmentation of retinal blood vessels by combining the detection of centerlines and morphological reconstruction. IEEE Transactions on Medical Imaging 25, 1200–1213 (2006) 15. Fraz, M.M., Javed, M.Y., Basit, A.: Retinal Vessels Extraction Using Bitplanes. Presented at Eight IASTED International Conference on Visualization, Imaging, and Image Processing, Palma De Mallorca, Spain (2008) 16. Frangi, A.F., Niessen, W.J., Vincken, K.L., Viergever, M.A., William, W., Alan, C., Scott, D.: Multiscale Vessel Enhancement Filtering. In: Wells, W.M., Colchester, A.C.F., Delp, S.L. (eds.) MICCAI 1998. LNCS, vol. 1496, pp. 130–137. Springer, Heidelberg (1998) 17. Martinez-Perez, M.E., Hughes, A.D., Stanton, A.V., Thom, S.A., Bharath, A.A., Parker, K.H.: Retinal Blood Vessel Segmentation by Means of Scale-Space Analysis and Region Growing. Presented at Proceedings of the Second International Conference on Medical Image Computing and Computer-Assisted Intervention, London, UK (1999) 18. Li, W., Bhalerao, A., Wilson, R.: Analysis of Retinal Vasculature Using a Multiresolution Hermite Model. IEEE Transactions on Medical Imaging 26, 137–152 (2007) 19. Al-Diri, B., Hunter, A., Steel, D.: An Active Contour Model for Segmenting and Measuring Retinal Vessels. IEEE Transactions on Medical Imaging 28, 1488–1497 (2009) 20. Sum, K.W., Cheung, P.Y.S.: Vessel Extraction Under Non-Uniform Illumination: A Level Set Approach. IEEE Transactions on Biomedical Engineering 55, 358–360 (2008) 21. Lowell, J., Hunter, A., Steel, D., Basu, A., Ryder, R., Kennedy, R.L.: Measurement of retinal vessel widths from fundus images based on 2-D modeling. IEEE Transactions on Medical Imaging 23, 1196–1204 (2004) 22. Sun, K., Chen, Z., Jiang, S., Wang, Y.: Morphological Multiscale Enhancement, Fuzzy Filter and Watershed for Vascular Tree Extraction in Angiogram. Journal of Medical Systems (2010) 23. Sofka, M., Stewart, C.V.: Retinal Vessel Centerline Extraction Using Multiscale Matched Filters, Confidence and Edge Measures. IEEE Transactions on Medical Imaging 25, 1531– 1546 (2006) 24. Lam, B.S.Y., Hong, Y.: A Novel Vessel Segmentation Algorithm for Pathological Retina Images Based on the Divergence of Vector Fields. IEEE Transactions on Medical Imaging 27, 237–246 (2008)
25. Espona, L., Carreira, M.J., Ortega, M., Penedo, M.G.: A Snake for Retinal Vessel Segmentation. In: Martí, J., Benedí, J.M., Mendonça, A.M., Serrat, J. (eds.) IbPRIA 2007, Part II. LNCS, vol. 4478, pp. 178–185. Springer, Heidelberg (2007) 26. López, A.M., Lloret, D., Serrat, J., Villanueva, J.J.: Multilocal Creaseness Based on the Level-Set Extrinsic Curvature. Computer Vision and Image Understanding 77, 111–144 (2000) 27. Kohavi, R., Provost, F.: Glossary of Terms. Machine Learning 30, 271–274 (1998) 28. Abramoff, M.N., Staal, J.J., Ginneken, B.v., Loog, M., Abramooff, M.D.: Comparative study of retinal vessel segmentation methods on a new publicly available database. Presented at SPIE Medical Imaging (2004)
High-Quality Shadows with Improved Paraboloid Mapping

Juraj Vanek, Jan Navrátil, Adam Herout, and Pavel Zemčík

Brno University of Technology, Faculty of Information Technology, Czech Republic
http://www.fit.vutbr.cz
Abstract. This paper introduces a novel approach to the rendering of high-quality shadows in real time. The contemporary approach to real-time shadow rendering is the shadow maps algorithm. Standard shadow maps suffer from discretization and aliasing due to the limited resolution of the shadow texture. Moreover, omnidirectional lights need additional treatment to work correctly (e.g. with dual-paraboloid shadow maps). We propose a new technique that significantly reduces the aliasing problem and works correctly for various kinds of light sources. Our technique adapts the light's field-of-view to the visible part of the scene and thus stores only the relevant part of the scene at the full resolution of the shadow map. It produces high-quality shadows which are not constrained by the position or type of the light sources and also works in fully dynamic scenes. This paper describes the new approach, presents some results, and discusses the potential, features and future of the approach.
1 Introduction
Shadows constitute an important part of computer graphics rendering methods because they allow for better perception of the depth complexity of the scene. Without shadows human viewers are not able to fully determine in-depth object positions in the virtually created scene. Virtual scenes are often very dynamic, so if we want to achieve high quality shadows the algorithm has to be robust and
Fig. 1. Large indoor scene with hemispherical light source and 1024 × 1024 paraboloid shadow map (Left). Same scene with improved paraboloid map of the same size (Right).
work regardless of the scene's configuration, specifically the light and camera positions. An example can be seen in applications for modeling and visualization, where developers require fast and accurate shadow casting independent of light types, camera position and scene complexity.

At present, the shadow algorithms used in real-time applications can be divided into two large groups: the first is based on shadow volumes [1] and the second on shadow maps [2]. Each of these two methods has advantages and disadvantages and is differently suited to different types of light sources. Shadow volumes can easily be used to implement omnidirectional lights with per-pixel accuracy without any modifications. However, they depend heavily on scene geometry, so in complex scenes this approach is very fill-rate intensive and can hardly ever be used to compute shadows in real time. In contrast, the shadow maps algorithm is much less dependent on the geometric complexity of the scene, as it only requires a single additional drawing pass into the depth buffer. However, being a discrete image-space algorithm, it suffers from discretization and aliasing problems, and the implementation of omnidirectional shadows from point lights is not straightforward because it requires more than one shadow map [3]. With the increasing geometrical complexity of today's scenes, the shadow maps algorithm is nonetheless the most frequently used method to render shadows in real-time applications despite its limitations.

Our approach introduces a new method that improves the dual-paraboloid shadow maps method first introduced by Brabec et al. [3] and further enhanced by Osman et al. [4]. This improvement is based on choosing an optimal paraboloid orientation and cutting the paraboloid unevenly to improve the sampling of shadows in the important areas corresponding to the view frustum. It allows for the rendering of high-quality shadows regardless of light type (e.g. point light, spot light or directional light) and light or camera position. Therefore, the method is suitable for large, dynamic, indoor and outdoor scenes where high-accuracy shadows are needed but where, because of the high geometric complexity of the scene, traditional approaches to rendering shadows from point lights, such as shadow volumes, cannot be used. Our solution is designed to run optimally on modern graphics accelerators, and the measurements show no measurable performance penalty compared to the original dual-paraboloid shadow maps.
2 Previous Work
As mentioned above, two main branches of shadow algorithms are currently used for real-time rendering: the shadow volumes algorithm [1], which works in object space, and the shadow maps algorithm [2], which works in image space. Shadow volumes are created from an object's silhouettes extruded towards the light source. Since this algorithm works in object space, all shadows are hard with per-pixel accuracy, and they are very suitable and simply applicable for omnidirectional lights. However, the generation of the silhouettes and the rendering of the shadow volumes are done on the CPU, so for scenes with high geometry complexity the
algorithm is slow. However, modifications that allow the computation of the silhouettes to be accelerated on the GPU are available [5]. The shadow maps algorithm presents an advantage compared to the shadow volumes algorithm because it is not as dependent on the scene complexity. However, as the shadow texture has limited resolution, sampling and aliasing problems inevitably occur. Nowadays, when the geometrical complexity of graphics models and scenes is increasing, the shadow maps algorithm seems to be the better solution for real-time scene shadowing despite its limitations.

Many methods for removing shadow map imperfections exist, especially ones dealing with the sampling and aliasing problems [6][7][8]. In these methods, the scene and the light source are transformed into post-perspective space and the shadow map is generated in this space as well. The Cascaded Shadow Maps algorithm [9] can be seen as a discretization of these approaches, where the re-parameterization is replaced with the use of multiple shadow maps.

The above-mentioned shadow mapping techniques work with spot lights only, because the field of view must be limited when generating the shadow map. In order to obtain shadows from omnidirectional light sources, the whole environment needs to be covered as seen from the light source. At the moment, multiple shadow maps [3,4] seem to be the only available solution. The most straightforward solution is to render the scene into six shadow maps, one for each side of a cube, so that all scene directions are captured. This process requires no additional scene warping, and hardware perspective projection with cube mapping can be used, but the six additional passes to render the scene present a serious disadvantage. Although it is possible to use single-pass rendering with a geometry shader on the newest hardware, the GPU still has to process the scene geometry six times, making this approach unsuitable for complex, dynamic scenes.

A dual-paraboloid shadow maps algorithm (DPSM), first introduced by Brabec et al. [3], is in most cases a better solution. It uses only two passes, and the scene is rendered with a 180° field of view for each pass while paraboloid projection is used for each vertex. Paraboloid projection does not suffer from the scene distortion and singularities that occur in traditional sphere mapping: the two paraboloids are put back to back, covering a full 360° view of the environment around the light source. Since depth values are stored according to the paraboloid scene representation in the texture, texture coordinates need to be computed appropriately.

The original DPSM algorithm suffers from a tessellation problem, as only the vertices are projected by the paraboloid projection while the primitives are rasterized linearly. Most of these issues can be solved through the approach shown by Osman et al. [4]: the tessellation problem for shadow receivers can be solved by computing the paraboloid shadow map coordinates per pixel, and the shadow casters can be adaptively tessellated based on triangle size and distance from the light source using a hardware tessellation unit.
Fig. 2. Small outdoor scene with hemispherical light source and 1024×1024 paraboloid shadow map (Left). Same scene with improved paraboloid map of the same size (Right). One should note the improvement of the shadow through a transparent fence texture.
3 High-Quality Shadows from Point Light Sources
Our solution extends the idea of the dual-paraboloid shadow maps (DPSM) [3] combined with some improvements proposed by Osman et al. [4], specifically the proper choice of the splitting plane, tessellation of shadow casters, and per-pixel computation of paraboloid coordinates. However, one serious drawback of the original dual-paraboloid shadow maps remains: the DPSM algorithm is based on a uniform distribution of shadow samples over the paraboloid, resulting in sampling and aliasing artifacts, especially when the viewer is far from the light position. In the following section, we present a solution which can notably reduce these aliasing artifacts.

The proposed idea is to adjust the sampling density on the paraboloid in order to increase the sampling density in the parts of the scene that are close to the viewer. The approach is based on the fact that it is not necessary to sample the whole paraboloid; some parts of the paraboloid can be completely excluded from sampling in the shadow texture. Also, there is no need for expensive two-pass rendering into both paraboloids when the light source lies outside the view frustum, since a single paraboloid is sufficient to cover the whole view frustum with a single shadow map. However, this optimization works only when the light lies outside the view frustum; when the light is positioned inside, the algorithm falls back to the regular DPSM.

3.1 Improved Paraboloid Shadow Maps
The original DPSM algorithm uses two paraboloids (front and back) in order to capture the scene environment around the light source from both directions. Each paraboloid is sampled uniformly and the rotation of the paraboloids remains fixed. The environment is captured in two depth textures. Shadow map generation is then done by calculating the paraboloid projection, and such a projection can be performed in a vertex shader [3], for example:

// transform the vertex into the light's eye space
vec4 vertexEyeSpace = in_LightModelViewMatrix * vec4(in_Vertex, 1.0);
// project onto the unit paraboloid: normalize, shift along z and divide
vertexEyeSpace.xyz = normalize(vertexEyeSpace.xyz);
vertexEyeSpace.z  += 1.0;
vertexEyeSpace.xy /= vertexEyeSpace.z;
In order to achieve better sampling of the shadow map, we must ensure that rendering from the light source's point of view with the parabolic projection maximally covers the objects lying inside the camera view frustum (it is not necessary to sample the scene parts that receive shadows outside the view frustum). Therefore, the algorithm consists of four main steps:

1. Locate clipping planes for the view frustum to mark the boundary where all potential shadow receivers are present, forming a truncated view frustum.
2. Determine an optimal field of view for the scene rendered from the light source to cover the whole truncated view frustum.
3. Find the directional vector which sets the rotation of the paraboloid.
4. Perform a suitable cut on the paraboloid to constrain shadow map rendering only to the selected part of the paraboloid.

These steps lead to an improvement of the paraboloid shadow maps, so we will refer to them as improved paraboloid shadow maps (IPSM) in the following text. Because rotating the front paraboloid will cover most of the view frustum, the back paraboloid is needed only when the light lies inside the view frustum; in that case, shadows must cover the entire environment around the view frustum and the parameterization converges to the standard DPSM. Otherwise, we can save one rendering pass and speed up rendering.

3.2 Determining Optimal Coverage
In 3D space, an area illuminated by a point light with a certain field-of-view and direction has the shape of a cone, as seen in Figure 3 (right); we will refer to it as the light cone below. In order to optimally cover the view frustum with the light cone, we first need to locate the boundaries around the visible objects in the view frustum, so that shadow sampling will be performed only within these boundaries. We will refer to them as the minimal-depth and maximal-depth clipping planes. In order to determine the boundaries, we store the z-values of all transformed and clipped vertices in eye space to obtain the minimum and maximum z-values. In the following, we will refer to the view frustum with maximum/minimum depth boundaries as the truncated view frustum (see Figure 3, left). In order to find the optimal coverage and field-of-view, we suggest the following method. Prior to starting the calculation, we obtain the light position L and the positions of the eight frustum border points (FBPs) on the minimal (N1..4) and maximal (F1..4) depth clipping planes of the truncated view frustum, as seen in Figure 3 (left). The algorithm computes the optimal field-of-view Fv and the direction vector of the light cone Cd.
Fig. 3. Illustration of optimal coverage of the truncated view frustum with the light cone
First, the normalized sum of the vectors Dj from the light source to the FBPs is computed (Eq. 1). This expresses the average direction C̄d from the light source L to the truncated view frustum:

\bar{C}_d = \mathrm{norm}\Big(\sum_j D_j\Big) \qquad (1)

In the next step, we iterate through all vectors Dj in order to obtain the maximal angle between the FBPs and the average direction C̄d (Eq. 2):

F_v = \max_{1 \le j \le 8} \angle(D_j, \bar{C}_d) \qquad (2)
This maximal angle defines the field-of-view Fv of the light cone. Since all frustum border points are contained in the light cone, the whole truncated view frustum will also be covered by the light cone (Figure 3, right). This evaluation is performed only once per frame, so it does not have a crucial impact on performance. It should be noted that this is a numerical solution and thus not necessarily optimal.

3.3 Paraboloid Cutting and Rotation
After obtaining the field of view Fv and the average direction vector C̄d using the steps above, the light cone can be defined so that it covers the truncated view frustum. Since a single light paraboloid has a 180° field-of-view, we suggest a cutting scheme that is used for smaller angles. We exploit the parameterization for DPSM (see Section 3.1) to find the zoom factor. We set up an auxiliary vector whose direction is derived from the field-of-view Fv, and then let this vector be processed in the same manner as the transformed vertex positions (vertexEyeSpace) in the vertex shader. This processing can be done in 2D because of the symmetry of the paraboloid. The resulting x-axis coordinate expresses the zoom factor Z, which is applied to the computed coordinates in the vertex shader: vertexEyeSpace.xy *= 1.0 / Z.
High-Quality Shadows with Improved Paraboloid Mapping
-Z
L
427
Z
field of view
shadow texture res res
Fig. 4. Paraboloid cut to constrain shadow map sampling to certain parts of the paraboloid
This operation causes the paraboloid to be cut at the point (Z, f(Z)), which precisely creates the desired field-of-view Fv (see Figure 4), while the resolution of the shadow map texture on the output remains the same.
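The per-frame CPU work of Sections 3.2 and 3.3 reduces to a few vector operations. The following sketch is our own illustration rather than the authors' code; it assumes the eight frustum border points are already available in world space and that the paraboloid axis is aligned with the light-space z-axis, so that a boundary direction at angle Fv projects to x = tan(Fv/2).

import numpy as np

def light_cone(light_pos, frustum_border_points):
    """Average direction C_d (Eq. 1) and field-of-view F_v (Eq. 2) covering the 8 FBPs."""
    dirs = np.asarray(frustum_border_points, float) - np.asarray(light_pos, float)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)              # normalized vectors D_j
    c_d = dirs.sum(axis=0)
    c_d /= np.linalg.norm(c_d)                                       # Eq. (1)
    fov = float(np.max(np.arccos(np.clip(dirs @ c_d, -1.0, 1.0))))   # Eq. (2): maximal angle
    return c_d, fov

def zoom_factor(fov):
    """Process a boundary direction exactly like the vertex shader; its x-coordinate is Z."""
    v = np.array([np.sin(fov), 0.0, np.cos(fov)])   # direction at the edge of the light cone
    v /= np.linalg.norm(v)
    v[2] += 1.0                                     # paraboloid mapping, as in Section 3.1
    return v[0] / v[2]                              # equals tan(fov / 2)

# In the shader, the projected coordinates are then scaled:  vertexEyeSpace.xy *= 1.0 / Z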
4 Implementation
The implementation was done using an OpenGL 4 core profile in our own framework with the GLSL shading language. The scenes were rendered at a resolution of 1600 × 1200 on a PC with an Intel Core i5 at 3.33 GHz, 4 GB RAM, an AMD Radeon 5970 and Windows 7 32-bit. We used three demonstration scenes: the Sibenik cathedral (70k polygons), a car (160k polygons) and trees (150k polygons). The purpose of the scenes was to demonstrate various light source types (see Figure 5).

The shadow maps for the front and back paraboloids are stored in a texture array for easy access and indexing of both textures. In order to avoid artifacts when rendering a sparsely tessellated mesh into the shadow map with the parabolic projection, a hardware tessellation unit is used. The tessellation factor is based on the distance of the tessellated polygon from the edge of the paraboloid, where distortion can occur. New vertices generated by the tessellator are parabolically projected in the tessellation evaluation shader.

In order to compute the minimal/maximal depth clipping planes, we need the z-values of the transformed vertices, so the scene's normal and depth values are rendered into a floating-point render target with an attached texture. This process is common in graphics engines using deferred shading and can easily be added to the scene rendering pipeline. The texture is scaled down on the GPU to a small size (e.g. 32²) to reduce data transfer (we do not need per-pixel accuracy) and is then transferred from the GPU to the CPU. The stored buffer is analyzed and the minimum, maximum and average depth values are found. Because of the scaling on the GPU and the transfer of only a small block of data to the CPU, the effect on performance is small (as will be seen in the next section).
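A sketch of the CPU side of this depth-range search, assuming the downscaled block has already been read back as linear eye-space depths (the background sentinel value is our assumption):

import numpy as np

def depth_clipping_planes(depth_block, background=np.inf):
    """Minimal and maximal eye-space depth of the visible geometry."""
    d = np.asarray(depth_block, dtype=float).ravel()
    d = d[d < background]            # discard texels that hit no geometry
    if d.size == 0:
        return None                  # nothing visible: fall back to the full frustum
    return float(d.min()), float(d.max())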
Fig. 5. Examples of various light source types with IPSM. From the left: omnidirectional, hemispherical and directional.
Shaders from the original DPSM can be left almost intact and our solution can be easily implemented into any graphics engine using deferred shading techniques without major changes against the original DPSM.
5 Experimental Results and Discussion
As can be seen in Figures 1, 2, 6 and 7, improved paraboloid shadow maps (IPSM) can significantly improve the visual appearance of the shadows in the scene without the need for large shadow maps. For all of our test scenes, a resolution of 1024 × 1024 was sufficient. When the light lies outside the view frustum, rotating and cutting the paraboloid produces far better results than the original DPSM or uniform shadow maps (Figures 1, 2 and 7) because the optimal paraboloid rotation covers all objects lying inside the view frustum. Another interesting effect is that the closer the camera gets to the shadow of an object, the more detailed the shadow appears, as a result of the small difference between the minimal and maximal depth clipping planes and the correspondingly dense sampling of this small part of the view frustum (Figure 2). It is also possible to use IPSM in outdoor scenes (Figure 7) with a distant light, which in this case acts as a directional light.

However, when the light lies inside the view frustum, only the paraboloid rotation can be used, because we cannot cut the paraboloid without losing important shadow information. In this situation the field-of-view is always 180° and our solution is equal to the original DPSM (Figure 5, left). The worst case for our algorithm is therefore a camera near the ground looking towards a distant light with an object casting a shadow towards the camera; such a shadow will have poor quality, but this is also an issue of the original DPSM.

Although additional calculations need to be performed compared to the original DPSM, all of them are carried out once per frame, so the performance impact is very low and close to the border of measurement error, as can be seen in Table 1. Compared to uniform shadow maps, DPSM and IPSM are slower because of the higher computational complexity of the parabolic projection. On the other hand, we cannot use a single uniform shadow map to capture the whole
Table 1. Performance of the different shadow approaches (in frames per second), measured on all demonstration scenes with an AMD Radeon 5970

                     Sibenik          Car              Trees
Shadow resolution    512²    2048²    512²    2048²    512²    2048²
Uniform SM           180     140      162     142      217     197
DPSM                 102      78       97      84      135     123
IPSM                  95      76      107      95      148     134
Fig. 6. This image sequence shows differences between shadow methods. From left to right: uniform shadow maps, DPSM and IPSM. The shadow resolution is 512 × 512.
Fig. 7. Example of the large outdoor scene with the uniform shadow maps (Left) with close-up on a single tree (Middle) and IPSM (Right). In this case, paraboloid projection is similar to a focused parallel shadow map and is also suitable for directional lights without any changes. The shadow map resolution is 1024 × 1024.
environment. IPSM can in some cases be faster than the original DPSM because, when the light lies outside the view frustum, the back paraboloid can be omitted, as the front paraboloid covers all visible objects in the view frustum by itself (Figure 3). Otherwise, IPSM is slower due to the additional computations (mainly determining the minimal and maximal depth values), but this performance hit is not significant.
6 Conclusion
This contribution introduces an improvement to existing paraboloid shadow maps based on the dense sampling of the important areas in the view frustum. Our solution (IPSM) can in most cases provide accurate, high quality shadows
regardless of the light source type and its position, without any performance penalty compared to the original paraboloid shadow maps. The solution is suitable for dynamic indoor as well as outdoor scenes of high geometric complexity where sharp, high-quality shadows are needed with various light source types. It is worth mentioning that because IPSM evolved from the standard DPSM, it should be possible to implement our method in any modern graphics engine without much effort.

Future work should include further improvements, such as handling the situation where the light source lies inside the view frustum and we need to effectively sample the areas close to the viewer. We also want to exploit the capabilities of modern graphics accelerators and geometry shaders to render both shadow textures at once in a single pass, thus speeding up rendering, and to simulate real light sources by blurring shadows based on the distance from such light sources.

Acknowledgement. This work was supported by the Ministry of Education, Youth and Sport of the Czech Republic under the research program LC06008 (Center of Computer Graphics). The Sibenik cathedral model is courtesy of Marko Dabrovic.
References 1. Crow, F.C.: Shadow algorithms for computer graphics. SIGGRAPH Comput. Graph. 11, 242–248 (1977) 2. Williams, L.: Casting curved shadows on curved surfaces. SIGGRAPH Comput. Graph. 12, 270–274 (1978) 3. Brabec, S., Annen, T., Seidel, H.P.: Shadow mapping for hemispherical and omnidirectional light sources. In: Proceedings of Computer Graphics International, pp. 397–408 (2002) 4. Osman, B., Bukowski, M., McEvoy, C.: Practical implementation of dual paraboloid shadow maps. In: Proceedings of the 2006 ACM SIGGRAPH Symposium on Videogames, pp. 103–106. ACM, New York (2006) 5. Stich, M., W¨ achter, C., Keller, A.: Efficient and Robust Shadow Volumes Using Hierarchical Occlusion Culling and Geometry Shaders. In: GPU Gems 3. AddisonWesley Professional, Reading (2007) 6. Fernando, R., Fernandez, S., Bala, K., Greenberg, D.P.: Adaptive shadow maps. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Technique, pp. 387–390. ACM, New York (2001) 7. Stamminger, M., Drettakis, G.: Perspective shadow maps. In: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pp. 557–562. ACM, New York (2002) 8. Wimmer, M., Scherzer, D., Purgathofer, W.: Light space perspective shadow maps. In: The Eurographics Symposium on Rendering (2004) 9. Zhang, F., Sun, H., Xu, L., Lun, L.K.: Parallel-split shadow maps for large-scale virtual environments. In: Proceedings of the 2006 ACM International Conference on Virtual Reality Continuum and its Applications, pp. 311–318. ACM, New York (2006)
Terramechanics Based Terrain Deformation for Real-Time Off-Road Vehicle Simulation

Ying Zhu, Xiao Chen, and G. Scott Owen

Department of Computer Science, Georgia State University, Atlanta, USA
Abstract. When ground vehicles navigate on soft terrains such as sand, snow and mud, they often leave distinctive tracks. The realistic simulation of such vehicle-terrain interaction is important for ground based visual simulations and many video games. However, the existing research in terrain deformation has not addressed this issue effectively. In this paper, we present a new terrain deformation algorithm for simulating vehicle-terrain interaction in real time. The algorithm is based on the classic terramechanics theories, and calculates terrain deformation according to the vehicle load, velocity, tire size, and soil concentration. As a result, this algorithm can simulate different vehicle tracks on different types of terrains with different vehicle properties. We demonstrate our algorithm by vehicle tracks on soft terrain.
1 Introduction
Terrain is a critical part of many 3D graphics applications, such as ground based training and simulation, civil engineering simulations, scientific visualizations, and games. When ground vehicles navigate on soft terrains such as sand, snow and mud, they often leave distinctive tracks. Simulating such vehicle tracks is not only important for visual realism, but also useful for creating realistic vehicle-terrain interactions. For example, the video game "MX vs. ATV Reflex", released in December 2009, features dynamically generated motorcycle tracks as a main selling point. This kind of simulation, often called dynamic terrain or deformable terrain, requires that the 3D terrain models be deformed during run time, which is more complex than displaying static terrains. Deformable terrain introduces a number of challenging problems. First, the terrain deformation must be visually and physically realistic. The deformations should vary based on the vehicle's load, as well as different soil types. Second, proper lighting and texture mapping may need to be applied dynamically to display tire marks. Third, the resolution of the terrain mesh around the vehicle may need to be dynamically increased to achieve better visual realism, a level of detail (LOD) problem. Fourth, for large scale terrain that is constantly paged in and out of the memory, the dynamically created vehicle tracks need to be "remembered" and synchronized across the terrain databases. For a networked environment, this may lead to bandwidth and latency issues. Fifth, the simulated vehicle should react properly to the dynamically deformed terrain.
In this paper we present our attempt to address the first two problems: the dynamic deformation of vehicle tracks and the dynamic display of tire marks. (In our previous work [11], we attempted to address the third problem of terrain LOD.) Our method is based on the classic terramechanics theories [13]. Specifically, our terrain deformation method deals with two problems: vertical displacement (compression) and lateral displacement. The vertical displacement depends on the vehicle load and the soil concentration, and the lateral displacement depends on the contact area and the duration of loading. In Section 2 we discuss the related work; in Section 3 we discuss the theory of dynamic terrain deformation; in Section 4 we discuss the implementation and results; and Section 5 concludes the paper and outlines future work.
2 Related Work
Most of the existing research on terrain focuses on static terrains [1-5], while the research on deformable terrains is still in its early stage. In this review, we focus on the latter. We classify terrain deformation approaches into two categories: physics-based and appearance-based approaches.

In physics-based solutions, the simulation model is often derived from soil mechanics and geotechnical engineering. The result is more physically realistic but often comes with high complexity and computational cost. Li and Moshell [6] present a simplified computational model for simulating soil slippage and soil mass displacement. Their method, based on the Mohr-Coulomb theory, simulates erosion of soil as it moves along a failure plane until reaching a state of stability. It is suitable for simulating the pushing, piling, and excavation of soil, but not for simulating vehicle tracks because it does not address soil compression. Chanclou et al. [7] modeled the terrain surface as a generic elastic string-mass model. Although a physics-based approach, this model does not deal with real-world soil properties. Pan et al. [14] developed a vehicle-terrain interaction system for vehicle mobility analysis and driving simulation. In this system, the soil model is based on the Bekker-Wong approach [13] with parameters from a high-fidelity finite element model and test data. However, with few visual demonstrations, it is unclear how well this system performs in real time or whether it can generate visually realistic vehicle tracks. Our model, also based on the Bekker model, is less sophisticated than the above system because we focus more on real-time performance and visual realism than on engineering correctness.

Appearance-based methods attempt to create convincing visuals without a physics-based model. Sumner et al. presented an appearance-based method for animating sand, mud, and snow [8]. Their method can simulate simple soil compression and lateral displacement. However, because the system lacks a terramechanics-based model, the simulated tracks do not reflect changes in vehicle load and soil properties. Onoue and Nishita [9] improved upon the work of Sumner et al. by incorporating a Height Span Map, which allows granular materials to be piled on top of objects. In addition, the direction of the impacting forces is taken into consideration during the displacement step. However, this method has the same limitations as the previous approach [8]. Zeng et al. [15] further improved on Onoue and Nishita's work by introducing a momentum-based deformation model, which is partly based on Li and Moshell's work [6]. Thus the work by Zeng et al. [15] is a hybrid of appearance- and physics-based approaches.
Our approach is also a hybrid approach that tries to balance performance, visual realism, and physical realism for training simulations and video games. Our goal is to create more "terramechanically correct" deformations based on vehicle load and soil types. For example, in our approach, the lateral displacement is based on the terrain material, and the pressure applied to the deformable terrain is calculated based on vehicle load and soil type.

None of the above approaches addresses the scalability issue: these simulations are confined to a rather small, high-resolution terrain area. In order to simulate vehicle-terrain interaction on large-scale terrain, the terrain deformation must be supplemented by a dynamic multiple level of detail (LOD) solution. There have been a number of attempts to address this issue. For example, He et al. [10] developed a Dynamic Extension of Resolution (DEXTER) technique to increase the resolution of the terrain mesh around the moving vehicle. DEXTER is based on the traditional multi-resolution technique ROAM [1] but provides better support for mesh deformations. However, this solution is CPU-bound and does not take advantage of the GPU. To address this issue, Aquilio et al. [11] proposed a Dynamically Divisible Regions (DDR) technique to handle both terrain LOD and deformation on the GPU.
3 Dynamic Terrain Deformation
Soil mechanics is complicated and difficult to model. In this simulation, we simplify the soil mechanics of vehicle-terrain interaction into two main components: soil compression (vertical sinkage under the normal pressure) and lateral displacement (i.e. the side pushing effect). We also have to make a number of assumptions. First, we assume the contact area of a pneumatic tire to be of elliptic shape. Second, we assume that the vehicle weight is evenly distributed across this elliptic area. Third, we model soil as a two-layered entity, with one visible, plastic layer on top and one invisible, elastic layer beneath it. All the above assumptions are based on the classic theories of land locomotion [13].
Fig. 1. Terrain deformation on different layers
Fig. 2. Normal pressure distribution. The curves represent equal pressure lines.
When the vehicle interacts with the soil, the plastic layer is deformed. The amount of vertical sinkage is a function of the resistant forces from the elastic layer beneath it.
The soil elasticity is different for different types of soil and for the same soil under different climate conditions. Fig. 1 illustrates this process. The lateral displacement is the result of the force at the edge of the tire-soil contact area that pushes the top layer slightly upwards and sideways. The amount of lateral displacement, based on empirical terramechanics data, is a function of the contact area and the soil type. The above calculations, implemented in a vertex shader, determine the amount of vertical and lateral displacement of the vertices in contact with the vehicle tires. However, they do not by themselves produce distinctive tire marks; to achieve this, displacement mapping is performed on top of the already deformed terrain mesh.

3.1 A Terramechanics Model for Terrain Compression
In our system, the terrain contains two horizontal layers, with a plastic layer on top and an elastic layer beneath it. Based on terramechanics theory, the soil mass located in the immediate vicinity of the contact area (called the disturbed zone) does not behave elastically; therefore it is modeled as a sheet of plastic material. The underlying elastic layers are simulated as traditional spring-mass models. When pressure is applied to these elastic layers, the springs deform and generate forces to balance the pressure. The plastic layer deforms along with the underlying elastic layers. Once equilibrium is reached, the deformation of the top plastic layer becomes permanent. In our system, we use only one elastic layer for simplicity and efficiency (see Fig. 1).

Fig. 2 illustrates the normal pressure distribution under the tires, showing lines of equal pressure. During vehicle-terrain interaction, different points on an elastic layer receive different pressures and therefore deform by different amounts, thus creating the visual form of a vehicle track. From this illustration, we see that at a depth of 2b (the width of the tire), the load pressure is reduced by about 50% and almost vanishes at a depth of 4b (twice the width of the tire). Therefore the height of the elastic layer is set to 4b. As mentioned above, we assume that the tire-terrain contact area is an elliptic shape. We derive Eq. 1 from Bekker's classic terramechanics theory [13] to calculate the normal pressure on a point within an elliptic shape:
\sigma_z = p_0 \left[ 1 - \frac{z^3}{(r_0^2 + z^2)^{3/2}} \right] \qquad (1)
In Eq. 1, (x, y, z) are the coordinates of the point; p0 is the unit uniform load, i.e. the load acting upon an element of the surface; and r0 is the radius of the elliptic shape. Eq. 1 provides a good theoretical basis for calculating terrain deformation, but it is not practical for real-time simulation. Also, Eq. 1 does not contain the soil concentration factor which we need for different kinds of terrains. Since our simulation only uses one plastic layer and one elastic layer, we have developed Eq. 2, which is based on Eq. 1 but simplified for better performance:
\sigma_z = \cdots \qquad (2)
Eq. 2 addresses the stress distribution for a concentrated load. Specifically, it calculates the normal pressure at any given point on a horizontal elastic layer located at depth z. Eq. 2 is written in polar coordinates, in which σz is the horizontal stress; we add W, which is the weight of the vehicle, and r = …; R is the distance from the point to the center of the contact area; m is a variable. A soil concentration factor n is introduced to simulate different soil types: for grass, n = 1; for mud, n = 3; and for loose sand, n = 6 or n = 7. Snow often displays the character of a plastic layer and thus needs a different treatment [13]. Fig. 4 shows the parameters in Eq. 2. The results of our Eq. 2 (see Fig. 3) fit the experimental data in Krotkov [17].
Grass
10000 8000 6000 4000 2000 0
Mud Sand Snow 0
1
2
3
4
5
6 Sinkage (cm)
Fig. 3. Terrain Stiffness Comparison
3.2 Calculating Terrain Deformation
In the previous section, we discussed the method of calculating the forces applied by the vehicle to the terrain. In this section, we discuss how to simulate terrain sinkage (deformation) based on these forces.
Fig. 4. Normal pressure on a given point in soil
Fig. 5. Simulating terrain deformation with a spring-mass model (vehicle weight forces are balanced by forces generated by the springs)
The depth of terrain deformation is a function of the external force (from the vehicle) and terrain elasticity. Under certain circumstances, the terrain deformation can be simulated with a traditional spring-mass model (Fig.5). Based on Wong [16], terrain can be seen as a piecewise, linear elastic material, and Hooke’s Law applies to
linear elastic material as long as the load does not exceed the elastic limit. Therefore, we may assume that each vertex on the terrain model is attached to a spring. When a tire contacts a vertex, the attached spring deforms and generates a balancing force. The amount of deformation depends on how much force is needed to counter-balance the force from the vehicle. In reality, terrain does not behave as a simple, linear spring-mass model: the stiffness of the terrain usually increases with the external force, so the amount of terrain deformation decreases as the external force increases. Therefore, we adopt a modified spring-mass model for terrain deformation (Eq. 3). This model is based on Krotkov's model [17], which is supported by lab experiments and describes the relationship between normal force and displacement for three types of terrain: sand, soil, and sawdust.

x = (f / k)^(1/n)    (3)

In Eq. 3, x is the amount of terrain deformation and f represents the normal, external force on the terrain; this force can be calculated from Eq. 2, depending on the terrain type. Both k and n are constants specific to different materials. When n = 1, Eq. 3 reduces to Hooke's law. The values used in Eq. 3 can be found in Table 1. Terrain deformation is carried out in a vertex shader in two steps. First, the terrain is deformed with a height map of the tire tread pattern (Fig. 6-B); this deformation generates the base shape of the tire tracks. Second, the depth of the base tire track is adjusted according to Eq. 3.
3.3 Lateral Displacement
The sinkage of the tire-soil contact area is due to two factors: soil compression, and lateral and upward displacement of the soil particles around the border of the contact area. The lateral displacement is influenced by both the contact area and the duration of loading. Experiments show that: (1) the smaller the contact area, the stronger the effect of the lateral displacement; and (2) the longer the duration of loading, the stronger the effect of the lateral displacement [13]. Based on classic terramechanics theory [13], the lateral settlement can be calculated with Eq. 4.
∆zs = σz h / E    (4)
Here ∆zs is the terrain settlement, E is the modulus of soil elasticity, and h is the ground layer thickness. However, using this equation in real-time applications is impractical, and it lacks the soil concentration factor that we use to differentiate the four types of terrain. Instead, we propose a new heuristic model, Eq. 5, to simulate the terrain settlement in a real-time graphics program.
∆zs = F / (2n k d^a)    (5)
Here, F is the normal force on the terrain; k is the spring constant; d is the distance from the border of the sinkage; n is the soil concentration factor introduced in Section 3.1; and a is a variable. Since experiments show that, within a certain threshold, the smaller the loaded area the larger the lateral settlement, the value of a is set to a different number for each type of terrain. Eq. 5 is simpler than Eq. 4 and is designed to fit the data curve of Eq. 4 at critical points.
4 Implementation and Experiment
Algorithm 1 and Fig. 6 describe the algorithm and process of our simulation. We implemented our algorithm using OpenGL/GLSL on a PC with an Intel Core i7 Q740 1.73 GHz CPU, 4 GB RAM, and an NVIDIA GeForce 425M GPU with 1 GB RAM. The rendered scene resolution was 1200×1200 and the average frame rate was 120 frames per second. We simulate four types of terrain: grass, mud, sand, and snow. Fig. 7 and Fig. 8 show the results of our simulation.
Fig. 6. Illustration of the simulation process: textures A–D are prepared on the CPU and processed by the GPU vertex and fragment shaders to produce the rendering results
Algorithm 1. Creating tire tracks
1: Process the mesh and textures on the CPU.
2: Send all the vertices and textures from the CPU to the GPU. Four textures are sent to the shader: one terrain texture image (Fig. 6-A); one height map used for track deformation; another height map for lateral displacement; and a texture image for the tire mark (Fig. 6-D).
3: In the vertex shader, the displacement mapping technique is used to deform the terrain. A height map (Fig. 6-B) is used to generate the tire tread patterns. The height values from the height map are blended with the terrain sinkage factor, calculated from Eq. 2 and Eq. 3, to generate the vertex displacement value.
4: The lateral settlement is added through an additional displacement mapping step, in which a lateral displacement height map (Fig. 6-C) is used. The height values from this height map are then blended with the lateral settlement calculated from Eq. 5. As a result, the lateral displacement is a function of tire width, vehicle load, and type of terrain.
5: In the fragment shader, both the terrain image and the tire track image are applied through multi-texturing.
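The CPU-side sketch below mirrors the blending performed in steps 3–4 of Algorithm 1, but it is not the authors' shader: a tread height map is mixed with a sinkage value and subtracted from the terrain height field inside the tire footprint. The array names and the blending weight are illustrative assumptions.

```python
import numpy as np

def apply_tire_track(terrain_h, tread_h, footprint_mask, sinkage, blend=0.5):
    """CPU sketch of the vertex-shader displacement in Algorithm 1 (steps 3-4).

    terrain_h      : 2D array of terrain heights
    tread_h        : 2D array (same shape) holding the tire-tread height map, in [0, 1]
    footprint_mask : boolean 2D array marking vertices under the tire
    sinkage        : scalar track depth computed from Eq. 2 and Eq. 3
    blend          : illustrative weight mixing tread detail with the sinkage depth
    """
    displaced = terrain_h.copy()
    # Blend the tread pattern with the sinkage factor, then push the vertices down.
    track_depth = sinkage * (blend + (1.0 - blend) * tread_h)
    displaced[footprint_mask] -= track_depth[footprint_mask]
    return displaced

if __name__ == "__main__":
    h = np.zeros((64, 64))
    tread = np.tile(np.linspace(0.0, 1.0, 8), (64, 8))          # toy tread pattern
    mask = np.zeros_like(h, dtype=bool); mask[28:36, :] = True  # toy tire footprint
    print(apply_tire_track(h, tread, mask, sinkage=0.8).min())
```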
The height maps Fig. 6-B and Fig. 6-C and the texture map Fig. 6-D are retrieved from a pre-modeled track mesh created in a 3D modeling tool. Since Fig. 6-B, Fig. 6-C, and Fig. 6-D are based on the same track model, the texture of Fig. 6-D closely matches the tread pattern of Fig. 6-B.
Table 1. Simulation parameters
Terrain   Spring Constant k   n      W      Input Primitives   Frame Rate
Grass     1/0.08              1      3000   15028              125
Mud       1/0.09              3      3000   15028              125
Sand      1/0.11              7      3000   15028              124
Snow      1/0.14              none   3000   15028              123
Table 1 shows our simulation parameters, which include the spring constant k, the constant n for the different terrains, and the weight of the load W. Note that, as mentioned above, snow often displays the character of a plastic layer and thus uses a different equation than Eq. 2. The values of the constant n are based on Bekker [13]. Krotkov [17] does not give specific numbers for the spring constants k, so the numbers in Table 1 are based on our own experience and testing. The spring constants of grass and mud are relatively close. Grass has the largest spring constant among the four types of terrain, and thus the sinkage on grass is the smallest. The snow spring constant is based on Krotkov's experiments on sawdust, which is close to snow in terms of deformability. Snow therefore has the smallest spring constant, and tracks on snow are the deepest, which matches real-world observation. We use the same weight load on all four types of terrain. Each type of terrain deforms differently based on the soil concentration factor described in Section 3. Fig. 7 shows our simulation in wireframe mode and Fig. 8 shows it with texture mapping. The deformation of the terrain is calculated by our model, and since the spring constants of the terrain types differ, the depths of the tracks differ: the track on grass is the shallowest and the track on snow is the deepest. Specifically, Depth_grass < Depth_mud < Depth_sand < Depth_snow.
Fig. 7. Screenshots of our simulation in wireframe mode: (from left to right column) grass, mud, sand and snow
Fig. 8. Screenshots of our simulation with texture mapping: (from left column to right column) grass, mud, sand and snow
5 Conclusion and Future Work
We have presented a new real-time terrain deformation method, specifically for simulating vehicle tracks on soft terrain. Our method is based on classic terramechanics theory and can simulate realistic vehicle tracks on different types of terrain. We will continue to refine our method by adding bump mapping and by integrating terrain deformation with terrain LOD techniques. We are also working on new dynamic terrain mesh generation and tessellation methods in the geometry shader, which we will apply to large-scale terrain rendering. In the near future, we plan to port our system to OpenSceneGraph for better support of large-scale terrain rendering.
References
1. Duchaineau, M., Wolinsky, M., Sigeti, D.E., Miller, M.C., Aldrich, C., Mineev-Weinstein, M.B.: ROAMing terrain: real-time optimally adapting meshes. In: Proceedings of Visualization 1997, pp. 81–88 (1997)
2. Lindstrom, P., Koller, D., Ribarsky, W., Hodges, L.F., Faust, N., Turner, G.A.: Real-time, continuous level of detail rendering of height fields. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 109–118 (1996)
3. Lindstrom, P., Pascucci, V.: Visualization of large terrains made easy. In: Proceedings of the Conference on Visualization 2001, pp. 363–371 (2001)
4. Losasso, F., Hoppe, H.: Geometry clipmaps: terrain rendering using nested regular grids. ACM Transactions on Graphics 23, 769–776 (2004)
5. Hoppe, H.: Smooth view-dependent level-of-detail control and its application to terrain rendering. In: Proceedings of the Conference on Visualization 1998, pp. 35–42 (1998)
6. Li, X., Moshell, J.M.: Modeling soil: realtime dynamic models for soil slippage and manipulation. In: Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, pp. 361–368 (1993)
7. Chanclou, B., Luciani, A., Habibi, A.: Physical models of loose soils dynamically marked by a moving object. In: Proceedings of the 9th IEEE Computer Animation Conference, pp. 27–35 (1996)
8. Sumner, R.W., O'Brien, J.F., Hodgins, J.K.: Animating sand, mud, and snow. Computer Graphics Forum 18, 17–26 (1999)
9. Onoue, K., Nishita, T.: An interactive deformation system for granular material. Computer Graphics Forum 24, 51–60 (2005)
10. He, Y.: Real-time visualization of dynamic terrain for ground vehicle simulation. PhD Thesis, University of Iowa (2000)
11. Aquilio, A.S., Brooks, J.C., Zhu, Y., Owen, G.S.: Real-time GPU-based simulation of dynamic terrain. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Meenakshisundaram, G., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4291, pp. 891–900. Springer, Heidelberg (2006)
12. Asirvatham, A., Hoppe, H.: Terrain rendering using GPU-based geometry clipmaps. In: Pharr, M., Fernando, R. (eds.) GPU Gems 2. Addison-Wesley, Reading (2005)
13. Bekker, G.: Theory of Land Locomotion. University of Michigan Press, Ann Arbor (1956)
14. Pan, W., Yiannis, P., He, Y.: A vehicle-terrain system modeling and simulation approach to mobility analysis of vehicles on soft terrain. In: Proceedings of SPIE Unmanned Ground Vehicle Technology VI, vol. 5422, pp. 520–531 (2004)
15. Zeng, Y., Tan, C.I., Tai, W., Yang, M., Chiang, C., Chang, C.: A momentum-based deformation system for granular material. Computer Animation and Virtual Worlds 18(4-5), 289–300 (2007)
16. Wong, J.Y.: Theory of Ground Vehicles, 3rd edn. Wiley Interscience, Hoboken (2001)
17. Krotkov, E.: Active perception for legged locomotion: every step is an experiment. In: Proceedings of the 5th IEEE International Symposium on Intelligent Control, vol. 1, pp. 227–232 (1990)
An Approach to Point Based Approximate Color Bleeding with Volumes Christopher J. Gibson and Zoë J. Wood California Polytechnic State University {cgibson,zwood}@calpoly.edu Abstract. Achieving realistic or believable global illumination in scenes with participating media is expensive. Light interacts with the particles of a volume, creating complex radiance patterns. This paper introduces an explicit extension to the commonly used point-based color bleeding technique which allows fast, believable in- and out-scattering effects, building on existing data structures and paradigms. The proposed method achieves results comparable to those of existing Monte Carlo integration methods, that is, realistic-looking renders of scenes which include volume data elements, obtaining render speeds between 10 and 36 times faster while keeping memory overhead under 5%.
1 Introduction
The ability to render scenes with realistic lighting is desirable for many entertainment applications such as film. Great results have been achieved for large scenes using point-based color bleeding [1] algorithms. In their basic form, these algorithms tend to limit or omit the lighting contribution from volumetric data or participating media within the scene. This paper presents an explicit extension to the point-based color bleeding (PCB) algorithm and a data representation tuned to volumes, the light voxel (or lvoxel), to address the need to represent participating media while leveraging a point cloud representation of a scene. The proposed method achieves results comparable to those produced with Monte Carlo ray tracing [2], that is, images that include color bleeding from volume elements, but with drastically reduced run times, speeding up renders by around 10 to 36 times. Figure 3 illustrates a comparison of our algorithm and Monte Carlo ray traced results.
1.1 Background
The goal of the proposed method is to include volumetric representations in a global illumination algorithm in a fast and coherent way. One of the unique features of participating media is that they must be represented with a more complex data structure than solid geometric objects, which are usually polygonalized in most rendering processes. Light interacts with the particles of a volume, creating complex radiance patterns (increasing the necessary computational complexity exponentially). The most fundamental concepts are presented here, based on [3].
In contrast to polygonal models, as light passes through a participating medium it may be absorbed based on the absorption probability density σa, scattered based on the scatter probability density σs, or simply pass through the volume. Both absorption and scatter-out reduce the energy of a ray passing through a volume, lowering its transmittance. This can be combined into the following representation: σt(p, w) = σa(p, w) + σs(p, w). Then, integrating the extinction of the volume over a ray of length d gives us the transmittance in equation (1.1):

Tr(p → p′) = e^( −∫₀ᵈ σt(p + t·w, w) dt )    (1.1)

Additionally, in volumes, the probability that light scatters from direction w′ to w is described using a distribution function or phase function, which describes the angular distribution of light scattered by particles in a volume. All tests in this paper were rendered using one of the simplest phase functions, known as the isotropic or constant phase function, which represents the BRDF analog for participating media [4]. Finally, while σs may reduce the energy of a ray passing through a volume, radiance from other rays may contribute to the original ray's radiance, which is called scatter-in. After guaranteeing that the phase function is normalized, so that it defines a proper probability distribution over directions, we can integrate the radiance scattered toward a direction w over all directions w′:

S(p, w) = Lve(p, w) + σs(p, w) ∫_{S²} phase(p, −w′ → w) Li(p, w′) dw′

The volume emission coefficient, Lve(p, w), is not discussed here.
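The sketch below is one way to estimate the transmittance of equation (1.1) by ray marching through a grid of extinction coefficients; it is an illustrative implementation, not the paper's code, and the grid layout, step size, and coordinate convention are assumptions.

```python
import numpy as np

def transmittance(sigma_t, origin, direction, length, step=0.05):
    """Estimate Tr(p -> p') = exp(-integral of sigma_t ds) along a ray by ray marching.

    sigma_t  : 3D array of extinction coefficients (absorption + scatter)
    origin   : ray start, in voxel-grid coordinates
    direction: unit direction vector
    length   : distance d to integrate over
    """
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    optical_depth = 0.0
    t = 0.5 * step
    while t < length:
        p = origin + t * direction
        idx = tuple(np.clip(p.astype(int), 0, np.array(sigma_t.shape) - 1))
        optical_depth += sigma_t[idx] * step
        t += step
    return np.exp(-optical_depth)

if __name__ == "__main__":
    vol = np.full((32, 32, 32), 0.1)                          # homogeneous test volume
    print(transmittance(vol, (0, 16, 16), (1, 0, 0), 20.0))   # ~exp(-2) = 0.135
```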
2 Related Work
Global Illumination: Global illumination is an important problem in computer graphics, with numerous successful algorithmic solutions, including, photon mapping [5], radiosity [6] and Monte Carlo sampling techniques for ray tracing [2]. The most closely related methods to this work are those that sample the scene and use a two stage approach to model direct and indirect illumination. Of particular relevance is Point-Based Approximate Color Bleeding [1], which describes the process of sampling the scene in order to create a point cloud representation, used to evaluate the incoming radiance surrounding a given point on a surface. As recently as 2010, discussion of approximating volume scattering using point clouds was mentioned in [7], but without algorithmic details, such as how back-to-front or front-to-back rasterization would be achieved with the current rasterization method (handled by our octree traversal method) or how scatter, extinction and absorption would be managed within the volume representation in the point cloud. Other closely related work includes attempts at simulating light scatter and absorption properties of participating media through existing photon mapping techniques, which have shown promise in the past. Jensen in [8] describes a process where photons participate and become stored inside the volume itself for later gathers during volume integration. While this technique is shown to work, it primarily focuses on caustic effects in volumes and the generated photon map.
Our storage method does not require data to be stored in the volume itself, but in a separate, more lightweight data structure better suited for out-of-core rendering. Volume Rendering: This paper is focused on the rendering of scenes which contain volume data. A number of approaches have been developed in order to render volume data [9][10]. Volume data representations often include an efficient multi-resolution data representation [11][12]. When dealing with multi-resolution volume octree data structures, removing occluded nodes from being tested can drastically increase performance [13]. Our algorithm takes advantage of a multi-resolution, view-independent octree data structure in order to handle a large amount of complex lighting and volume data, skipping material occluded by opaque geometry cached in the data structure in the form of surfels. Although the idea of using these techniques with volumes is not new, utilizing them to guarantee correct volume integration in the point cloud used in PCB is.
3 Algorithm
We present an algorithm which is an extension to the point cloud techniques described in [14] and [1], specifically building off the point-based color bleeding technique by Christensen. The modifications involve evaluating light scatter and absorption properties at discrete points in the volume and adding them to the point cloud. Using a front-to-back traversal method, we can correctly and quickly approximate the light-volume representation’s contribution to a scene’s indirect lighting evaluation. In general, PCB methods subdivide the world into small representational segments, called surfels [1], which are stored in a large point cloud, representing the scene. Surfels are used to model direct illumination, and are then used in a later phase to compute indirect lighting and color bleeding in an efficient manner. The goal of our method is to include volumetric representations into PCB methods in a fast and coherent way while keeping memory overhead and computational complexity to a minimum. In the existing algorithm [1], surfels represent opaque materials within the point cloud, thus to incorporate a representation of volumetric data, an additional data representation was necessary to handle the scatter and absorption properties of participating media. Our data representation closely follows the model of surfels, in that we choose to sample the volume at discrete locations and store a finite representation of the lighting at those discrete locations, but with modifications to handle the special attributes of lighting in transparent media. In keeping with the naming conventions established, we call our discrete sampling of lighting elements for a volume: lvoxels. In general, our algorithm must 1) sample the scene geometry (including the volume) and store the direct lighting (or relevant lighting properties) 2) gather indirect lighting and 3) model the scatter-out and scatter-in properties of volumetric lighting.
3.1 Sampling the Scene
The goal of this stage of the algorithm is to sample the scene geometry (including the volume) and store the direct lighting in a finite data representation to be used later for global illumination lighting effects. As all of our finite data represents the direct lighting of some small portion of a surface or element in a three-dimensional scene, we refer to the union of all finite lighting samples as a "point cloud". This point cloud is stored in an octree for efficient access to all data elements, surfels and lvoxels. Surfels differ from lvoxels only in that surfels represent flat, solid geometry while lvoxels represent a transparent, volumetric medium. Both have a radius and a position, so both can be placed in the same point cloud. We sample the opaque geometry as surfels, which are computed using a perspective viewing volume slightly larger than the current viewing frustum, with a sampling rate two times that of the desired pixel resolution. Rays are cast from this logical camera just as we would ray trace a normal scene, with surfels generated where those rays intersect the scene. Lvoxels are generated by marching over the entire domain of the volume at a specific, preset interval, sampling scatter and absorption coefficients in order to get an average throughout the area an lvoxel will occupy. Typically this involves eight to sixteen absorption and scatter samples per lvoxel. These values, as well as the radius of the lvoxels, may differ depending on the complexity and raw resolution of the volume. In our tests, one lvoxel for every 2³–4³ voxels achieved good results. Caching the direct light contribution at each lvoxel by testing the transmittance (equation 1.1) to each light source saves us from re-computing light calculations during the sampling in Sections 3.3 and 3.4 [15].
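The sketch below shows one plausible way to generate lvoxels by averaging the scatter and absorption coefficients over blocks of voxels, as described above. It is an assumption-laden illustration, not the authors' implementation; the tuple layout returned per lvoxel and the bounding-sphere radius are choices made for this example.

```python
import numpy as np

def build_lvoxels(sigma_a, sigma_s, block=4):
    """Sketch of lvoxel generation: average absorption/scatter over block^3 voxels.

    Returns a list of (center, radius, sigma_a_avg, sigma_s_avg) tuples, one per
    lvoxel, ready to be inserted into the point-cloud octree alongside the surfels.
    """
    lvoxels = []
    nx, ny, nz = sigma_a.shape
    radius = 0.5 * block * np.sqrt(3.0)      # bounding-sphere radius of the block
    for x in range(0, nx, block):
        for y in range(0, ny, block):
            for z in range(0, nz, block):
                sl = (slice(x, x + block), slice(y, y + block), slice(z, z + block))
                center = np.array([x, y, z]) + 0.5 * block
                lvoxels.append((center, radius, sigma_a[sl].mean(), sigma_s[sl].mean()))
    return lvoxels

if __name__ == "__main__":
    a = np.random.rand(32, 32, 32) * 0.05
    s = np.random.rand(32, 32, 32) * 0.10
    print(len(build_lvoxels(a, s, block=4)))   # 8^3 = 512 lvoxels
```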
3.2 Gathering Light
Next, our algorithm uses a gather stage similar to the one in PCB, which calculates the irradiance at a point on a surface given the radiance of the scene around it. Unlike PCB, which uses a software rasterization method, we chose to evaluate irradiance by ray casting into the point cloud over a hemisphere oriented along the surface normal. This decision was made to simplify the tests which compare traditional Monte Carlo sampling methods to the extended PCB algorithm, but also to simplify evaluation of the transparent lvoxels. In order to approximate the integral of incoming light at a point p on the surface, we sample across a hemisphere oriented along the surface normal N at p. Each sample cast out from p evaluates L(p ← w), which is then multiplied by w · N in order to represent cos θ. In order to obtain good results, 128–256 samples are typically necessary to combat sampling noise. How radiance and transmittance are handled with lvoxels is explained in the next section.
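As a concrete illustration of this gather, the sketch below estimates irradiance by uniformly sampling directions on the hemisphere, weighting each by w · N, and normalizing by the sampling density. The callback `trace_point_cloud` is a hypothetical stand-in for the octree ray cast described above.

```python
import numpy as np

def gather_irradiance(p, n, trace_point_cloud, num_samples=256, rng=None):
    """Estimate irradiance at p by casting rays over the hemisphere around normal n.

    trace_point_cloud(p, w) must return the radiance L(p <- w) gathered from the
    point cloud (surfels and lvoxels) along direction w; it is a placeholder here.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = n / np.linalg.norm(n)
    total = 0.0
    count = 0
    while count < num_samples:
        w = rng.normal(size=3)
        w /= np.linalg.norm(w)
        if np.dot(w, n) <= 0.0:        # keep only directions in the upper hemisphere
            continue
        total += trace_point_cloud(p, w) * np.dot(w, n)
        count += 1
    # Uniform hemisphere sampling has pdf 1/(2*pi), so the estimator scales by 2*pi/N.
    return (2.0 * np.pi / num_samples) * total

if __name__ == "__main__":
    # Constant unit radiance from every direction should give irradiance ~pi.
    print(gather_irradiance(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                            lambda p, w: 1.0))
```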
3.3 Adding Scatter-Out
Modifications to the previously mentioned irradiance sampling technique for scatter-out effects with volumes are few. The most significant changes are to
the point cloud octree and its traversal. Specifically, when computing lighting, we must account for the fact that when an element of the point cloud is hit, it may be transparent. In the standard algorithm, absorption and transmittance would not be taken into account and the traversal would stop at the first lvoxel encountered. In order to properly evaluate transparent and opaque surfaces within the point cloud, we made changes to node-level octree traversal. Each branch traverses its children from closest to farthest, guaranteeing that closer leaf nodes are evaluated first. Leaf nodes then use the pre-evaluated scatter (σs) and absorption (σt) coefficients of each lvoxel to appropriately attenuate the sample ray's transmittance and continue with the traversal, with each hit contributing to the final radiance value. Once a surfel is hit, there is no need to continue traversing the octree. In addition to traversing the tree front to back, we also keep track of the incoming radiance and the current transmittance. Both of these values are modified according to the equations described in Section 1.1, taking into account each lvoxel's scatter and absorption coefficients.
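The following sketch shows the front-to-back accumulation step once the octree has produced a near-to-far list of hits: lvoxels add in-scattered radiance and attenuate the transmittance, and the first surfel terminates the ray. The hit record layout and the way per-element radiance is cached are assumptions for illustration, not the authors' data structures.

```python
import math

def shade_ray_through_point_cloud(hits):
    """Front-to-back accumulation along one sample ray (illustrative sketch).

    `hits` is an iterable of point-cloud elements sorted near-to-far by the octree
    traversal. Each hit is a dict with:
      kind          : "surfel" or "lvoxel"
      radiance      : cached direct lighting at the element
      sigma_s/sigma_t : scatter / extinction coefficients (lvoxels only)
      ds            : path length of the ray through the lvoxel (lvoxels only)
    """
    radiance = 0.0
    transmittance = 1.0
    for hit in hits:
        if hit["kind"] == "surfel":
            # Opaque: add its contribution and stop traversing.
            radiance += transmittance * hit["radiance"]
            break
        # Transparent lvoxel: add in-scattered light, then attenuate the ray.
        radiance += transmittance * hit["sigma_s"] * hit["radiance"] * hit["ds"]
        transmittance *= math.exp(-hit["sigma_t"] * hit["ds"])
    return radiance

if __name__ == "__main__":
    hits = [
        {"kind": "lvoxel", "radiance": 0.8, "sigma_s": 0.3, "sigma_t": 0.4, "ds": 0.5},
        {"kind": "lvoxel", "radiance": 0.6, "sigma_s": 0.3, "sigma_t": 0.4, "ds": 0.5},
        {"kind": "surfel", "radiance": 1.0},
    ]
    print(shade_ray_through_point_cloud(hits))
```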
3.4 Adding Scatter-In
After adding lvoxels to our octree structure and evaluation algorithm, the only modifications necessary for scatter-in are to the volume rendering equation. Recall that, to model lighting for a volume, in-scattering requires integrating over all directions. Casting Monte Carlo sample rays through the volume and into the scene would be computationally expensive. Instead, for each sample we send out rays into the point cloud, iterating through a much sparser dataset. This helps us replace expensive S(p, w) evaluations with traversals into the octree. The two main differences between sampling scattered light within a volume and evaluating the irradiance on a surface are that 1) the distribution of sample directions is based on the volume's phase function, and 2) the samples are distributed over a sphere rather than a hemisphere. Each of these samples gathers light as described in Section 3.2.
4 Results
Our algorithm is able to achieve realistic lighting effects for scenes that include volumetric elements using our lvoxel representation with a point-based color bleeding approach to global illumination. All test cases were run on a commodity-class Intel i5 3 GHz machine with 4 GB of RAM. Because of the disparity between academic-level and production-class ray tracer implementations, we tested and compared our results against a naive implementation of Monte Carlo global illumination that does not use the point cloud representation. We then compared the resulting images and render times of each. Our algorithm is able to achieve a very small image difference and an increase in render-time efficiency. The scene tested involved a 60,000-triangle Sponza Atrium model including only vertex and normal information for simplicity. CT scan datasets of a human
Fig. 1. CT Head scene comparison of the PCB extension (left) and traditional Monte Carlo results (right)
Fig. 2. (a) Zoomed image showing traditional PCB (left) and PCB with extension (right). (b) Zoomed image showing the PCB extension (left) and Monte Carlo (right)
head and the Stanford Bunny were used in order to test scatter-in/out contributions involving complex participating media. Both Figure 3 and Figure 1 show the CT data in the Sponza Atrium rendered with traditional Monte Carlo scattering and with our extended PCB. The resulting renders are very similar, and a closer look in Figure 2(b) exemplifies the great similarity between the two images. However, there are some small artifacts present in the image rendered with the point cloud representation, and the indirect lighting is slightly darker overall. Further results showing proper color bleeding can be seen in Figure 4. A close-up comparison of traditional PCB versus extended PCB is shown in Figure 2(b).
4.1 Data Comparison
Table 1 shows render times, image differences and memory overhead for the CT scan datasets in the Sponza Atrium with varying-sized volumes and for three different rendering methods: Monte Carlo without PCB, traditional PCB and finally our extended PCB algorithm.

Table 1. Data for the Sponza scene with the two different CT datasets

Scene: CT Bunny in Sponza       Render Time (s)   Image Delta   Memory Overhead
  64³ resolution volume
    Monte Carlo w/o PCB         3229 sec          NONE          NONE
    Traditional PCB              348 sec          5.8%          466.3 MB (4.780%)
    Extended PCB                 433 sec          2.1%          466.7 MB (4.786%)
  128³ resolution volume
    Monte Carlo w/o PCB         3297 sec          NONE          NONE
    Traditional PCB              348 sec          5.6%          466.3 MB (4.780%)
    Extended PCB                 402 sec          2.4%          467.5 MB (4.783%)
  512³ resolution volume
    Monte Carlo w/o PCB         3674 sec          NONE          NONE
    Traditional PCB              348 sec          9.6%          466.3 MB (4.780%)
    Extended PCB                 417 sec          3.8%          466.4 MB (4.785%)

Scene: CT Head in Sponza        Render Time (s)   Image Delta   Memory Overhead
  64³ resolution volume
    Monte Carlo w/o PCB        10150 sec          NONE          NONE
    Traditional PCB              348 sec          14.2%         466.3 MB (4.780%)
    Extended PCB                 756 sec          3.7%          468.0 MB (4.800%)
  128³ resolution volume
    Monte Carlo w/o PCB        15811 sec          NONE          NONE
    Traditional PCB              348 sec          14.4%         466.3 MB (4.780%)
    Extended PCB                 755 sec          4.2%          467.3 MB (4.790%)
  256³ resolution volume
    Monte Carlo w/o PCB        31373 sec          NONE          NONE
    Traditional PCB              348 sec          14.2%         466.3 MB (4.780%)
    Extended PCB                 864 sec          4.3%          467.1 MB (4.790%)
Fig. 3. Bunny Scene comparison of the PCB extension (left) and traditional Monte Carlo results (right)
Memory: When using traditional PCB, the real benefit of its surfel representation shows in more complex scenes. In the Sponza Atrium, the scene generated over 2.5 million surfels for a 60,000-triangle scene. Adding volume data to the scene does not add an objectionable amount of data to the point cloud (see Table 1), but for scenes with large volumes the costs could quickly add up without some form of multi-resolution light caching. In this regard, adding yet another representation of the volumes may be expensive, but not prohibitively so. Additionally, larger scenes would benefit from this representation, as the point cloud would be significantly simpler than the entire scene and can be moved to another system for out-of-core evaluation. Speed: Even without volume integration, Monte Carlo integration without a lighting representation like PCB is prohibitively slow for even the simplest scenes. Adding a point cloud representation gave us an impressive speedup. That speedup was compounded even more when volume scattering was added to the tests, showing a speedup of up to 36 times over the Monte Carlo renders. Compared to traditional PCB runs, we found the increase in overall runtime to be well worth the improvements to the renders we achieved. Even on sparse octrees without volumes, our front-to-back octree traversal operates at an efficiency of O(log n) for each node traversal while skipping nodes occluded by surfels, leading to an average performance increase of over 18%. Image Quality: To objectively compare rendering results, we used a perceptual image difference program called pdiff to identify how closely any two renders matched. The results, which ranged from 2.1% to 4.3%, are shown in Table 1.
Figure 2(a) compares the non-PCB Monte Carlo image with that of the traditional PCB renders, showing the clear lack of proper in-/out-scattering. With the extended algorithm, however, the results are comparable.
4.2 Conclusion
We have presented an extension to the PCB algorithm of [1] which handles both scatter-in and scatter-out contributions. The addition of the lvoxel paradigm to the already successful point-based color bleeding algorithm is shown to be a cost-effective method of approximating and evaluating complex scatter functions based on participating media. We obtained render speeds up to 36 times faster than those of pure Monte Carlo renders, with a memory overhead between 2 and 5 MB and an image difference of less than 5% across all tests. For more information on this subject, please refer to [16].
Fig. 4. Additional examples of clear color bleeding using our algorithm
4.3 Future Work
In our tests, we focused on isotropic phase functions in our volumes, while many volumes could have unique scatter functions. We could simulate more complex surface scattering functions by creating spherical harmonic representations of the radiance at any specific point in the volume. Another fruitful addition to our current method would be including rougher estimations, usually in the form of a series of spherical harmonic coefficients, at higher levels in the octree, to be evaluated depending on that node’s solid angle to a sample point (often present in traditional PCB algorithms). Finally, as mentioned by Christensen [1], surfels can be modified to “gather” light recursively from their position in the point cloud, allowing for simulated multi-bounce lighting. This would require only a small change to the current algorithm, and would apply to volumes as well to allow very realistic scatter approximations in participating media.
Acknowledgements. A special thanks to Patrick Kelly from DreamWorks Animations for his consistent help and support. Our brainstorming sessions were invaluable, and always left me full of new ideas as well as helping me hone the subject for this project.
References
1. Christensen, P.H.: Point-based approximate color bleeding (2008)
2. Cook, R.L., Porter, T., Carpenter, L.: Distributed ray tracing. In: Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1984, pp. 165–174. ACM, New York (1984)
3. Pharr, M., Humphreys, G.: Physically Based Rendering: From Theory to Implementation, 2nd edn. Morgan Kaufmann, San Francisco (2010)
4. Cerezo, E., Perez, F., Pueyo, X., Seron, F.J., Sillion, F.X.: A survey on participating media rendering techniques, vol. 21, pp. 303–328 (2005)
5. Jensen, H.W.: Realistic Image Synthesis Using Photon Mapping. A.K. Peters, Ltd., Natick (2009)
6. Dorsey, J.: Radiosity and global illumination. The Visual Computer 11, 397–398 (1995), doi:10.1007/BF01909880
7. Christensen, P.H.: Point-based global illumination for movie production. In: SIGGRAPH 2010 (2010)
8. Jensen, H.W., Christensen, P.H.: Efficient simulation of light transport in scenes with participating media using photon maps. In: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1998, pp. 311–320. ACM, New York (1998)
9. Levoy, M.: Display of surfaces from volume data (1988)
10. Kajiya, J.T., Herzen, B.P.V.: Ray tracing volume densities (1984)
11. Westermann, R.: A multiresolution framework for volume rendering. In: Symposium on Volume Visualization, pp. 51–58. ACM Press, New York (1994)
12. Levoy, M.: Efficient ray tracing of volume data. ACM Transactions on Graphics 9, 245–261 (1990)
13. Guthe, S., Strasser, W.: Advanced techniques for high-quality multi-resolution volume rendering, vol. 28, pp. 51–58 (2004)
14. Tabellion, E., Lamorlette, A.: An approximate global illumination system for computer generated films, vol. 23, pp. 469–476 (2004)
15. Wrenninge, M., Bin Zafar, N.: Volumetric methods in visual effects (2010)
16. Gibson, C., Wood, Z.: Point based color bleeding with volumes. Technical report, California Polytechnic State University (2011), http://digitalcommons.calpoly.edu/theses/533
3D Reconstruction of Buildings with Automatic Facade Refinement C. Larsen and T.B. Moeslund Department for Architecture, Design and Media Technology Aalborg University, Denmark
Abstract. 3D reconstruction and texturing of buildings have a large number of applications and have therefore been the focus of much attention in recent years. One aspect that is still lacking, however, is a way to reconstruct recessed features such as windows and doors. These may have little value when seen from a frontal viewpoint. But when the reconstructed model is rotated and zoomed the lack of details will leap out. In this work we therefore aim at reconstructing a 3D model with refined details. To this end we apply a structure from motion approach based on bottom up bundle adjustment to first estimate a 3D point cloud of a building. Next, a rectified texture of the facade is extracted and analyzed in order to detect recessed features and their depths, and enhance the 3D model accordingly. For evaluation we apply the method to a number of different buildings.
1 Introduction

3D reconstruction and texturing of buildings have a large number of applications ranging from urban planning and marketing to tourism, where services such as Google Earth and Microsoft's Virtual Earth have promoted the great potential to the public. Much research has been focused on different aspects of the basic problems of reconstruction and texturing [1,2,3], and the frontiers have in recent years been pushed forward. As the technology matures, so does the desire for better and more precise models. One aspect that is currently lacking is an explicit modeling of recessed features such as windows and doors [4]. The consequence of not modeling such facade details is minor when viewing the reconstructed model from a frontal viewpoint or a distance. But when the reconstructed model is rotated and zoomed, the lack of details will leap out. In this work we therefore aim at reconstructing a 3D model with refined details. Recently, methods have been reported that aim to recover detailed facade features. In [5] high resolution 3D data of a building facade is obtained using a terrestrial laser scanner. RANSAC is applied to segment the 3D point cloud into a number of planes corresponding to different facade features. In [6] a similar approach (LiDAR) is applied to obtain high resolution 3D data. Facade features (windows) are segmented based on the notion that glass does not reflect the LiDAR pulses, hence no data. From the detected windows a grammar is made that can generate synthetic facade features for occluding building parts or on new buildings (assuming a similar architecture). In [7] LiDAR data is also applied together with a grammar. A reversible jump MCMC approach is used to select hypotheses which are then compared with the input data. In [8] overlapping images are analyzed to find facade features in buildings with many repeated patterns. The
patterns are detected by looking for similar intensity profiles between interest points found by a Harris corner detector. In [9] facade features are estimated based on just one image. The inherent lack of 3D data is partly overcome by relying on a strong grammar and other architectural insights. In [4] a number of images of the same building are used to extract structure from motion. A sketch-based interface then allows a user to construct first the overall facades and next more and more detailed features. This yields detailed reconstructions, but the procedure takes minutes of user interaction for small buildings and tens of minutes for bigger buildings. We seek a solution that does not involve complicated and expensive hardware, hence we aim at a solution based on imagery from a standard camera. The solution should not require learning a grammar or being limited to certain building types (like big buildings with a high number of repetitive facade features). Our approach is illustrated in figure 1. It starts out by estimating the intrinsic camera parameters without the need for a calibration. Next, a structure from motion approach is applied to estimate a 3D point cloud from which a coarse model of a facade together with a rectified texture is extracted. Finally, the detailed facade features are detected together with their depths and the final detailed 3D model is built. The paper follows the structure of the blocks in figure 1, followed by a results section and a concluding section.
Fig. 1. Overview of the proposed system
2 Pre-processing

Before any 3D reconstruction can commence we need to find the intrinsic camera parameters: the focal length α in pixels and the principal point (x0, y0) in pixels. A standard camera calibration can be applied to obtain these parameters, but if images are captured with different zoom factors, multiple calibrations are required. Moreover, if the user is a non-expert, calibration should be avoided altogether. To this end we suggest approximating the intrinsic camera parameters based on the meta data, which
is stored together with a picture in most modern cameras. The meta data is stored in the Exchangeable Image File Format (EXIF) and includes, among other things, the camera type, resolution, and focal length (in mm). A good approximation of the principal point is given by the resolution, i.e., the center of the picture. The focal length in pixels is given by α = f · d_image / d_sensor, where f is the focal length in mm read from the EXIF data, d_image is the diagonal of the image in pixels (directly given by the resolution), and d_sensor is the diagonal of the sensor in mm. d_sensor is unfortunately not present in the EXIF data, but the camera type is, and since d_sensor is static for a camera a look-up table can be constructed and applied. In this work we have used an Olympus E-520 camera. We have estimated the intrinsic camera parameters using the approximation scheme suggested above and compared them with those obtained using a real camera calibration. We found that all three parameters have errors of less than 3% and, even more importantly, that the results of the 3D reconstruction did not suffer from using the approximation.
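The sketch below illustrates this approximation from EXIF-style inputs. It is not the authors' code, and the sensor-diagonal look-up value for the E-520 (a Four Thirds sensor of roughly 17.3 × 13.0 mm) is an approximate assumption.

```python
import math

# Illustrative sensor-diagonal look-up table (mm); the values are approximations.
SENSOR_DIAGONAL_MM = {
    "OLYMPUS E-520": 21.6,   # Four Thirds sensor, approx. 17.3 x 13.0 mm
}

def intrinsics_from_exif(camera_model, focal_length_mm, width_px, height_px):
    """Approximate (alpha, x0, y0) from EXIF metadata as described above."""
    d_image = math.hypot(width_px, height_px)            # image diagonal in pixels
    d_sensor = SENSOR_DIAGONAL_MM[camera_model]          # sensor diagonal in mm
    alpha = focal_length_mm * d_image / d_sensor         # focal length in pixels
    x0, y0 = width_px / 2.0, height_px / 2.0             # principal point = image center
    return alpha, x0, y0

if __name__ == "__main__":
    print(intrinsics_from_exif("OLYMPUS E-520", 14.0, 3648, 2736))
```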
3 Matching

The purpose of the matching step is to estimate the relative pose of the cameras in all pairs of input images, if possible. For each pair of input images a robust set of corresponding keypoints is identified using SIFT and filtered using RANSAC. Then the fundamental matrix is estimated using the 8-point algorithm [10]. Combining this with the intrinsic camera parameters, the essential matrix and finally the relative pose are recovered. For details see [11,12].
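The sketch below shows one way to realize this step with OpenCV. It is not the authors' pipeline: it substitutes OpenCV's RANSAC-based essential-matrix estimation for the explicit 8-point fundamental-matrix step, and the ratio-test threshold is an assumption.

```python
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    """Estimate the relative camera pose between two grayscale images (OpenCV sketch).

    K is the 3x3 intrinsic matrix built from the approximated parameters.
    Returns (R, t, inlier points in image 1, inlier points in image 2).
    """
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Ratio-test matching of SIFT descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = [m for m, n in matcher.knnMatch(des1, des2, k=2)
               if m.distance < 0.8 * n.distance]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # RANSAC-filtered essential matrix, then decomposition into R and t.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    inliers = mask.ravel().astype(bool)
    return R, t, pts1[inliers], pts2[inliers]
```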
4 Clustering

The purpose of clustering is to recover structure and motion for all input images. In this context a cluster is the collection of cameras with estimated extrinsic parameters and a point cloud of estimated keypoint positions, see figure 6. Traditionally, methods for recovering structure and motion are based on finding a global initial guess and then performing a final optimization using bundle adjustment. For bundle adjustment to succeed, however, a good initial guess is required [13]. In this work a bottom-up approach, where bundle adjustment is applied throughout clustering, has been developed in order to minimize the risk of a bad initial guess preventing recovery of structure and motion. The overall approach is to initialize the cluster from the best matching image pair, and then add images to the cluster iteratively until all input images are included. When adding an image to the cluster, the relative pose recovered during matching is used for computing an initial estimate of the extrinsic camera parameters, and keypoint inliers are triangulated to obtain an estimate of keypoint positions. Each time an image has been added, the whole cluster is optimized by applying bundle adjustment. For details see [11,13,14].
5 Coarse Model

We now have a cloud of 3D points representing the building, see figure 6. Extracting structures directly from the point cloud is error-prone in general, and especially so for fine details such as recessed features. Inspired by [4], we therefore introduce a semi-automatic approach. First, the user is asked to click dominant features, denoted locators, in two images. From these a polygon is built and textured automatically. A GUI is created wherein the user can flip through the different images. For each (overall) planar surface the user clicks its corners in two images and a polygon is built and visualized, see figure 2. If the user accepts the polygon, its 3D corners are extracted using the relation between the two images found in the previous section, and a 3D surface is fitted using a least squares solution.
Fig. 2. Interactively defining a polygon by clicking at corresponding locators in two images. Here nine locators are used.
To create a rectified texture for a polygon, it is necessary to determine which of the input images has the best view of the polygon. We select the input image corresponding to the camera with the smallest angle between the normal of the plane and the direction from the centroid to the camera, see figure 3. As cos θ goes towards 0, the quality of the obtained rectified texture decreases. Therefore, for the selected image it is required that cos θ > τ, where the threshold τ is set to cos 80° in this work. Furthermore, it is required that the reprojections of all locators Li used as vertices for the polygon lie within the boundary of the selected image, because otherwise a texture for the whole polygon cannot be extracted.
Fig. 3. The vectors and angles used for selecting the best image
5.1 Creating the Rectified Texture

For each 3D vertex of the polygon, two 2D projections are computed: a projection into the plane of the polygon, and a projection into the input image. The vertices of the polygon are defined by the locator positions Li, and the projection of each vertex into the plane is computed using two orthogonal axes ux and uy of unit length lying in the polygon plane. First the centroid, m, is subtracted to obtain the local vertex coordinates Vi = Li − m, which are then projected onto the plane to obtain the 3D coordinates

V′i = Vi − (Vi · n)n.    (1)
Now, to compute the 2D projections of the vertices, each V′i is projected onto the plane axes to obtain the vector

v̂i = [ V′i · ux, V′i · uy, 1 ]^T,    (2)

which is the homogeneous 3-vector representing the 2D projection of locator i into the plane of the polygon. The projection of each vertex into the input image is achieved simply by projecting the locator position Li using the camera calibration information available for the best image, a. That is, the projection into image a of the locator with position Li is computed as the homogeneous 3-vector

vi = Ka Li,    (3)

where the camera calibration matrix Ka is applied such that vi is expressed in pixel coordinates. The mapping between the polygon and the input image is now represented by the two corresponding sets of 2D projections: the set in the plane of the polygon defined by v̂i, and the set in image a defined by vi. The final step is to actually create the rectified texture using a homography computed from the corresponding points vi and v′i. From the previous step, the projections vi of the locators in image a are available, and they are expressed in pixel coordinates. The corresponding pixel coordinates v′i in the rectified texture to be extracted are needed, and these can be derived from the projections v̂i in the plane obtained in the previous step. But as the length of the computed plane axes does not correspond to the size of a pixel in the rectified texture, we perform isotropic scaling and translation. Let T be a 3×3 matrix representing a transformation consisting of isotropic scaling and translation; then

v′i = T v̂i,    (4)

where the scale factor is chosen such that the size of a pixel in the rectified texture becomes approximately the same as in the input image, and the translation is chosen such that the rectified texture is cropped to the polygon, leaving a small border.
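The sketch below carries out the corresponding warp with OpenCV: the scaled and translated plane projections play the role of v′i, a homography is fitted against the image projections vi, and the selected image is warped into the rectified texture. The specific scale, border size, and demo coordinates are illustrative assumptions, not the authors' values.

```python
import cv2
import numpy as np

def rectify_polygon_texture(image, v_img, v_plane, scale, border=8):
    """Extract a rectified texture for one polygon (sketch following Eqs. 2-4).

    image   : the selected input image a
    v_img   : Nx2 pixel coordinates v_i of the locators in image a
    v_plane : Nx2 coordinates of the locators projected into the polygon plane
    scale   : isotropic scale making one plane unit roughly one texture pixel
    border  : small border left around the cropped polygon
    """
    v_plane = np.asarray(v_plane, dtype=np.float32) * scale
    v_plane -= v_plane.min(axis=0) - border          # translation T: crop with a border
    width, height = np.ceil(v_plane.max(axis=0) + border).astype(int)

    H, _ = cv2.findHomography(np.asarray(v_img, dtype=np.float32), v_plane)
    return cv2.warpPerspective(image, H, (int(width), int(height)))

if __name__ == "__main__":
    img = np.zeros((480, 640, 3), dtype=np.uint8)
    quad_img = [(100, 80), (540, 120), (520, 400), (90, 380)]     # v_i in the image
    quad_plane = [(0, 0), (4.0, 0), (4.0, 3.0), (0, 3.0)]         # projections in the plane
    print(rectify_polygon_texture(img, quad_img, quad_plane, scale=100.0).shape)
```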
6 Facade Reconstruction

The final step is to refine the coarse model by adding local facade features, such as recessed windows, in order to increase the realism of the reconstructed model. To this end we first find the regions in question and then the depth of each region.
6.1 Facade Segmentation

A facade polygon typically consists of wall regions with large areas of similar color and texture, and smaller regions inside the wall regions, representing facade features, that deviate in appearance from the wall. Thus, for identifying regions in the facade we analyze the rectified texture. Often a wall appears primarily in one color with no significant texture, or it is a brick wall consisting of bricks of similar color with joints between them. In the case of a plain colored wall, identifying the whole wall as one region is simple. For brick walls, however, it is more difficult due to the often high contrast between bricks and joints. We therefore blur the image such that the joints between bricks and other high-frequency elements of the image are reduced. The segmentation is based on color classification in the HSV space. First we apply the single-linkage algorithm, which is based on agglomerative hierarchical clustering. It starts by randomly selecting a subset of samples (HSV pixels); in this work 1,000 samples are used. Each sample starts as a single cluster, and clusters are then iteratively merged until the cost (distance) of merging yet another cluster pair becomes too big. Specifically, let Di and Dj be the subsets of observations in two candidate clusters for merging; then the distance between the clusters is given by

dmin(Di, Dj) = min_{x ∈ Di, x′ ∈ Dj} ||x − x′||.    (5)
We terminate the algorithm when dmin for the nearest cluster exceeds 0.025. In figure 4 a representative subset of samples is shown. The colors illustrate the two non-singular clusters when the algorithm is terminated. For all available test sequences the biggest cluster represents the wall. We model this by a 3D Gaussian distribution and apply Mahalanobis distance to classify all HSV pixels in the rectified and blurred texture image, see figure 4. For illustrative purposes we also designed a classifier for the second biggest cluster (green pixels). The resulting binary image is filtered using morphology and connected-component analysis. Finally, each recessed feature is detected as a region of connected (black) pixels inside the filtered image, see figure 4.
Fig. 4. 1: Rectified texture. 2: The single-linkage algorithm applied to the randomly sampled pixels, plotted in the hue-saturation-value space; the two largest clusters are marked in red and green. 3: Classification of all pixels in the polygon texture. 4: The segmented regions are filtered.
6.2 Refining the Model

Each of the segmented regions is represented as a subset of pixels in the binary image. For refining the coarse model, however, it is necessary to transform the pixels into
polygons that represent the contour of the detected facade region. Both contour sampling and polygon approximation were investigated, but it turned out that finding the oriented rectangle with minimum area enclosing the pixels was a more stable solution1. For each rectangle a hole is carved out of the polygon and replaced by five new polygons: one representing the recessed feature and four representing the edges of the wall surrounding the recessed feature. The textures are found as explained in section 5. For textures that are not defined we copy the texture from the opposite edge. The actual depth of a recessed feature is estimated using an analysis-by-synthesis approach. For each possible depth value we update the 3D model. We then synthesize the texture of the model for the recessed feature (including the edges) as seen from the viewpoint of a specific image and compare it with the actual pixel values in that image using the correlation coefficient

r = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )    (6)

where xi and yi are the ith pixels in the image and the synthesized texture, respectively, and x̄ and ȳ are the mean pixel values in the image and the synthesized texture, respectively. This is done for different depth values and averaged over three different images (the most frontal and the most extreme viewpoints from each side). The maximum of the averaged curve defines the most likely depth value.
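A minimal sketch of Eq. 6 and the depth search follows. The `synthesize` callback is a hypothetical placeholder for rendering the recessed feature at a candidate depth from a given viewpoint; the toy demo data is purely illustrative.

```python
import numpy as np

def correlation(x, y):
    """Normalized correlation coefficient of Eq. 6 between two aligned patches."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

def estimate_depth(depths, images, synthesize):
    """Pick the depth maximizing the correlation averaged over the given images.

    synthesize(depth, image_index) must return the synthesized texture of the
    recessed feature (plus edges) as seen from that image's viewpoint, aligned
    pixel-for-pixel with the corresponding patch in images[image_index].
    """
    scores = [np.mean([correlation(images[i], synthesize(d, i))
                       for i in range(len(images))]) for d in depths]
    return depths[int(np.argmax(scores))]

if __name__ == "__main__":
    patch = np.random.rand(16, 16)
    # Toy synthesizer: noise shrinks as the candidate depth approaches 0.3.
    synth = lambda d, i: patch + abs(d - 0.3) * np.random.rand(16, 16)
    print(estimate_depth(np.linspace(0.0, 0.6, 13), [patch, patch, patch], synth))
```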
7 Results

Eleven different data sets are used to evaluate the system. In figure 5 sample images can be seen.
Fig. 5. Sample input images from the eleven data sets
1 This is of course only true for rectangular regions.
Fig. 6. The different steps in our system illustrated for data set i. Top left: Recovered structure from motion. Top center: The coarse model. Top right: The rectified texture of the course model. Bottom left: The coarse model superimposed on an input image. Bottom center: The refined model for one facade. Bottom right: The refined model including texture.
Fig. 7. Three different synthetic viewpoints of the building in data set a with recessed features inserted (top row) and without recessed features (bottom row)
In figure 6 the sub- and final results of our approach are shown for data set g. In the preprocessing step a massive number of keypoints is found in each image, due especially to the significant high-frequency information coming from the bricks. To avoid such repetitive keypoints we enforce a conservative threshold when matching keypoints and hence only keep the most distinctive keypoints. The quality of the structure from motion (RMS reprojection error) estimated using our method is below 0.4 pixels, which is acceptable for reconstruction. In figure 6 a representative example is shown. In six of the eleven data sets, the number of correctly detected facade features equals the ground truth. For the remaining data sets, the percentage of correctly detected features is between 65% and 86%, and for all data sets on average 89% of the ground truth features were correctly detected. The main source of error is occlusion, that is, when a foreground object (tree or lamppost) occludes parts of the building. Since all pictures were taken from the ground, some edges cannot be textured (illustrated by a yellow/black pattern). We handle this by copying the texture from one of the other recessed edges. This can be seen in figures 7 and 8, where close-ups of different facade features are shown with and without refining the model. These figures also illustrate the benefit of modeling detailed features.
Fig. 8. Three different synthetic viewpoints of the building in data set c with recessed features inserted (top row) and without recessed features (bottom row)
8 Conclusion

We have presented a method for detecting facade features and including them in a refined 3D textured model of a building. The approach is based on structure from motion to estimate a 3D point cloud wherein a coarse facade model is defined. We segment the rectified texture of the coarse model using a clustering approach and hereby find the local facade features. Their depths are estimated using an analysis-by-synthesis approach. The approach is tested on eleven data sets and detects 89% of the facade features correctly. Future work includes a texture selection mechanism for occlusion handling and an automated process for finding the coarse model, e.g., by finding intersections of planes fitted to the 3D data.
References
1. Pollefeys, M., Van Gool, L.: From Images to 3D Models. Communications of the ACM 45, 50–55 (2002)
2. Frahm, J., Georgel, P., Gallup, D., Johnson, T., Raguram, R., Wu, C., Jen, Y., Dunn, E., Clipp, B., Lazebnik, S., Pollefeys, M.: Building Rome on a Cloudless Day. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010, Part IV. LNCS, vol. 6314, pp. 368–381. Springer, Heidelberg (2010)
3. Benitez, S., Denis, E., Baillard, C.: Automatic Production of Occlusion-Free Rectified Facade Textures using Vehicle-Based Imagery. In: International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, Commission XXXVIII, Saint-Mande, France (2010)
4. Sinha, S.N., Steedly, D., Szeliski, R., Agrawala, M., Pollefeys, M.: Interactive 3D Architectural Modeling from Unordered Photo Collections. In: SIGGRAPH Asia (2008)
5. Boulaassal, H., Landes, T., Grussenmeyer, P., Tarsha-Kurdi, F.: Automatic Segmentation of Building Facades using Terrestrial Laser Data. In: International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, Commission V, Munich, Germany (2007)
6. Becker, S., Haala, N.: Grammar Supported Facade Reconstruction From Mobile LiDAR Mapping. In: International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, Commission XXXVII, Paris, France (2009)
7. Ripperda, N., Brenner, C.: Application of a Formal Grammar to Facade Reconstruction in Semiautomatic and Automatic Environments. In: AGILE International Conference on Geographic Information Science, Hannover, Germany (2009)
8. Recky, M., Wendel, A., Leberl, F.: Facade Segmentation in a Multi-View Scenario. In: International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission, Hangzhou, China (2011)
9. Gool, L., Zeng, G., Borre, F., Muller, P.: Towards Mass-Produced Building Models. In: International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, Commission III, Munich, Germany (2007)
10. Hartley, R.I.: In Defence of the 8-point Algorithm. In: ICCV, Cambridge, USA (1995)
11. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2004) ISBN: 0521540518
12. Nister, D.: An Efficient Solution to the Five-Point Relative Pose Problem. PAMI 26(6), 756–770 (2004)
13. Schwartz, C., Klein, R.: Improving Initial Estimations for Structure from Motion Methods. In: The 13th Central European Seminar on Computer Graphics, CESCG 2009 (2009)
14. Lourakis, M.I.A., Argyros, A.A.: SBA: A Software Package for Generic Sparse Bundle Adjustment. ACM Transactions on Mathematical Software 36 (2009)
Surface Reconstruction of Maltese Cisterns Using ROV Sonar Data for Archeological Study C. Forney1 , J. Forrester1 , B. Bagley1 , W. McVicker1 , J. White1 , T. Smith1 , J. Batryn1 , A. Gonzalez1 , J. Lehr1 , T. Gambin2 , C.M. Clark1 , and Z.J. Wood1 1
California Polytechnic State University, San Luis Obispo, CA, USA 2 AURORA Special Purpose Trust, Malta
Abstract. We present a methodology and algorithm for the reconstruction of three dimensional geometric models of ancient Maltese water storage systems, i.e. cisterns, from sonar data. This project was conducted as a part of a four week expedition on the islands of Malta and Gozo. During this expedition, investigators used underwater robot systems capable of mapping ancient underwater cisterns and tunnels. The mapping included probabilistic algorithms for constructing the maps of the sonar data and computer graphics for surface reconstruction and visualization. This paper presents the general methodology for the data acquisition and the novel application of algorithms from computer graphics for surface reconstruction to this new data setting. In addition to reconstructing the geometry of the cisterns, the visualization system includes methods to enhance the understanding of the data by visualizing water level and texture detail either through the application of real image data via projective textures or by more standard texture mapping techniques. The resulting surface reconstructions and visualizations can be used by archaeologists for educational purposes and to help understand the shape and history of such water receptacles.
1
Introduction
Due to its strategic location in the Mediterranean Sea, Malta’s culture has been strongly influenced by neighboring countries in both Europe and Africa. Today, Malta serves as a prime location for exploring archeological artifacts, many of which remain relatively intact and are found beneath the water. The work reported here is interdisciplinary, utilizing emerging robotics and computer science technology to explore previously unexamined underwater structures. The project concerns the investigation of ancient water storage systems, i.e. cisterns, located in most houses, churches, and fortresses of Malta. Cisterns are typically man-made stone underground water storage receptacles, which served a vital role in the survival of the Maltese, especially during sieges. According to [1], their importance to residents during medieval times was such that they were required by law to maintain access to three years of water supply.
This project was conducted as a part of a four-week expedition on the islands of Malta and Gozo in March 2011. The project focused on obtaining archeological and historical data, as well as validating robot technology and computer graphics visualizations. This was accomplished through a series of underwater robot system deployments within cisterns. Archaeologists looking to study and document underwater cisterns in Malta have found it too expensive, difficult, and dangerous to send people into them. The goal of this project was to gather data about the underwater cisterns and enable the creation of geometric models of the cisterns that could later be explored and visualized in a computer graphics application. This work builds on two previous expeditions [2], after which only two-dimensional maps were created. In all, this project involved the exploration of over twenty cisterns and water features. A small underwater robot, specifically a VideoRay micro-ROV (Remotely Operated Vehicle), was used (see Figure 1). The ROV is controlled above the surface using a joystick housed within a control box. A tether connects the ROV to the control box, allowing sensor information (i.e. depth, bearing, and video camera images) to be displayed within the control box. The tether also allows power and control signals to be sent to the ROV. The ROV is actuated with two co-planar thrusters (motor-driven propellers) and one vertical thruster. Onboard lights are used to illuminate the ROV video camera’s field of view. Investigators deployed the ROV into cistern access points until it was submerged. The investigators then tele-operated the robot to navigate the underwater environment. Video images were recorded using the ROV’s onboard camera. Two- and three-dimensional maps of the cisterns were created using data from a SeaSprite scanning sonar mounted on top of the ROV. These sonar measurements were used with multiple algorithms to conduct ROV localization and to create three-dimensional occupancy maps representing the cisterns. These occupancy maps (or grids) were then used to generate the geometric models of the cisterns. This paper presents the general methodology for the data acquisition and the novel application of algorithms from computer graphics for surface reconstruction to this new data setting. In addition, we present algorithms for the visualization and texture mapping of such models (see Figure 6). These visualizations are used by archeologists for understanding the shape, connectivity and history of the cisterns and can be used for educational purposes to illustrate the importance and history of such water features.
2
Related Work
Mapping via Underwater Robot Systems: Creating maps using robots equipped with appropriate onboard sensing has been an active area of research for many years. Laser scanners, stereo-vision, and ultrasonic sensors have all been used to enable a variety of mapping tasks with ground-based, aerial, and underwater robots. This work focuses on mapping with underwater robots (e.g. [3], [4,5], [6], and [2]), which has seen less attention when compared to mapping in other
Fig. 1. On the left are images of the data acquisition process, while the right shows acquired (and mosaiced) sonar data. The bottom left photo is the VideoRay micro-ROV. Top left is a photo of investigators flying the ROV through the Mdina Archives cistern to gather sonar and video data. The middle photo shows the deployment of the ROV by lowering it into the cistern access point.
environments. Regardless, there are a large number of significant real-life applications that can benefit from such technology, including oceanography [7], marine biology [8], and archaeology [6], as well as robot navigation (e.g. [9,10]). In the last 20 years, research has been conducted in developing algorithms capable of fusing data from multiple sensors to create maps. In particular, Simultaneous Localization and Mapping (SLAM) algorithms have been developed that can simultaneously construct an estimate of a map and calculate an estimate of the state of the robot (e.g. position) with respect to the map. A good survey of the core techniques, including both Kalman filtering and particle filtering based techniques, can be found in [11]. Other relevant work includes research in underwater robot SLAM. One of the first instances is the work done in [3], where sonar scans were used to map and track features of the environment. Scan-matching techniques have been used by researchers conducting mapping and localization with robots in harbor environments [12]. More recently, successful 3D tunnel mapping in underwater environments was demonstrated in [4,5]. Unlike the related work, this research requires mapping in tunnels with passages and access points that are relatively small (e.g. 0.30 m diameter at some points). This requires a smaller robot with low payload capacity and minimal sensors (i.e. scanning sonar, depth sensor, and compass). White et al. [2] demonstrated that two-dimensional maps of cisterns could be constructed with this limited sensor suite, and here that work is extended to construct three-dimensional occupancy grid maps, which can later be surfaced and texture-mapped for high-quality visualization. Surface Reconstruction and Visualization: The problem of digital surface reconstruction is well known in the computer graphics community (e.g. [13,14,15]),
due to the importance of creating three dimensional computer models with which users can interact. Once such a geometric model has been constructed, a user can then examine it from any view in order to gain insight about its nature and areas of interest. Creating such a tool for this setting assists archaeologists in their examination of the underwater structures and helps with the planning of future robot missions. Surface reconstruction in the maritime setting is a relatively new area of research [7]. This project involves the novel application of well known surface reconstruction algorithms [16] and data visualization [17] to this new setting and data.
3
Methodology and Algorithms
As illustrated in Figure 2, this project generally involves a two-stage process: data acquisition in the field and offline post-processing to construct the maps and visualizations. The focus of this paper is on the later stages of the pipeline, but we briefly describe the process by which sonar and image/video data are acquired.
Fig. 2. A diagram of the major pipeline stages of this project
3.1
Data Acquisition
At each cistern site, the ROV is lowered into the cistern for data acquisition. During deployment, video images of the cistern are recorded, and stationary sonar scans are obtained. A portion of each horizontal plane scan is taken while the ROV rests on the bottom of the cistern. Due to scale or occlusions, the ROV must be re-positioned several times to acquire complete coverage of the shape of each cistern. For each horizontal plane, the ROV is positioned such that scans overlap each other to facilitate mosaicing. Sonar data in three dimensions can be acquired by flying the ROV to different depths and continuing to scan the cistern horizontally. Alternatively, in certain cases, the sonar can be configured to scan in the vertical plane, and the ROV can then be rotated to different headings while resting on the bottom for scans. Figure 1 shows versions of mosaiced horizontal and vertical scans. Control signals, depth and heading measurements are recorded for use in producing the occupancy map. A more complete overview of the data acquisition system can be found in [18].
Fig. 3. The cistern located at Wignacourt Museum College Garden. This was a fairly large and complex cistern, including several pillars and stairs down into the cistern. On the left is shown the visualization system’s GUI and a wire frame version of the mesh extracted from the occupancy grid. On the right is a texture mapped model.
3.2
Occupancy Grid Construction
Once data has been acquired from a deployment, it is fused with a SLAM algorithm to construct three-dimensional maps. Since little information regarding the cisterns is available a priori, feature-based mapping was not used. Instead, an occupancy grid mapping approach was used to represent the map, as done in [19]. To be specific, the space being mapped is discretized into a three-dimensional grid of cells, and each cell is given a likelihood p_{i,j,k} ∈ [0, 1] of being occupied. For a given robot position, the cells in an occupancy grid map can be updated via the log-likelihood approach [11]. Specifically, the map was updated using a FastSLAM algorithm [11] that fuses a series of time-stamped sensor measurements including a 360-degree sonar scan, a depth measurement, a compass measurement, and a robot state predicted by either a) a dynamic model of the robot, b) a manual match of overlapping sonar scans, or c) an algorithmic match of overlapping sonar scans. Important to this sensor fusion is the experimentally modeled noise in both the sensors and the motion model. Further details about this mapping algorithm will appear in a manuscript under development. An example of a three-dimensional occupancy grid produced by this algorithm is shown in Figure 4. Note that in certain cases where vertical flight proved difficult due to water depth (i.e., less than 2 m) or depth sensor failure, three-dimensional maps were constructed by extruding two-dimensional maps in the vertical direction.
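The per-cell update in such an occupancy grid is commonly implemented with log-odds bookkeeping. The following Python sketch illustrates that idea for a single sonar beam traced through the grid; the inverse sensor model probabilities p_hit and p_miss are hypothetical placeholders, and the sketch is only an illustration of the log-likelihood update, not the authors' FastSLAM implementation.

```python
import numpy as np

def log_odds(p):
    """Convert a probability to its log-odds representation."""
    return np.log(p / (1.0 - p))

def update_beam(grid_logodds, cells_along_beam, hit_cell,
                p_hit=0.7, p_miss=0.4):
    """Update a 3D log-odds occupancy grid for a single sonar beam.

    grid_logodds     -- 3D NumPy array of log-odds values, one per cell
    cells_along_beam -- (i, j, k) indices traversed before the return
    hit_cell         -- (i, j, k) index at which the return was measured
    p_hit, p_miss    -- hypothetical inverse sensor model probabilities
    """
    for idx in cells_along_beam:                  # cells seen as free space
        grid_logodds[idx] += log_odds(p_miss)     # p_miss < 0.5 lowers belief
    grid_logodds[hit_cell] += log_odds(p_hit)     # raise belief at the return

def occupancy(grid_logodds):
    """Recover the occupancy likelihoods p_{i,j,k} in [0, 1]."""
    return 1.0 - 1.0 / (1.0 + np.exp(grid_logodds))
```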
3.3
Isosurface Extraction
The occupancy grid built during the 3D mapping is used as input data for our surface reconstruction pipeline. The goal of this stage is to produce a geometric model of the surface of the cistern. To accomplish this, we treat the occupancy
grid as a volume of probabilities and extract an isosurface from the occupancy grid using marching cubes [16]. Figure 3 shows the wireframe structure of a mesh extracted from the corresponding occupancy volume. For the surfaces shown, a probability value greater than 0.51 was used to extract the surfaces representing the most likely location of the walls of the cistern. In order to construct visually appealing and more meaningful geometric models, our surface reconstruction algorithm uses two additional algorithms to enhance the data: volumetric smoothing and diffusion. Volume Smoothing: To counteract noise in the data, a smoothing function was used as a pre-process on the volume data. The smoothing function is applied locally in a one-voxel neighborhood in the volume, and averages the probability values of the surrounding six neighbors and the current grid point's probability. Specifically, for all i, j, k:

p'_{i,j,k} = (p_{i,j,k} + p_{i+1,j,k} + p_{i-1,j,k} + p_{i,j+1,k} + p_{i,j-1,k} + p_{i,j,k+1} + p_{i,j,k-1}) / 7.

All of the images shown include one step of smoothing. Note that applying the smoothing function removes small extraneous noise from the occupancy grid (which can arise from small occlusions to the sonar, e.g. sticks in the water, etc.) which would translate into small disconnected components in the data. Examples of the small disconnected surfaces can be seen on the right side of Figure 7, but have been removed from the same data in Figure 6. Alternate filters, such as a Gaussian, will be explored in the future. An additional step is used to remove other small disconnected components from the final geometric model by starting from a random seed face and counting the number of connected faces, f. Any component with f < threshold can be eliminated from the final geometric model.
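A minimal sketch of this pre-processing step, assuming the occupancy grid is available as a NumPy array: the six-neighbor average mirrors the formula above (with simplified wrap-around boundary handling), and the isosurface is then extracted at the 0.51 level, here using scikit-image's marching cubes as a stand-in for the authors' implementation of [16].

```python
import numpy as np
from skimage import measure  # marching cubes, used here in place of [16]

def smooth_occupancy(p):
    """One pass of the six-neighbor-plus-center average over the 3D grid p.

    Boundary handling is simplified (np.roll wraps around the edges),
    which differs slightly from a strict interior-only average.
    """
    acc = p.copy()
    for axis in (0, 1, 2):
        acc += np.roll(p, +1, axis=axis)
        acc += np.roll(p, -1, axis=axis)
    return acc / 7.0

def extract_walls(p, level=0.51):
    """Extract the isosurface representing the most likely wall locations."""
    verts, faces, normals, values = measure.marching_cubes(p, level=level)
    return verts, faces
```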
Fig. 4. Two images of the acquired data of the Mdina Archive. On the left is the occupancy grid produced from the SLAM algorithm. On the right is a visualization of the texture-mapped surface reconstruction. This cistern included a main shaft and two bell-like structures branching off the shaft.
Volume Diffusion: In more complex cisterns, due to occlusions or incomplete data, gaps in the occupancy volume result in holes in the reconstructed surface. A specific scenario that creates gaps in the occupancy data is when sonar readings are composited from multiple horizontal layers of data with gaps between horizontal slices. Our goal is to construct a single connected surface for this sequence of horizontal scans. In order to form a solid surface, we experimented with a very simplified diffusion function to distribute known surface data to vertically neighboring voxels when assumptions about shape could be made. Our algorithm is a very simplified version, related to the work done by Davis et al. [20]. Future work will include investigations into alternative diffusion functions and hole filling. In our setting, we can make a number of simplifying assumptions when tackling hole filling because the occupancy grid was created with knowledge of the ROV's orientation and the volume is written in planar sections where 'height' (traversing in j) directly maps to the vertical orientation in the world. We also know that the shape of the exterior of the cistern should be continuous. Thus, in our setting, the function diffuses volume data through the volume only for surface patches of a particular orientation. Current experiments have focused on applying diffusion for any voxel adjacent to an isosurface polygon with a surface normal n̂ pointing in a vertical direction. Specifically, each polygonal facet of the isosurface is evaluated and, if the facet's normal n̂ is within 45 degrees of vertical, then every p_{i,j,k} adjacent to the polygon diffuses its probability value to its vertical neighbors in the volume; that is, each p_{i,j+n,k} within some neighborhood of size n is set to the probability value of p_{i,j,k}. Figure 5 shows two steps of diffusion using a step size, n, of ten.
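The sketch below illustrates the diffusion idea at the voxel level. Instead of evaluating mesh facet normals as described above, it uses the vertical component of the occupancy gradient as a stand-in normal test; the 45-degree criterion and the neighborhood size n are taken from the text, while everything else (the band around the isovalue, and diffusing both up and down) is an assumption.

```python
import numpy as np

def diffuse_vertical(p, level=0.51, n=10, band=0.1, max_deg=45.0):
    """Simplified vertical diffusion of occupancy values.

    Voxels close to the isosurface whose occupancy gradient points within
    max_deg of the vertical (j) axis copy their value to the n voxels
    above and below them.  The gradient is used as a stand-in for the
    facet normals of the extracted mesh.
    """
    out = p.copy()
    gi, gj, gk = np.gradient(p)                       # j is 'height' here
    norm = np.sqrt(gi**2 + gj**2 + gk**2) + 1e-9
    near_surface = np.abs(p - level) < band           # assumed band width
    vertical = np.abs(gj) / norm > np.cos(np.radians(max_deg))
    for i, j, k in zip(*np.where(near_surface & vertical)):
        lo, hi = max(j - n, 0), min(j + n + 1, p.shape[1])
        out[i, lo:hi, k] = p[i, j, k]                 # diffuse to neighbors
    return out
```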
Fig. 5. Three side view images of the Mdina archive, on the right is the archive with no diffusion, then one step, then two. Note that Figure 4 illustrates the final shape of the archive geometric model.
3.4
Texturing and Visualization
Once a satisfactory geometric model has been constructed from the occupancy grid, textured models can be displayed using two different methods: either by traditional texturing using input images of stone (for example from the cistern walls) or by projective texture mapping from the video obtained during deployment. For traditional texturing, given the enclosed shape of cisterns, we found that both cylindrical and spherical texture coordinates were sufficient when texturing the models. Projective Textures: One of the goals of this project is the replication of an actual cistern environment using both acquired geometry and textures. To complete this goal, images from the real world acquired with the onboard camera were used for texturing. A completely accurate representation requires that acquired images be placed in the location in the virtual world that corresponds with the precise location in the real world where they were acquired. In addition, these multiple images must be stitched and blended together. In our current implementation, localization of the virtual camera is not computed automatically; instead, an interactive system allows the user to select and place the acquired images in the virtual environment while navigating through the scene. These images are then displayed via projective texture mapping [17]. Because numerous projections are necessary to cover the acquired geometry, we sought solutions that do not require rendering the entire scene multiple times per projection. We optimize rendering by only displaying the viewable triangles for each individual projection. Upon creation, all of the faces of the geometry are compared against the current projection's frustum and only those within this boundary are rendered for that projection. As the program iterates, it displays the combination of all of the saved projections and their viewable geometry in the virtual world with each vertex rendered once. This keeps the runtime smooth and interactive. During runtime, each projection is oriented using inverse matrix multiplication of the current view and matrix multiplication of the projection's view. Figure 7 shows the results of texturing the geometric data with real images acquired from the onboard camera. Note that we found most cisterns to contain fairly murky water, resulting in poor video quality as seen in Figure 7. Future work includes automatic virtual camera placement, image enhancements, and better blending of multiple images. General Visualization: In order to help the user see and understand the shape of the cistern, our visualization system has the ability to explore and manipulate the surface data. A simple GUI was created to allow the user to rotate, translate, view as a wireframe mesh, and change the probability value for the reconstructed surface (see Figure 3). This gives users a good representation of the cistern while allowing them to examine the surface and construction of the walls. This is especially important to archaeologists who wish to analyze the shape of the cistern. In addition, for visualization purposes, a water surface is created to convey the water level within the cistern (for these visualizations the water height is mapped for appearance only).
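The per-projection bookkeeping described above can be summarized as follows. The sketch computes, for a hypothetical projector with projection matrix proj and view matrix view, the projective texture coordinates of mesh vertices and keeps only the triangles whose vertices fall inside the projector's frustum, in the spirit of [17]; the column-vector, OpenGL-style clip-space conventions are an assumption, not taken from the paper.

```python
import numpy as np

# Maps clip-space coordinates in [-1, 1] to texture space [0, 1]
BIAS = np.array([[0.5, 0.0, 0.0, 0.5],
                 [0.0, 0.5, 0.0, 0.5],
                 [0.0, 0.0, 0.5, 0.5],
                 [0.0, 0.0, 0.0, 1.0]])

def projective_uv(vertices, proj, view):
    """Compute projective texture coordinates for world-space vertices.

    vertices -- (N, 3) array of mesh vertex positions
    proj     -- 4x4 projection matrix of the placed camera image
    view     -- 4x4 view matrix of the placed camera image
    Returns (N, 2) texture coordinates and a boolean frustum mask.
    """
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])   # N x 4
    clip = (BIAS @ proj @ view @ homo.T).T
    w = clip[:, 3:4]
    uvz = clip[:, :3] / w                                       # divide by w
    inside = np.all((uvz >= 0.0) & (uvz <= 1.0), axis=1) & (w[:, 0] > 0)
    return uvz[:, :2], inside

def viewable_faces(faces, inside):
    """Keep only triangles whose three vertices lie inside the frustum."""
    return faces[np.all(inside[faces], axis=1)]
```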
Fig. 6. On the left is a geometric model of the cistern located at the Mdina Cathedral Sacristine, Malta. On the right is the Gozo ‘key hole’ cistern. Both models are visualized with representational water and using standard texture mapping.
Fig. 7. Images showing the ‘keyhole cistern’ with projective textures acquired from the video camera on the ROV during flight. On the left is a close-up view within the cistern showing the textured walls. On the right is an overview showing the textured cistern. Note that the water in the cisterns tended to be murky and thus images pulled from the video are likewise murky.
4
Results and Future Work
We have presented the methodology and algorithms used for the acquisition, surface reconstruction, and visualization of cisterns located on Malta. For the March 2011 trip to Malta and Gozo, over twenty different water features were explored using the ROV. In this paper, we show surface reconstruction results from four different cisterns. For a complete database of the cisterns visited, see “http://users.csc.calpoly.edu/˜cmclark/MaltaMapping/sites.html”. The physical scale of all grids used for this paper is 0.1 m per cell. The four cisterns shown include:
– Figure 6 shows the cistern located at Mdina Cathedral Sacristine (Malta). This figure shows the surface reconstructed from the occupancy grid,
textured and with a water visualization included. This model was generated from an occupancy grid of 120x80x20 cells.
– Figures 6 and 7 show the cistern located at House Dar Ta’Anna - Upper courtyard, Gozo, which is a keyhole-shaped cistern. These models were generated from an occupancy grid of 122x117x11 cells.
– Figure 3 shows the cistern at Wignacourt Museum College Garden (Malta). It was situated in a courtyard with the cistern access in a centrally located stone structure. The cistern itself is fairly complex and connects to another cistern access site on the same grounds in the Wignacourt Museum hallway. It includes stairs entering the cistern at one end and several pillars. This model was extracted from an occupancy grid of size 60x150x15.
– Figure 4 shows results from sonar scans of the Mdina Archives. The archive’s cistern is located in a small courtyard directly facing the main entrance to the building. The ROV was lowered down a shaft which opened up into two bell-shaped chambers. The occupancy grid for the Archive data is 135x145x69.
These visualizations will be used by archeologists to explore the shape of the cisterns and in educational material about these hidden underwater structures in Malta. Future work on this project includes further enhancements to surface reconstruction, such as improved handling of holes in the data or the addition of ‘caps’ on top of the cisterns, and improvements to projective texturing, including registering video data for more accurate texturing of the extracted surface. In addition, we would like to include uncertainty visualization to better depict regions of uncertainty or error based on acquired measurements.
Acknowledgements. This material is based upon work supported by the National Science Foundation under Grant No. 0966608.
References 1. Blouet, B.: The story of Malta. Allied Publications (2007) 2. White, C., Hiranandani, D., Olstad, C., Buhagiar, K., Gambin, T., Clark, C.: The Malta cistern mapping project: Underwater robot mapping and localization within ancient tunnel systems. Journal of Field Robotics (2010) 3. Williams, S., Newman, P., Dissanayake, G., Durrant-Whyte, H.: Autonomous underwater simultaneous localization and map building. In: Proceedings of the 2000 IEEE International Conference (2000) 4. Fairfield, N., Kantor, G., Wettergreen, D.: Three dimensional evidence grids for SLAM in complex underwater environments. In: Proceedings of the 14th International Symposium of Unmanned Untethered Submersible Technology (UUST) (2005) 5. Fairfield, N., Kantor, G., Wettergreen, D.: Real-time SLAM with octree evidence grids for exploration in underwater tunnels. Journal of Field Robotics 24(1-2), 3–21 (2006) 6. Clark, C.M., Olstad, C., Buhagiar, K., Gambin, T.: Archaeology via underwater robots: Mapping and localization within Maltese cistern systems. In: Proc. of the 10th International Conference on Control, Automation, Robotics and Vision (ICARCV 2008) (2008)
7. Pizarro, O., Eustice, R.M., Singh, H.: Large area 3D reconstructions from underwater optical surveys. IEEE Journal of Oceanic Engineering (2009) (in press) 8. Williams, S.B., Pizarro, O., Jakuba, M., Barrett, N.: AUV benthic habitat mapping in South Eastern Tasmania. In: Howard, A., Iagnemma, K., Kelly, A. (eds.) Field and Service Robotics. Springer Tracts in Advanced Robotics, vol. 62, pp. 275–284. Springer, Heidelberg (2010) 9. Ribas, D., Ridao, P., Neira, J., Tardos, J.: SLAM using an imaging sonar for partially structured underwater environments. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS (2006)) 10. Conte, G., Zanoli, S., Gambella, L.: Acoustic mapping and localization of an ROV. In: 14th Mediterranean Conference on Control and Automation, pp. 1–6 (2006) 11. Thrun, S., Burgard, W., Fox, D.: Probabilistic robotics. MIT Press, Cambridge (2005) 12. Hernndez, E., Ridao, P., Ribas, D., Batlle, J.: Msispic: A probabilistic scan matching algorithm using a mechanical scanned imaging sonar. Journal of Physical Agents 3, 3–11 (2009) 13. Curless, B., Levoy, M.: A volumetric method for building complex models from range images. In: Proceedings of the 23rd Annual Conference on Computer Graphics and interactive Techniques SIGGRAPH 1996, pp. 303–312. ACM Press, New York (1996) 14. Levoy, M., Pulli, K., Curless, B., Rusinkiewicz, S., Koller, D., Pereira, L., Ginzton, M., Anderson, S., Davis, J., Ginsberg, J., Shade, J., Fulk, D.: The digital Michelangelo project: 3D scanning of large statues. In: SIGGRAPH 2000 (2000) 15. Hoppe, H., DeRose, T., Duchamp, T., McDonald, J., Stuetzle, W.: Surface reconstruction from unorganized points. In: SIGGRAPH 1992, vol. 26, pp. 71–78 (1992) 16. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3D surface construction algorithm. In: Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, pp. 163–169 (1987) 17. Segal, M., Korobkin, C., van Widenfelt, R., Foran, J., Haeberli, P.: Fast shadows and lighting effects using texture mapping. In: Proceedings of Computer Graphics, SIGGRAPH 1992 (1992) 18. Hiranandani, D., White, C., Clark, C., Gambin, T., Buhagiar, K.: Underwater robots with sonar and smart tether for underground cistern mapping and exploration. In: The 10th International Symposium on Virtual Reality, Archaeology and Cultural Heritage VAST (2009) 19. Moravec, H.: Sensor fusion in certainty grids for mobile robots. AI Magazine 9, 61–74 (1988) 20. Davis, J., Marschner, S., Garr, M., Levoy, M.: Filling holes in complex surfaces using volumetric diffusion. In: 1st International Conference on 3D Data Processing, Visualization, and Transmission (2002)
Supporting Display Scalability by Redundant Mapping
Axel Radloff1, Martin Luboschik1, Mike Sips2, and Heidrun Schumann1
1 Institute for Computer Science, University of Rostock, Germany
2 Geoinformatics, German Research Center for GeoSciences, Potsdam, Germany
Abstract. Visual analysis sessions are increasingly conducted in multi-display environments. However, presenting a data set simultaneously to users on heterogeneous displays is a challenging task. In this paper we propose a two-step mapping strategy to address this problem. The first mapping step applies primary mapping functions to generate the same basic layout for all output devices and adapts the object size based on the display characteristics to guarantee the visibility of all elements. The second mapping step introduces additional visual cues to enhance the effectiveness of the visual encoding for different output devices. To demonstrate the Two-Step Mapping, we apply this concept to scatter plots presenting cluster data.
1
Introduction
Smart environments integrate a multitude of interconnected devices to facilitate pro-active assistance in multi-user scenarios. Those ensembles consist of stationary devices such as desktop devices, projectors, motion trackers, or wall-sized displays, but also aim to integrate personal devices of users such as laptops, PDAs and smart phones. Smart meeting rooms are a typical application scenario of smart environments [1], serving as a basis to communicate information, to facilitate discussions and to support decisions. However, in [2] the challenge of Display Scalability has been described, dealing with the consistent visual encoding of the same data on different output devices such as smart phones, laptops or large public displays (see figure 1). The problem to be solved here is related to this challenge and in particular aims to avoid the following problems:
– on small displays, visual clutter may occur, and
– on large public displays, connectivity information can be lost.
This means that simultaneously presenting the same information on different output devices is a challenging task to be solved in smart environments. In this paper, we address the problem of Display Scalability by extending the classical visual mapping to visual variables (which we call primary mapping) with a redundant encoding of data (which we call secondary mapping).
Fig. 1. Applying the proposed Two-Step-Mapping to generate scatterplots showing the same clusters on different display devices in our smart meeting environment
We will demonstrate this approach using the example of presenting clustered data in scatter plot displays. The paper is organized as follows: First, we briefly review the state of the art in Section 2. In Section 3 we introduce the two-step mapping strategy and show its application to scatter plots presenting classified data. Section 4 describes a short user study, and Section 5 concludes and gives an outlook on future work.
2
Related Work
Visual representations have been adapted in multiple ways to address different data properties (e.g., [3,4]), different visualization goals (e.g., [5,6]) and different user capabilities (see [7]). However, the generation of proper visual representations for the heterogeneous multi-display setups found in smart environments has not been sufficiently examined. Such environments are generally heterogeneous ensembles that change over time (devices joining and leaving) and facilitate collaborative work (e.g., in [8,9]). The few visualization approaches developed specifically for such ensembles typically combine the individual displays into a single large one. Thus, current research mainly addresses the problems of sharing content synchronously from multiple devices on multiple displays and sharing the corresponding multiple interactions on the devices (e.g., [9,10]). Other research projects in the field of multi-desktop environments study the effectiveness of such environments (e.g., [11]) or of single display types (e.g., [12]). However, they do not provide a strategy on how to adapt the same visual representation to different displays.
The adaptation of graphical representations according to given output devices, in particular considering the reduced display size of mobile devices, has been addressed extensively in different fields (e.g., for maps [13], for 3D-models [14], for images [15]). The underlying techniques – scaling, aggregation, interpolation, progression and sampling – are also applied in the field of information visualization (e.g., [16,17,18]). However, visual clutter remains the main problem. It appears when too much data is displayed on a screen with limited space [19]. On the other hand, enlarging the display size may pull the visual objects apart and hence features such as data density may be misinterpreted [20,17]. To score the perceivable visual features and the goodness of a visualization, and thus to address these problems, different measures have been introduced (see [21] for an overview). Such measures, in combination with appropriate thresholds, are used either to reduce the amount of displayed data (e.g., by sampling [17,18]), to simplify the visual representation (e.g., the use of binning in [22,23]) or to determine an appropriate level of progression [24].
3
Two-Step Mapping
The mapping step specifies the visual encoding of data by defining visual abstractions that can be graphically presented. That is, visual abstractions represent data through graphical objects specified by their geometry and additional attributes describing the appearance of the objects. The choice of the visual encoding has a huge impact on the effectiveness of conveying information to the user. The mapping step is influenced by different constraints like the characteristics of the data to be encoded, the capabilities of the human visual system, the tasks at hand, but also the characteristics of the given output displays. Because of the diversity and complexity of constraints, many mapping approaches consider one constraint only (typically the data characteristics). In the case of considering output devices, the mapping primarily takes the limited resources of mobile output devices into account, in particular the limited screen size. However, to the best of our knowledge, no mapping strategy published so far specifically addresses the requirements of simultaneously presenting the same data on different displays. This leads to two problems to be solved. First, the visibility of the elements to be displayed has to be guaranteed in every case, depending on the display resolution and viewing distance. Second, the capability of a human when working with different display devices may vary significantly, influenced not least by human cognition (see e.g. [12]). We address these issues by introducing a two-step mapping process. The basic idea is to distinguish between primary and secondary mapping functions. The primary mapping functions generate the basic visualization by realizing the typical mapping step generating a specific visual representation. Additionally, the size of visual abstractions is adapted in such a way that visibility on different display devices is guaranteed.
The secondary mapping functions define additional visual cues to redundantly encode information and thus improve the readability of visual representations in multi-display environments. To this end, the combination of primary and secondary mapping functions preserves visibility and important properties of visual abstractions, but also reflects the characteristics of heterogeneous output devices.
3.1
Primary Mapping
The primary mapping corresponds to the classical mapping step of the visualization pipeline. It generates the visual encoding of data according to the characteristics of a given visualization technique. Thus, the primary mapping defines the same basic layout for all displays. The visibility of visual abstractions on different displays has to be ensured. For this purpose, the size of graphical abstractions needs to be adjusted to a well-defined minimal size to guarantee that objects are perceivable by the user. The adaptation of the object sizes is based on the procedure for eye testing. According to the ISO standard for visual acuity testing [EN ISO 8596:1996-05], an object is distinguishable from the background and other objects, and hence visible (assuming a visual acuity of 1.0), if the object covers at least 1 arc minute of a human's visual angle. Thus, in contrast to [25], we do not determine the visual angle covered by an object. We assume that the object covers at least 1 arc minute (α = 1 arc minute) of a human's visual field. Hence, we calculate the required size s of the object for a given viewing distance d by elementary trigonometry:

s = d · tan(α)    (1)

Furthermore, let r be the ppi (pixels per inch) of the specific display and s the object size in inches; then p, the number of pixels covered by the object, can be determined by the following equation:

s · r ≤ p    (2)
Because of the discrete nature of the pixel space, p has to be rounded up to the next integer value to guarantee visible objects. Note that visual acuity is influenced by lightness: higher lightness increases and lower lightness decreases the visual acuity, as first shown by König [26]. However, a luminance between 160 and 320 cd/m² has no significant influence on visual acuity [27]. Furthermore, the luminance of a typical display is in this range and hence this parameter need not be specifically taken into account. Using the typical viewing distances of [28], the required minimum object sizes for different display classes can be pre-computed and stored in a look-up table. Based on this, the primary mapping step checks the size of the generated visual abstractions and scales the visual output as necessary. Since all visual abstractions are scaled in the same way, the visual attribute size (e.g., as a range [min, max]) can still be used to encode data.
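The two equations can be combined into a small helper that pre-computes the entries of such a look-up table. The display parameters used below (viewing distance and ppi per display class) are those listed later in Section 4.1; the helper itself is only an illustrative sketch.

```python
import math

def min_object_pixels(viewing_distance_m, ppi, alpha_arcmin=1.0):
    """Minimum object size in pixels so it spans at least alpha arc minutes.

    viewing_distance_m -- eye-to-display distance in meters (d in Eq. 1)
    ppi                -- pixels per inch of the display (r in Eq. 2)
    """
    alpha = math.radians(alpha_arcmin / 60.0)         # arc minutes -> radians
    s_meters = viewing_distance_m * math.tan(alpha)   # Eq. (1)
    s_inches = s_meters / 0.0254
    return math.ceil(s_inches * ppi)                  # Eq. (2), rounded up

# Entries for the three display classes used later in Section 4.1
LOOKUP = {
    "large (3 m, 36 ppi)":    min_object_pixels(3.0, 36),
    "medium (0.7 m, 94 ppi)": min_object_pixels(0.7, 94),
    "small (0.4 m, 163 ppi)": min_object_pixels(0.4, 163),
}
```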
3.2
Secondary Mapping
The secondary mapping step redundantly encodes data to improve the effectiveness of visual representations on heterogeneous displays. In principle, all visual attributes that have not been used by the primary mapping functions can be applied to specify the secondary mapping step. However, each visualization technique requires different visual encodings and thus different subsets of visual variables to define the primary mapping functions. Furthermore, different tasks and output devices require specific encodings. Hence, a general guideline for the design of secondary mapping functions cannot be established. However, the functions have to be chosen with regard to the applied visualization technique, the task at hand, and the given output device.
3.3
Example: Adapting Scatter Plot Displays to Different Output Devices
As an example, we show in this section the application of the two-step mapping concept to scatter plots displaying classified data. The task to be supported is detecting clusters simultaneously on large displays with sparse point distributions or on small displays with dense point regions.
Primary Mapping. In our example, generating scatter plots, the primary mapping is defined by position (encoding data values) and color (class membership). Thus, we specified two primary mapping functions p1 and p2 to define the basic layout:
– p1 maps data values onto positions,
– p2 maps class memberships to color (hue).
The primary mapping function p1 generates the typical scatter plot display by using position, whereas the primary mapping function p2 allows the classes to be distinguished by mapping the cluster membership to color. We encode class membership by color because Mackinlay [29] ranked this variable as the second best for nominal data (the best is position, which is already used). We apply the color scale provided by Healey [30], which has proven to be a very good color scale for the distinction of nominal data. However, for gray-scale displays the variable value has to be used instead of hue. To guarantee visibility of all dots for typical viewing distances, we additionally define the mapping function p3:
– p3 defines the minimal dot size with regard to visibility constraints.
The size calculation is based on the formula described in Section 3.1.
Secondary Mapping. For the definition of the secondary mapping functions, further appropriate visual variables have to be found to redundantly encode the data, in this specific case the cluster membership. The question is which visual variables can be used. Our discussion is based on the classification of visual variables by Bertin [31] (position, size, color, value, shape, orientation
(a) primary mapping (b) primary + secondary attribute orientation (c) primary + secondary attribute shape (d) primary + secondary attribute connection
Fig. 2. Scatter plot showing identical data with the application of different mapping functions. (a) shows the classical scatter plot using the primary mapping functions. In (b), (c) and (d) additional secondary mapping functions have been applied; in (b) the secondary mapping function s1 (orientation), in (c) the mapping function s2 (shape), and in (d) the mapping function s3 (connection).
and texture) and Mackinlay [29], who introduced three further attributes (density, connection and containment), replaced size by length & area & volume, and replaced orientation by angle & slope. As described above, for the redundant encoding of data in the secondary mapping, appropriate visual variables have to be found. This procedure depends on characteristics of the chosen visualization technique and the presented data. Using our example of redundant encoding of cluster membership in scatter plots, the following illustrates the individual steps of this procedure: The first four of the seven visual variables provided by Bertin [31] – position, size, and color (or value for grayscale displays) – are utilized by the primary mapping functions and thus cannot be used by secondary mapping functions. The visual variable texture should also be excluded, since the dot size is adjusted in such a way that the visibility is ensured, but it cannot be assumed that
the dot size also ensures the representation of readily identifiable textures. The same holds true for the attribute containment: the chosen dot size guarantees visibility, but does not allow for further encodings. Also, a further refinement of visual variables as suggested by Mackinlay does not provide new opportunities for our purposes; readily identifying both length and area of small dots would be very difficult. Hence, only the visual variables shape and orientation of Bertin's list and the variable connection of Mackinlay's extension can be used as additional variables to adjust the scatter plot display. Based on this discussion, we introduce three additional secondary mapping functions:
1. s1: maps classes onto orientation. s1 is realized by replacing the dots representing the data in the scatter plot by little bars. The cluster membership is mapped onto a rotation angle γ ∈ [0°, 90°] and the bars are rotated accordingly. To determine the required angle, the 90° interval is divided equidistantly with regard to the number of existing clusters (see figure 2(b)). Thus, otherwise overplotted dots that do not belong to the same cluster can be distinguished by their differently rotated bars. Here, the minimum object size calculated in the primary mapping is mapped to the width of a bar.
2. s2: maps classes onto shape. For s2 the dots are replaced by regular shapes (all sides have the same length and all angles are the same). Here, the cluster membership is mapped onto the number of vertices. The data value is represented by the center of the shape (see figure 2(c)). The minimum size is mapped to the diameter of the surrounding circle of a shape.
3. s3: maps classes onto connection. s3 connects the dots of one cluster with the cluster centroid by lines. Therefore, the centroid of a cluster is first found by calculating the center of the cluster, ((x_max − x_min)/2 + x_min, (y_max − y_min)/2 + y_min), and determining the data point with the least Euclidean distance to this center. Then, the other points are linked to the centroid by drawing straight lines (see figure 2(d)). Here, the minimum size is mapped to the size of the dots.
The next section describes a user study demonstrating that secondary mapping functions such as s1, s2, and s3 remarkably improve the perception of clustered data on heterogeneous displays.
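Before turning to the study, the following sketch summarizes how the three secondary mappings can be translated into per-glyph drawing parameters; the equidistant division of the 0–90 degree interval and the polygon-vertex offset are interpretations of the description above, not the authors' code.

```python
import numpy as np

def orientation_angles(num_clusters):
    """s1: one bar angle per cluster, dividing [0, 90] degrees equidistantly."""
    return np.linspace(0.0, 90.0, num_clusters)

def shape_vertex_count(cluster_id, min_sides=3):
    """s2: cluster membership mapped to the number of polygon vertices."""
    return min_sides + cluster_id

def connection_segments(points):
    """s3: line segments linking every point of a cluster to its centroid.

    The centroid is the data point closest to the center of the cluster's
    bounding box, following the formula given above.
    """
    mins, maxs = points.min(axis=0), points.max(axis=0)
    center = (maxs - mins) / 2.0 + mins
    centroid = points[np.argmin(np.linalg.norm(points - center, axis=1))]
    return [(pt, centroid) for pt in points]
```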
4
User Study
This user study examines how the proposed Two-Step Mapping for presenting clusters in scatter plots addresses Display Scalability in our smart room environment. To evaluate the Two-Step Mapping strategy, we determine the success rate of connection, shape, and orientation for a small, a medium, and a large display device. The success rate is an important aspect of Display Scalability; we define it as the ratio of correct answers to all answers.
We compare the success rates of the Two-Step Mapping strategy against the well-known visual mapping of scatter plots, which directly corresponds to our primary mapping. In this section, we present preliminary results of our user study that support our research hypotheses:
1. H1: Encodings with secondary mapping perform better on the display devices used than a primary mapping of clusters alone.
2. H2: The display device affects the success rate, i.e., the success rate of the different secondary mappings varies across the small, medium, and large display devices used.
4.1
Setup
Technical base: The class of large displays is represented by a 61” TV (see figure 3(b)) with a resolution of 1920x1080 pixels (36 ppi, pixels per inch). The medium displays are represented by a 24” desktop display with a resolution of 1920x1200 pixels (94 ppi) presenting the samples in a 500x500 pixel window. The small displays are represented by a first-generation Apple iPod touch (see figure 3(a)) with a 3.5” screen and a resolution of 320x480 pixels (163 ppi). The viewing distances have been fixed with regard to [28]. Thus, the participants were placed 3 m in front of the large TV, 70 cm from the desktop monitor, and 40 cm from the iPod touch.
(a) TV with additional attribute connection
(b) iPod touch with additional attribute orientation
Fig. 3. Demonstration of two typical views with the enhanced mapping applied
Data characteristics: The dataset to be tested was generated using a mixture of normal distributions: First, the number of clusters (4 to 6) is set for each scatter plot. We observed that the success rate decreases with an increasing number of clusters for scatter plots on desktop devices; the task of detecting clusters in scatter plots becomes difficult for many clusters. Since our aim is to study the relationship between success rate and display device, we restricted the number of clusters to between 4 and 6. The (x, y) coordinates of the centroids of the clusters are drawn from the mixture of normal distributions. Data points are placed around each centroid by considering the centroid as the mean of a normal distribution. To ensure comparability, each view includes a constant number of 200 data points assigned to each cluster.
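A sketch of how one such test sample can be generated; only the 4–6 clusters and the 200 points per cluster are taken from the description above, while the centroid spread and the per-cluster standard deviation are assumed values.

```python
import numpy as np

def generate_sample(rng, points_per_cluster=200, spread=1.0, sigma=0.08):
    """Generate one synthetic scatter plot sample as used in the study.

    The number of clusters is drawn between 4 and 6; each cluster receives
    points_per_cluster points drawn from a normal distribution around its
    centroid.  spread and sigma are assumed values.
    """
    n_clusters = rng.integers(4, 7)                      # 4 to 6 clusters
    centroids = rng.uniform(0.0, spread, size=(n_clusters, 2))
    points, labels = [], []
    for c, mu in enumerate(centroids):
        points.append(rng.normal(loc=mu, scale=sigma,
                                 size=(points_per_cluster, 2)))
        labels.append(np.full(points_per_cluster, c))
    return np.vstack(points), np.concatenate(labels)

# Example: X, y = generate_sample(np.random.default_rng(0))
```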
Participants: The user group was composed of 23 non-visualization experts (6 female and 17 male) with an average age of about 34, a minimum age of 18 and a maximum age of 60 years. Task: The task for the participants was to state the number of clusters that could be observed in a scatter plot display. The time for each scatter plot display was restricted to 15 seconds. Briefing and Execution: To brief the participants, test samples were shown and described in detail. Then, participants were shown scatter plots with 20 different data samples, 5 samples per visual encoding (without additional visual cues, with connection, with orientation and with shape). To avoid a learning effect, the sequence of the display devices, the secondary mappings, and the number of clusters were randomly chosen for each participant.
4.2
Results
H1: The primary mapping leads to success rates between 63.5% for large displays and 80.0% for medium-size displays. The particularly low success rate for large displays might be caused by a low density, i.e., the ratio of data points to screen space is low. The secondary mapping clearly improves the success rates in comparison with the primary mapping: for small displays in every case, and for medium and large displays in two of three cases in comparison to color (see also Table 1). H2: Table 1 shows that display size affects the success rate of the different mappings. While connection always outperforms the other mappings, Table 1 shows significant differences between the success rates of the secondary mappings. However, the good performance of connection was a surprising finding for us; the secondary mapping to connection seems to be related to the corresponding Gestalt law, which appears to maintain high success rates on all display sizes.

Table 1. Success rate of the user study for the different displays and mappings

mapping                 small display   medium display   large display
connection              89.6%           87%              97.4%
shape                   87.7%           73%              80%
orientation             86.9%           80.8%            61.7%
only primary mapping    75.6%           80%              63.5%

5
Conclusion and Future Work
In this paper we introduced a two-step mapping strategy distinguishing between primary and secondary mapping functions. The primary mapping defines the same basic representation for all displays, additionally taking the display characteristics and the typical viewing distances into account. The secondary mapping functions adapt the visual representation to the different output devices.
We also applied the two-step mapping to map-based visualizations to redundantly encode data. Here, the graphical elements are area objects, allowing the use of another set of additional visual attributes (e.g. texture). Furthermore, we are planning to extend this mapping to further techniques (e.g. Parallel Coordinates and Treemaps). An interesting topic for future work is the investigation of further visual variables, e.g. those provided by NPR techniques. Another interesting topic for further investigation relates to real viewing distances, which can in principle be provided by our smart lab. Acknowledgments. Axel Radloff is supported by a grant of the German National Research Foundation (DFG), Graduate School 1424 (MuSAMA).
References 1. Burghardt, C., Reisse, C., Heider, T., Giersich, M., Kirste, T.: Implementing scenarios in a smart learning environment. In: Proceedings of 4th IEEE International Workshop on Pervasive Learning, Hongkong (2008) 2. Cook, K., Thomas, J.: Illuminating the path: The research and development agenda for visual analytics (2005) 3. Robertson, P.K.: A methodology for choosing data representations. IEEE Computer Graphics and Applications 11, 56–67 (1991) 4. Senay, H., Ignatius, E.: A knowledge-based system for visualization design. IEEE Computer Graphics and Applications 14, 36–47 (1994) 5. Merino, C.S., Sips, M., Keim, D.A., Panse, C., Spence, R.: Task-at-hand interface for change detection in stock market data. In: Proceedings of the Working Conference on Advanced Visual Interfaces (AVI 2006), pp. 420–427. ACM, New York (2006) 6. Tominski, C., Fuchs, G., Schumann, H.: Task-driven color coding. In: Proceedings of the International Conference Information Visualisation (IV 2008), pp. 373–380. IEEE Computer Society, Washington, DC, USA (2008) 7. Kerren, A., Ebert, A., Meyer, J. (eds.): Human-Centered Visualization Environments: GI-Dagstuhl Research Seminar. Springer, Heidelberg (2007) 8. Encarna¸ca ˜o, J.L., Kirste, T.: Ambient intelligence: Towards smart appliance ensembles. In: From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, pp. 261–270 (2005) 9. Pirchheim, C., Waldner, M., Schmalstieg, D.: Deskotheque: Improved spatial awareness in multi-display environments. In: Proceedings of IEEE Virtual Reality Conference (VR 2009), pp. 123–126. IEEE Computer Society, Los Alamitos (2009) 10. Forlines, C., Lilien, R.: Adapting a single-user, single-display molecular visualization application for use in a multi-user, multi-display environment. In: Proceedings of the Working Conference on Advanced Visual Interfaces (AVI 2008), pp. 367–371. ACM Press, New York (2008) 11. Wigdor, D., Jiang, H., Forlines, C., Borkin, M., Shen, C.: Wespace: The design, development and deployment of a walk-up and share multi-surface collaboration system. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2009), pp. 1237–1246. ACM Press, New York (2009)
482
A. Radloff et al.
12. Tan, D.S., Gergle, D., Scupelli, P., Pausch, R.: Physically large displays improve performance on spatial tasks. ACM Transactions on Computer-Human Interaction 13, 71–99 (2006) 13. Follin, J.M., Bouju, A., FredericBertrand, B.P.: Management of multi-resolution data in a mobile spatial information visualization system. In: Proceedings of the International Conference on Web Information Systems Engineering Workshops (WISEW 2003), pp. 92–99. IEEE Computer Society, Los Alamitos (2003) 14. Huang, J., Bue, B., Pattath, A., Ebert, D.S., Thomas, K.M.: Interactive illustrative rendering on mobile devices. IEEE Computer Graphics and Applications 27, 48–56 (2007) 15. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2007) 26, 10 (2007) 16. B¨ uring, T., Reiterer, H.: Zuiscat: Querying and visualizing information spaces on personal digital assistants. In: Proceedings of the International Conference on Human Computer Interaction with Mobile Devices & Services (MobileHCI 2005), pp. 129–136. ACM Press, New York (2005) 17. Bertini, E., Santucci, G.: Improving 2d scatterplots effectiveness through sampling, displacement, and user perception. In: Proceedings of the International Conference Information Visualisation (IV 2005), pp. 826–834. IEEE Computer Society, Washington, DC, USA (2005) 18. Ellis, G., Dix, A.: Enabling automatic clutter reduction in parallel coordinate plots. IEEE Transactions on Visualization and Computer Graphics (Proceedings of InfoVis 2006) 12, 717–724 (2006) 19. Ellis, G., Dix, A.: A taxonomy of clutter reduction for information visualisation. IEEE Transactions on Visualization and Computer Graphics (Proceedings of InfoVis 2007) 13, 1216–1223 (2007) 20. Bertini, E., Santucci, G.: Quality metrics for 2d scatterplot graphics: Automatically reducing visual clutter. In: Butz, A., Kr¨ uger, A., Olivier, P. (eds.) SG 2004. LNCS, vol. 3031, pp. 77–89. Springer, Heidelberg (2004) 21. Bertini, E., Tatu, A., Keim, D.: Quality metrics in high-dimensional data visualization: An overview and systematization. In: InfoVis 2011 – IEEE Information Visualization Conference 2011, Providence, USA (2011) 22. Fuchs, G., Thiede, C., Sips, M., Schumann, H.: Device-based adaptation of visualizations in smart environments. In: Workshop Collaborative Visualization on Interactive Surfaces (CoVIS), IEEE VisWeek 2009 (2009) 23. Novotny, M., Hauser, H.: Outlier-preserving focus+context visualization in parallel coordinates. IEEE Transactions on Visualization and Computer Graphics (Proceedings of Vis 2006) 12, 893–900 (2006) 24. Thiede, C., Schumann, H., Rosenbaum, R.: On-the-fly device adaptation using progressive contents. In: Tavangarian, D., Kirste, T., Timmermann, D., Lucke, U., Versick, D. (eds.) IMC 2009. Communications in Computer and Information Science, vol. 53, pp. 49–60. Springer, Heidelberg (2009) 25. Ware, C.: Information Visualization: Perception for Design. Morgan Kaufmann, San Francisco (2000) 26. K¨ onig, A.: Die abh¨ angigkeit der sehsch¨ arfe von der beleuchtungsintensit¨ at. Sitzungsbericht der K¨ oniglich Preussischen Akademie der Wissenschaften zu Berlin 26, 559–575 (1897) 27. Kaufmann, H.: Strabismus. Georg Thieme Verlag (2003) 28. Terrenghi, L., Quigley, A., Dix, A.: A taxonomy for and analysis of multi-persondisplay ecosystems. Personal and Ubiquitous Computing 13, 583–598 (2009)
29. Mackinlay, J.: Automating the design of graphical presentations of relational information. ACM Transactions on Graphics 5, 110–141 (1986) 30. Healey, C.G.: Choosing effective colours for data visualization. In: Proceedings of IEEE Visualization (Vis 1996), pp. 263–270 (1996) 31. Bertin, J.: Graphics and Graphic Information-Processing. de Gruyter, Berlag (1981)
A New 3D Imaging System Using a Portable Two-Camera Omni-Imaging Device for Construction and Browsing of Human-Reachable Environments*

Yu-Tung Kuo1 and Wen-Hsiang Tsai1,2

1 Institute of Computer Science and Engineering, National Chiao Tung University, Taiwan
2 Department of Information Communication, Asia University, Taiwan
Abstract. As an improvement on existing street view systems which cannot be moved to certain environments to acquire images, a new 3D imaging system using a portable two-camera omni-imaging device is proposed. The device can be carried on one's back and moved on foot to any indoor and outdoor scene spot to acquire omni-images which cover the entire spherical view of the scene with an overlapping image band. By a space-mapping technique, the two omni-images are transformed into panoramic images, and the overlapping band is utilized to stitch them to form a single total panoramic image using a dynamic programming technique. Browsing of an environment with a series of scene spots can be conducted to see the perspective-view image of any spot in any direction by changing the viewpoint via the use of four tools − the current perspective-view image, the total panoramic image, a walking path, and a three-view diagram. Experimental results show the feasibility of the system.

Keywords: 3D imaging system, street view, omni-image, scene spot.
1 Introduction

In recent years, many street view systems have been developed [1, 7, 8]. Fig. 1 shows two examples. To explore a certain street scene, one may just browse a street view system to see the scene online without going there. This provides great value to applications like city touring, real estate sales, environmental management, etc. To collect street scenes, Google Street View Cars [2] were used (Fig. 2(a)). To enter narrow lanes, a Street View Trike was designed later (Fig. 2(b)). Also launched was a Street View Snowmobile for use on rough terrains (Fig. 2(c)). Aboard each of these vehicles is an imaging device with eight digital cameras, one fish-eye camera, and three laser range finders for scene collection. The device weighs about 150 kg. Though the use of the Street View Trike overcomes the incapability of the Street View Car in reaching narrow alleys, it is hard for a rider to pedal the heavy tricycle on steep ways. Also, although the Snowmobile has better mobility, it still cannot be ridden to some spaces like indoor stairways, mountain tracks, garden paths, etc. To reduce the system weight and widen the reachable environments, it is desired to design a new system which is portable by a person.

* This work was supported financially by the NSC project No. 98-2221-E-009-116-MY3.
Fig. 1. Examples of street view systems. (a) Street view of Google Map. (b) Street view of Bing Maps.
Specifically, as shown in Fig. 3, in this study we utilize only two omni-cameras to construct an imaging device which weighs less than 10 kg. The device is affixed to a steel holder which can be carried conveniently by a person on his/her back and moved anywhere on foot (Fig. 3(a)). The two omni-cameras are aligned coaxially in a back-to-back fashion (Fig. 3(b)), and can be used to acquire two omni-images simultaneously which cover the spherical surrounding space. To stabilize the imaging device during walking in order to avoid acquisition of vibrating images, a set of gimbals is used to hold the device (Fig. 3(c)).
Fig. 2. Google street view vehicles. (a) Street View Car. (b) Street View Trike. (c) Snowmobile.
Fig. 3. Proposed 3D imaging system. (a) The system on the back of a carrier. (b) The imaging device consisting of a pair of omni-cameras. (c) A set of gimbals used for stabilizing the cameras during walking.
With environment reachability and imaging capabilities provided, there still arise many other technical issues which should be solved before the proposed system can be used for environment browsing, including: (1) transformation of the omni-images taken at a scene spot into panoramic ones; (2) stitching of the two panoramic images
into a single one for use as an overview of the scene spot; (3) generation of a perspective-view image for inspection of the detail at the spot; (4) updating of the perspective-view image according to a chosen viewpoint; (5) provision of an interface for environment browsing, etc. All of these issues are solved in this study. In the remainder of this paper, we describe the system configuration and processes in Sec. 2, the adopted techniques for panoramic and perspective-view image generations in Sec. 3, and the proposed technique for panoramic image stitching in Sec. 4, followed by experimental results in Sec. 5 and conclusions in Sec. 6.
2 System Configuration and Processes

In this study, we align two catadioptric omni-cameras coaxially and back to back to form a new 3D imaging device, as shown in Fig. 3(b), for acquiring omni-images at scene spots. Each omni-camera includes a projective camera and a hyperboloidal-shaped mirror. The mirror bottom plane may be designed to go through a focal point of the hyperboloidal shape of the mirror. If two omni-cameras of this design are not connected seamlessly, a circular blind region not observable by either camera will appear undesirably. Therefore, in this study we design the mirror bottom plane to be lower than the focal point as shown in Fig. 4(a). This broadens the field of view (FOV) of the camera with an additional circular band as illustrated by the yellow portion in Fig. 4(a). Moreover, after two omni-cameras with such mirrors are attached back to back, a circular overlapping band will appear in both omni-images as illustrated by the yellow region in Fig. 4(b). Fig. 5 shows an example of real omni-images taken by an omni-camera pair of such a design, in which the yellow portions in Figs. 5(c) and 5(d) show the circular overlapping band. This overlapping band is utilized skillfully in a new technique proposed in this study for image stitching (described later in Sec. 4).
Fig. 4. Illustration of proposed imaging device. (a) Light ray reflection with focal point higher than mirror bottom. (b) Combination of proposed omni-camera pair, creating an overlapping image band in both omni-images. (c) Overlapping of light rays reflected by mirrors of the two omni-cameras.
Fig. 5. Overlapping bands in upper and lower omni-images. (a) Original upper omni-image. (b) Original lower omni-image. (c) Overlapping band in (a). (d) Overlapping band in (b).
The system processes include two stages: environment learning and browsing. The former includes three phases: (1) system setup − including calibration of the focal point location of the hyperboloidal shape, and establishment of a space-mapping table for each omni-camera for use in panoramic and perspective-view image generation; (2) construction of panoramic images − including image generation, inpainting [4], dehazing [5], blending [6], and stitching using a technique proposed in this study; (3) database establishment − recording relevant images, scene spot locations, and feature points in panoramic images. The environment browsing stage includes two phases: (1) display of the current scene spot view as a perspective-view image − including generation of an initial perspective-view image and updating of it in response to the user's view-changing operation; and (2) display of the interface − including generation of a panoramic image, a walking path, and a three-view diagram for scene spot browsing as well as updating of the interface.
3 Generations of Images of Various Views by Space Mapping

Generations of panoramic and perspective-view images in this study are based on the space-mapping approach proposed by Jeng and Tsai [3]. As illustrated in Fig. 6, each omni-image pixel p with coordinates (u, v) is the projection of a real-world point P, with the light ray of the point going onto the mirror and being reflected onto the image plane, resulting in an azimuth angle θ and an elevation angle ρ whose values may be obtained by table lookup using (u, v) according to a so-called pano-mapping table like that shown in Table 1. To construct the table, with the mirror surface assumed radially symmetric, a mapping function r = f(ρ) from ρ to the radial distance r of p with respect to the image center is set up first as f(ρ) = a0 + a1×ρ + a2×ρ² + … + a5×ρ⁵. Next, using the known image coordinates (ui, vi) and the corresponding known world coordinates (Xi, Yi, Zi) of six real-world landmark points Pi selected manually in advance, where i = 1~6, the coefficients a0 through a5 are obtained by computing the radial distance ri of each Pi as ri = √(ui² + vi²) and the elevation angle ρi of Pi as ρi = tan⁻¹(Zi/√(Xi² + Yi²)), and then solving the six simultaneous equations ri = f(ρi) = a0 + a1×ρi + a2×ρi² + … + a5×ρi⁵, where i = 1~6. Finally, for each real-world point Pij with azimuth-elevation angle pair (θi, ρj), the image coordinates (uij, vij) of the corresponding pixel pij in the omni-image are computed as uij = rj×cosθi and vij = rj×sinθi, where rj = f(ρj) = a0 + a1×ρj + a2×ρj² + … + a5×ρj⁵, and filled into the corresponding entry of the table (Table 1).
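To make the table construction concrete, the following is a minimal NumPy sketch of the two steps just described: fitting the coefficients a0 through a5 from the six landmark points, and filling the pano-mapping table. It is purely illustrative; the function and parameter names are ours, not part of the original system.

```python
import numpy as np

def fit_radial_mapping(image_pts, world_pts):
    """Fit r = f(rho) = a0 + a1*rho + ... + a5*rho^5 from six landmark points."""
    r = np.array([np.hypot(u, v) for u, v in image_pts])                  # radial distances r_i
    rho = np.array([np.arctan2(Z, np.hypot(X, Y)) for X, Y, Z in world_pts])  # elevations rho_i
    A = np.vander(rho, 6, increasing=True)     # rows [1, rho, rho^2, ..., rho^5]
    coeffs, *_ = np.linalg.lstsq(A, r, rcond=None)
    return coeffs                              # [a0, a1, ..., a5]

def fill_pano_mapping_table(coeffs, thetas, rhos):
    """Entry (i, j) of the table holds (u_ij, v_ij) for the pair (theta_i, rho_j)."""
    rs = np.polyval(coeffs[::-1], rhos)        # r_j = f(rho_j)
    u = np.outer(np.cos(thetas), rs)           # u_ij = r_j * cos(theta_i)
    v = np.outer(np.sin(thetas), rs)           # v_ij = r_j * sin(theta_i)
    return np.stack([u, v], axis=-1)           # shape (M, N, 2)
```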
Table 1. An example of pano-mapping table of size M×N

        ρ1          ρ2          ρ3          ρ4          ...   ρN
θ1      (u11, v11)  (u12, v12)  (u13, v13)  (u14, v14)  ...   (u1N, v1N)
θ2      (u21, v21)  (u22, v22)  (u23, v23)  (u24, v24)  ...   (u2N, v2N)
θ3      (u31, v31)  (u32, v32)  (u33, v33)  (u34, v34)  ...   (u3N, v3N)
θ4      (u41, v41)  (u42, v42)  (u43, v43)  (u44, v44)  ...   (u4N, v4N)
...     ...         ...         ...         ...         ...   ...
θM      (uM1, vM1)  (uM2, vM2)  (uM3, vM3)  (uM4, vM4)  ...   (uMN, vMN)

Fig. 6. An omni-camera with a mirror
With the table so constructed, afterward whenever the image coordinates (u, v) of a pixel p in a given omni-image are known and checked by table lookup to be located in entry Eij in Table 1, the corresponding azimuth-elevation angle pair (θi, ρj) in the table may be retrieved to describe the real-world point P corresponding to p. With the pano-mapping table T for an omni-camera C available, we may construct a panoramic image Iq of the size of T from a given omni-image Io taken by C by two steps: (1) for each entry Eij in T with the azimuth and elevation angles (θi, ρj), take out the coordinates (uij, vij) filled in Eij; and (2) assign the color values of pixel pij of Io at coordinates (uij, vij) to pixel qij of Iq at coordinates (i, j). Also, we may construct a perspective-view image IP from Io for any viewpoint by three steps: (1) map each pixel p in IP at coordinates (k, l) to a pair of elevation and azimuth angles (ρ, θ) in T according to the top-view geometry of the desired perspective view [3]; (2) find by table lookup the image coordinates (u, v) in T which correspond to (ρ, θ); and (3) assign the color values of the pixel at coordinates (u, v) in Io to pixel p in IP. For example, the panoramic images and a perspective-view image generated from the two omni-images shown in Fig. 5 are shown in Figs. 7(a) through 7(c).
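The following sketch illustrates the two-step panoramic construction just described (table lookup followed by color assignment), assuming the pano-mapping table is stored as an M×N array of image coordinates. The names are illustrative and the perspective-view case is omitted.

```python
import numpy as np

def build_panoramic(omni_img, pano_table):
    """Construct the panoramic image I_q from omni-image I_o via the pano-mapping table."""
    M, N = pano_table.shape[:2]
    pano = np.zeros((N, M, 3), dtype=omni_img.dtype)     # one pixel q_ij per table entry E_ij
    h, w = omni_img.shape[:2]
    for i in range(M):                                   # azimuth index (theta_i)
        for j in range(N):                               # elevation index (rho_j)
            u = int(np.clip(round(pano_table[i, j, 0]), 0, w - 1))
            v = int(np.clip(round(pano_table[i, j, 1]), 0, h - 1))
            pano[j, i] = omni_img[v, u]                  # copy color of p_ij to q_ij
    return pano
```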
Fig. 7. Images of different views generated from omni-images of Fig. 5. (a) Upper panoramic image. (b) Lower panoramic image. (c) Perspective-view image.
4 Stitching of Upper and Lower Panoramic Images

The new technique we propose for stitching two panoramic images of a scene spot into a total panoramic image includes the following steps.
(1) Localization of the overlapping band in the upper and lower omni-images
As shown in Fig. 4(c), a partial FOV A1 between L1 and L2 will appear in both omni-images as mentioned previously, and it is desired to find out the range of elevation angles covering A1 in either omni-image. The light ray (green in the figure) of each scene point on the boundary light ray L2 of A1 going through the mirror to the focal point O1 is confined to be within a band B1 formed by L2 and L1′, where L1′ is the extreme light ray formed by an infinitely far scene point on L2 and is thus parallel to L2. The elevation angles of the light rays within B1 can be figured out to be within the range of ρupper = ρ1 + ρ1′, where ρ1 is the range of the elevation angles formed by those of the boundary light ray L1 and the horizontal line H1 (the upper dotted line in the figure). The elevation angle of L1 may be found by image processing from any point on the "circular image boundary" in the upper omni-image, and the elevation angle of H1 is just 0°. Therefore, ρ1 may be computed explicitly, and by symmetry, so is ρ2. Furthermore, since L1′ is parallel to L2, ρ1′ may be figured out to be equal to ρ2. Consequently, we have ρupper = ρ1 + ρ1′ = ρ1 + ρ2, which is computable, and by symmetry so is the corresponding range ρlower for the lower omni-image, which actually just equals ρupper. Either of ρupper or ρlower finally may be transformed into an image band like the yellow circular region shown in Fig. 5(c) or 5(d).

(2) Selection of feature matching lines
It seems feasible to stitch the two panoramic images by matching the overlapping image parts found in the last step. However, this does not work because the two image parts actually are not identical. One reason is that the two cameras take images at different heights so that the image scaling differs in the two images. Another reason is that radial-directional distortions exist on the mirrors of the two cameras so that different nonlinearities are created in the acquired images. Nevertheless, it is found in this study that the rotational invariance property of omni-imaging [3] can be used to solve this problem. The property says that image pixels in a certain radial direction in an omni-image are just the projections of some real-world points appearing in the same direction. This in turn says that for each radial direction, there exist respectively in the omni-images two corresponding radial lines with identical image contents which are the projections of an identical set of real-world points. Accordingly, we propose in this study the idea of using radial lines to conduct image matching for stitching the upper and lower panoramic images. Such radial matching lines are selected manually in the learning stage to be of a set of 90 angular directions with equal intervals of 4°. For each of such lines with angular direction θ, a pair of matching lines is set up, one in the upper omni-image and the other in the lower, both of the same direction θ. To facilitate manual pairings of such matching lines, as illustrated by Fig. 8, a cylindrical-shaped grid pattern with a series of 90 equally-distanced colored vertical lines was designed to wrap around each omni-camera, and an omni-image of the pattern was then acquired. From the two acquired images, the pairing work was conducted by visual inspection with the line locations recorded in the previously-mentioned database. Such work needs to be conducted only once in the learning stage as long as the used omni-cameras are not changed.
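As a hedged illustration of step (2), the sketch below generates the omni-image coordinates along one radial matching line from the fitted mapping r = f(ρ), restricted to the elevation range of the overlapping band, together with the 90 directions at 4° intervals. Function and parameter names are our own assumptions.

```python
import numpy as np

def radial_line_pixels(theta_deg, coeffs, rho_band, centre, samples=64):
    """Sample omni-image coordinates along one radial matching line of direction theta."""
    theta = np.radians(theta_deg)
    rhos = np.linspace(rho_band[0], rho_band[1], samples)  # elevations inside the overlapping band
    rs = np.polyval(coeffs[::-1], rhos)                    # radial distances f(rho)
    us = centre[0] + rs * np.cos(theta)
    vs = centre[1] + rs * np.sin(theta)
    return np.stack([us, vs], axis=1)                      # (samples, 2) pixel positions

# The 90 matching directions at equal 4-degree intervals:
matching_directions = [4 * k for k in range(90)]
```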
Fig. 8. Illustration of finding corresponding matching lines manually by visual inspection. (a) Grid pattern wrapped into a cylinder. (b) Corresponding matching lines in two omni-images.
(3) Feature point matching on matching lines
Stitching of the panoramic image pair proposed in this study starts from feature point matching within the overlapping band on each matching line pair. The feature points used are edge points detected on the matching lines in the grayscale versions of the panoramic images. For example, the feature point detection result with Fig. 7 as input is shown in Fig. 9. Due to differences in noise, distortion, and scaling between the two panoramic images, feature points appearing in the upper matching line might not appear in the lower, and vice versa, resulting in insertions and deletions of feature points in the matching lines. Regarding feature points which match mutually as substitutions, we find it appropriate in this study to use the dynamic programming technique to conduct optimal feature point matching on each matching line pair.
Fig. 9. Feature detection result with Fig. 7 as input. (a) Result of processing upper panoramic image of Fig. 7(a). (b) Result of processing lower panoramic image of Fig. 7(b).
More specifically, assume that two radial matching lines are represented respectively by two feature point sequences lupper = <u1, u2, …, um> and llower = <l1, l2, …, ln>, where each ui or lj is a feature point. After transforming the RGB colors of the image points into corresponding HSI values (hue, saturation, and intensity values), each feature point wi (ui or li) is given an attribute which is the hue value at wi, denoted as H(wi). Assume that the maximum hue value is Hmax. Each of the three previously-mentioned types of edit operations, insertion, deletion, and substitution, causes a cost, which is defined as: (1) substitution cost C(u → l) = |H(u) − H(l)|/Hmax for a feature point u with attribute H(u) substituted by another l with attribute H(l); (2) insertion cost C(φ → u) = 0.5 for a feature point u being inserted, with φ meaning null; or (3) deletion cost C(l → φ) = 0.5 for a feature point l being deleted. All the costs so defined are within the range of 0~1. Then, matching of feature points on a matching line pair is equivalent to finding the best edit sequence taking lupper to llower according to the dynamic programming process, which we describe as follows, where D is an
m×n table as illustrated in Fig. 10: (1) set D(0, 0) = 0; (2) for all 0 ≤ i ≤ m and 0 ≤ j ≤ n, fill each unfilled D(i, j) with the minimum of the following three values:
Vi = D(i − 1, j) + C(φ → ui)   (for inserting ui in lupper);
Vs = D(i − 1, j − 1) + C(ui → lj)   (for substituting ui in lupper with lj in llower);
Vd = D(i, j − 1) + C(lj → φ)   (for deleting lj from llower);
(3) with D(m, n) being the total cost of the optimal editing sequence S taking lupper to llower, trace the table back to obtain S for use in the subsequent steps.

(4) Use of knowledge to improve feature point matching
The attribute used above, the hue value, is too simple to yield correct substitution results on all matching line pairs. So, additional knowledge about omni-imaging was exploited, yielding two constraints for use in the dynamic programming so that unreasonable feature point matching results based on substitutions can be excluded. The first constraint concerns the consistency of the distance of the current matching point to the former one: for a pair of feature points pu and pl on the upper and lower matching lines, respectively, to be valid, their distances du and dl to their respective preceding matching feature points pu′ and pl′ found in the dynamic programming process should be roughly the same (because the upper and lower panoramic images, though different in content for the reasons described previously, are still similar to a certain degree). The second constraint says that for two feature points to be a matching pair, their respective elevation angles can differ only within a tolerable range Re, as can be verified from Fig. 11. Specifically, if Lup is the light ray of a feature point P observable by the upper camera, then in the extreme case that P is located infinitely far away, the corresponding light ray going into the lower camera is just Llow, which is parallel to Lup, so that the elevation angles ρlow and ρup are equal. Accordingly, the range of elevation angles for P to be observable by both cameras can be figured out to be from ρs2 to ρs2 + ρ2 − ρlow = ρs2 + ρ2 − ρup, which is computable because ρs2 is the elevation angle of L2, ρup is the elevation angle of point P, and ρ2 is as shown in Fig. 4(c). This range can be used as Re mentioned above.
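A minimal sketch of the edit-distance recurrence described in steps (1) to (3), with the constraints of step (4) supplied as a predicate that enables or disables substitutions, is given below. The hue attributes, the predicate, and all names are placeholders for the quantities defined in the text; backtracking from D(m, n) recovers the optimal edit sequence S.

```python
def match_feature_points(H_upper, H_lower, h_max, allow_sub=lambda i, j: True):
    """Edit-distance style matching of two feature point sequences (a sketch).

    H_upper, H_lower : hue attributes H(u_i) and H(l_j) of the detected feature points
    h_max            : maximum hue value, used to normalise the substitution cost
    allow_sub        : predicate encoding the two extra constraints (distance
                       consistency and elevation-angle tolerance R_e)
    Returns the filled cost table D.
    """
    INS = DEL = 0.5                                   # insertion / deletion costs from the text
    m, n = len(H_upper), len(H_lower)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + INS
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + DEL
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            candidates = [D[i - 1][j] + INS,          # insert u_i
                          D[i][j - 1] + DEL]          # delete l_j
            if allow_sub(i - 1, j - 1):               # substitution only if constraints hold
                sub = abs(H_upper[i - 1] - H_lower[j - 1]) / h_max
                candidates.append(D[i - 1][j - 1] + sub)
            D[i][j] = min(candidates)
    return D
```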
Fig. 10. Illustration of matching result by dynamic programming
Fig. 11. Illustration of difference of elevation angles of a feature point visible in both cameras
Fig. 12. Result of dynamic programming applied to feature points in Fig. 9 with red points meaning matching points. (a) Result of upper panoramic image. (b) Result of lower panoramic image.
These two constraints are applied in Step (2) of the dynamic programming process: only when both constraints are satisfied can a substitution operation be considered for a feature point pair; otherwise, only the insertion and deletion operations are used. The result of applying this revised dynamic programming process to Fig. 9 is shown in Fig. 12, where the red points represent the matching pixels obtained by substitutions and the blue ones mean those obtained by insertions and deletions.

(5) Stitching of panoramic images by homographic transformation
The above steps result in a set of matching feature point pairs specified by substitutions. Then, on each matching line, we collect all the matching points and take out the two outermost ones (the highest and the lowest). Connecting these outermost points of all the matching lines in the upper and lower panoramic images, respectively, we get four piecewise-linear curves, as shown in Fig. 13(a). These curves form two quadrilaterals between every two pairs of matching lines, which we denote as Qup and Qlow, as illustrated in Fig. 13(b). Then, we register Qlow and Qup by homographically transforming Qlow to become Qup [6]. Finally, we take a blending of Qup and Qlow to yield a new quadrilateral denoted as Qnew. In addition, we transform similarly the patch below Qlow, denoted as Rlow, to become a shape-predetermined patch, denoted as Rup, below Qup, resulting in a new patch Rnew. Also, let the untouched patch above Qup in the upper panoramic image be denoted as Sup. Then, we take the final stitching result to be the union of Rnew, Qnew, and Sup. This completes the stitching of the vertical band between a pair of matching lines. This process is repeated until all vertical bands are processed. The result of applying this stitching step to Fig. 12 is shown in Fig. 14(a). Another result is shown in Fig. 14(b) with the original omni-images and created panoramic images shown in Figs. 14(c) through 14(f).
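The registration and blending of Qlow onto Qup can be sketched with standard OpenCV calls as below. This is a simplified illustration that warps and blends whole images rather than the individual quadrilateral and patch regions (Qnew, Rnew, Sup) described above; the corner arrays are assumed to hold corresponding points in the same order.

```python
import cv2
import numpy as np

def register_and_blend(upper_img, lower_img, q_up, q_low, alpha=0.5):
    """Warp Q_low onto Q_up with a homography and blend the result with the upper image.

    q_up, q_low : 4x2 arrays of corresponding corner points of the two quadrilaterals.
    """
    H = cv2.getPerspectiveTransform(np.float32(q_low), np.float32(q_up))  # Q_low -> Q_up mapping
    h, w = upper_img.shape[:2]
    warped_low = cv2.warpPerspective(lower_img, H, (w, h))                # bring lower into upper frame
    return cv2.addWeighted(upper_img, alpha, warped_low, 1.0 - alpha, 0.0)  # simple blending
```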
5 Experimental Results

We have conducted experiments in several real environments, and some experimental results of one of them, a stairway in a library which other street view systems cannot reach, are shown here. The system interface constructed for environment navigation is shown in Fig. 15. A user first selects the environment from the database. The system then displays a perspective-view image of the first scene spot (at the top-left corner of
the interface), which is the view seen by the system operator when the environment scenes were collected during the environment learning stage. Also displayed are: (1) a walking path on the perspective-view image, which consists of step-by-step arrows representing the observable scene spots in the current view; (2) a three-view diagram showing the observable scene spots connected by line segments from three directions (the front, the top, and a side); and (3) a panoramic image of the current scene spot.
Fig. 13. Stitching of panoramic images by homographic transformation. (a) Illustration of combination of stitched parts. (b) Illustration of homographic transformation between two quadrilaterals.
Fig. 14. Results of stitching of panoramic images. (a) Result of applying homographic transformation to Fig. 12. (b) Another result of stitching with omni-images and respective panoramic images shown in (c)~(f).
Then, the user may navigate the environment further at the current spot in six ways: (1) changing the viewpoint through the perspective-view image (by clicking and dragging the image to turn into a desired view); (2) changing the viewpoint through the total panoramic image (by clicking a point on the image to select a new view); (3) going forward or backward by navigating the walking path (by clicking on an arrow of the path in Fig. 15); (4) going forward or backward step by step on the walking path (by clicking on the red or blue arrow in Fig. 15); (5) changing the scene spot through three-view diagrams (by clicking on a point in any of the three views to see the closest scene spot); (6) zooming in or out of the current view (by clicking on the perspective-view image). For example, with the scene shown in Fig. 15(a) as the current spot, after the topmost arrow is clicked by Way (3) above, the next scene spot appearing in the interface is shown in Fig. 15(b).
Fig. 15. Interface and environment navigation. (a) Interface for a scene spot. (b) Interface for another spot.
6 Conclusions

A new 3D imaging system with a pair of portable omni-cameras has been proposed, in which the cameras are aligned coaxially and back to back. Panoramic and perspective-view images are generated from acquired omni-image pairs based on a space-mapping approach. The overlapping band in the omni-images is utilized skillfully to stitch a pair of generated panoramic images into a single one by dynamic programming for use in environment browsing. The system has several merits: (1) it is light and easy to carry to any indoor/outdoor place; (2) it can be used to collect images on narrow, non-planar, or slanted paths where existing street view vehicles cannot navigate; (3) only two omni-images are acquired to construct various images for each scene spot; (4) perspective-view images can be generated for scene observation from any view direction; (5) six ways of environment browsing via a perspective-view image, a panoramic image, a walking path, and a three-view diagram are provided.
References

1. Wikipedia. Google Street View (April 2010), http://en.wikipedia.org/wiki/Google_Street_View
2. Vincent, L.: Taking Online Maps Down to Street Level. IEEE Computer 40(12), 118–120 (2007)
3. Jeng, S.W., Tsai, W.H.: Using Pano-mapping Tables for Unwarping of Omni-images into Panoramic and Perspective-view Images. IET Image Processing 1(2), 149–155 (2007)
4. Telea, A.: An Image Inpainting Technique Based on The Fast Marching Method. Journal of Graphics Tools 9(1), 25–36 (2004)
5. He, K., Sun, J., Tang, X.: Single Image Haze Removal Using Dark Channel Prior. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1957–1963 (2009)
6. Goes, J., et al.: Warping & Morphing of Graphical Objects. Morgan Kaufmann, San Francisco (1999)
7. Huaibin, Z., Aggarwal, J.K.: 3D Reconstruction of An Urban Scene from Synthetic Fisheye Images. In: Proc. 4th IEEE Southwest Symp. on Image Analysis & Interpretation, pp. 219–223 (2000)
8. Roman, A.: Multiperspective Imaging for Automated Urban Imaging. Doctoral dissertation, Dept. Electrical Eng., Stanford University (2006)
Physical Navigation to Support Graph Exploration on a Large High-Resolution Display

Anke Lehmann, Heidrun Schumann, Oliver Staadt, and Christian Tominski

Institute for Computer Science, University of Rostock
A.-Einstein-Str. 21, 18059 Rostock, Germany
http://vcg.informatik.uni-rostock.de
Abstract. Large high-resolution displays are potentially useful to present complex data at a higher level of detail embedded in the context of surrounding information. This requires an appropriate visualization and also suitable interaction techniques. In this paper, we describe an approach to visualize a graph hierarchy on a large high-resolution display and to interact with the visualization by physical navigation. The visualization is based on a node-link diagram with dynamically computed labels. We utilize head tracking to allow users to explore the graph hierarchy at different levels of abstraction. Detailed information is displayed when the user is closer to the display and aggregate views of higher levels of abstraction are obtained by stepping back. The head tracking information is also utilized for steering the dynamic labeling depending on the user’s position and orientation. Keywords: Information Visualization, Interaction, Graph Exploration, Large High-Resolution Displays, Head Tracking.
1 Introduction
Visualization is challenged by the fact that the available display space usually cannot keep up with the amount of information to be displayed. Commonly this problem is addressed by a kind of overview+detail technique [1], where users interactively switch between an abstract overview of the entire data and detailed views of smaller parts of the data. Large high-resolution displays (LHRDs) are an emerging technology and a promising alternative to the classic overview+detail approaches. LHRDs combine a large physical display area with high pixel density for data visualization (see Ni et al. [2]). A unique feature of LHRDs is that they allow users to perceive the global context of complex information presented on the display by stepping back, while enabling them to explore finer detailed data by stepping closer. This makes LHRDs suitable for visualization applications, such as the visual exploration of graphs. However, because classic mouse and keyboard interaction is infeasible in LHRD environments, we require new approaches to interact with the visualization. Ball et al. [3] found that physical navigation is quite useful in this regard. We utilize
this fact for a basic exploration task. In particular, the user's physical position in front of an LHRD determines the level of abstraction of the visual representation of hierarchical graphs. Graph nodes are expanded dynamically when the user moves closer to the display (i.e., more detail). When the user steps back, nodes are collapsed, by which a higher level of abstraction (i.e., less detail) is obtained. This provides a natural way to get an overview or detailed information, similar to virtual zoom interaction via a mouse scroll wheel. A benefit of our approach is that the users do not need their hands for the interaction, which opens up possibilities to use them for other tasks. Switching between different levels of abstraction also requires adaptation of the visual representation. While the operations expand and collapse correspond to addition and removal of nodes in the graph layout, respectively, the question remains how to appropriately label the differently abstracted objects shown on the LHRD. We utilize an existing labeling algorithm (i) to ensure readability of labels depending on the level of abstraction (determined by the user-display distance) and (ii) to avoid readability problems caused by LCD-panel-based LHRDs. In the next section, we will briefly describe basics of graph exploration and related work on interactive visualization on LHRDs. In Section 3, we will describe details of our approach to support graph exploration on LHRDs. Section 4 will outline the design of our prototype system and preliminary user feedback. We will conclude with a summary and an outlook on future extensions and applications of our approach in Section 5.
2 Related Work
Our literature review is structured as follows. We briefly describe graph exploration in general and then consider visualization on large displays and shed some light on what modern input concepts beyond mouse and keyboard input can offer for interactive visualization.

2.1 Graph Exploration
Lee et al. [4] identified several tasks that users seek to accomplish with the help of graph visualization. Lee et al. structure low-level tasks by means of a taxonomy with three high-level categories: topology-based tasks, attribute-based tasks, and browsing-tasks. In each of these categories, we find tasks of exploratory nature, such as “find adjacent nodes”, “find edge with largest weight”, “follow a given path”, or “return to previously visited node”. In general, interactive graph exploration is supported by appropriate visualization techniques (see Herman et al. [5]) and well-known overview+detail and focus+context interaction (see Cockburn et al. [1]). Such techniques enable users to zoom in order to view details or an overview, and to pan in order to visit different parts of the data. A common approach to support the exploration of larger graphs is to structure (e.g., to cluster or aggregate) them hierarchically (see Herman et al. [5]).
Elmqvist and Fekete [6] describe the various advantages of this approach, including the fact that it allows us to visualize different abstractions of the underlying data. In order to access the level of abstraction that is required for the task at hand, users interactively expand or collapse individual nodes of the hierarchy or switch between entire levels of the hierarchy. Additionally, special lens techniques have been proposed for interacting in locally restricted areas of the visualization. For instance, van Ham and van Wijk [7] use a virtual lens to automatically adjust the level of abstraction (i.e., expand/collapse nodes) depending on the position of the lens. Tominski et al. [8] and Moscovich et al. [9] describe lenses that create local overviews of the neighborhood of selected nodes and provide direct navigation along paths through a graph. All of these examples demonstrate the usefulness of dedicated interaction techniques for graph exploration. However, most of the existing techniques have been designed for classic desktop environments with regular displays and mouse interaction. In this work, we adapt existing concepts to make them applicable in LHRD environments. The exploration in terms of different levels of abstraction is of particular interest to us, because of the inherent overview+detail capabilities of LHRDs.

2.2 Interactive Visualization on Large Displays
The advantages of LHRDs (i.e., many pixels and natural overview+detail) make them an interesting alternative to desktop-based visualization scenarios. Next, we review existing approaches that utilize LHRDs for visualization. As demonstrated by Keim et al. [10], pixel-based visualization approaches are a good example of visualization techniques that benefit from the larger number of pixels available on LHRDs. Another example is the visualization of complex spatiotemporal data. Booker et al. [11] explain that the exploration of spatiotemporal data can significantly benefit from LHRDs. They also conjecture major benefits for other information exploration scenarios. While these examples focus on the output capabilities of LHRDs, other researchers have addressed questions of interaction. While mostly hand-held devices (e.g., wireless mouse, tracked button device) are used to pan or zoom in a desktop-based visualization application, such devices, if usable at all, are cumbersome in LHRD environments. A commonly accepted means to drive interaction in LHRDs is tracking the user's physical movements. Vogel and Balakrishnan [12] use head tracking to switch between the display of ambient, public, or personal information depending on the user's distance in front of a large display. Ashdown et al. [13] switch the mouse pointer between monitors by head tracking to increase mouse movements in multi-monitor environments. Ball et al. [14,3] identify that in LHRDs the performance of simple navigation tasks for finely detailed data can be increased by using physical navigation. This knowledge is realized by Peck et al. [15] who adjust the mechanisms for interactive selection and navigation based on the user's distance to the display. A survey of interaction techniques for LHRDs is presented by Khan [16].
Although physical interaction has a number of advantages, there are some limitations. For example, Ball et al. [17] use head rotation as input for panning navigation in a geospatial visualization application. However, virtual navigation and head motion were tightly coupled, which made it difficult for users to pan the map while also scanning it. Consequently, head tracking is better suited for simple interaction mechanisms than for fine motor control. Overall, prior work shows that using physical navigation in LHRD environments can improve both perception of the visualized data [14,3] and interaction with the visualization [15]. Considering these advantages, we devised an interactive visualization approach to support the exploration of graphs on LHRDs at different levels of abstraction.
3 Exploring Graphs on a Large High-Resolution Display
In recent years, graphs have gained importance in many application domains such as social networks, power networks, climate networks, biological networks, and others. We present an approach that is suitable to support graph exploration by exploiting the advantages of LHRDs and physical navigation. First we will explain the data that we address, which are basically graph hierarchies. Then we describe the node-link-based visualization employed in this work. Finally, we introduce novel interaction techniques for adjusting the level of abstraction of the displayed data based on head tracking.

3.1 Data
We use a graph hierarchy H as the main data structure to drive the exploration [5,6]. H is a rooted tree whose leaves represent information on the finest level of granularity. Non-leaf nodes of H are abstractions of their corresponding child nodes. Any "full cut" through H defines a view of the data with a specific level of abstraction. There are two basic means to enable users to choose a suitable level of abstraction: (i) one can globally switch from one level of the hierarchy to another or (ii) one can expand and collapse nodes in order to adjust the level of abstraction locally. Switching the level globally means replacing all nodes of the current level with the nodes of another level. On the other hand, expansion of a non-leaf node locally replaces that node with its children, which results in more information and less abstraction. Collapsing a set of nodes replaces these nodes with their parent node, which results in less information and more abstraction. For the purpose of demonstration, we visualize the hierarchy of the ACM computing classification system, where nodes correspond to text labels of the categories and edges between nodes indicate related categories1. With its 1473 nodes and 464 edges this data set is not too large and the explicitly given labels on the different levels of the hierarchy are expressive and easy to understand. Both facts make this data set quite useful for first experiments.
1 http://dspace-dev.dsi.uminho.pt:8080/en/addon acmccs98.jsp
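The expand/collapse mechanics and the notion of a "full cut" can be illustrated with a small, purely hypothetical data structure such as the following; it is not the implementation used in the prototype.

```python
class HierarchyNode:
    """Node of a graph hierarchy H; leaves carry the finest-grained data."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []
        self.expanded = False

    def current_cut(self):
        """Return the nodes of the current 'full cut' through the hierarchy,
        i.e. the view at the currently selected levels of abstraction."""
        if self.expanded and self.children:
            return [n for c in self.children for n in c.current_cut()]
        return [self]

    def expand(self):
        self.expanded = True     # replace this node by its children (more detail)

    def collapse(self):
        self.expanded = False    # replace the children by this node (more abstraction)
```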
Note that edges have been added by hand, which is the reason why an edge that exists on a higher level of abstraction may have no corresponding edges on lower levels. In particular, no edges exist on the finest granularity of the ACM data set.

3.2 Visualization
The visualization relies on a basic node-link representation (see Figure 1). The required graph layout has been pre-computed using a recursive hybrid algorithm: a force-directed mechanism determined node positions for connected parts of the graph [18] and a variant of the squarified treemap layout handled unconnected parts [19]. In a second phase, node positions were adjusted manually to better echo the semantics of the data (e.g., the ordering of categories). In a third phase, an automatic local adjustment of the layout is computed to prevent nodes from being placed behind bezels of our LCD-panel-based LHRD. In our node-link visualization, nodes are visualized as spheres. To separate the various hierarchy levels, the spheres are colored depending on the depth of the node in the graph hierarchy using a sequential color scheme from ColorBrewer2. The links between spheres are shown as colored lines. Along a line, we interpolate the color of its two incident spheres. This way, edges between nodes of the same level in the hierarchy (i.e., no interpolation, because both spheres have the same color) are clearly distinguishable from edges between nodes of different levels (i.e., color interpolation, because sphere colors are different). Although the layout already communicates the hierarchical structure of the data quite well, we use convex hulls as visual envelopes for nodes that share the same parent.
Fig. 1. Node-link view of the top level of the ACM computing classification system
2 http://www.colorbrewer.org
Because we work with a graph whose nodes represent category captions, we have to attend to label placement. This is also relevant when it comes to showing text labels with associated node properties (e.g., node degree, node depth, etc.). To ensure label readability, we have to account for two aspects. First, we have to deal with bezels of LCD-panel-based LHRDs. Usually, bezels are handled as part of the display space and an empty virtual space is placed behind them in order to make virtual objects (e.g., spheres) drawn across monitors appear more natural and undistorted. What might be good for virtual objects is unfavorable for labeling, because we risk losing important textual information in the empty virtual space behind the bezels. In order to avoid this, labels must not overlap the bezels. Second, we have to account for the larger range of distances from which labels must be readable. This requires adjustment of label sizes depending on the viewing distance. Due to these requirements, the labeling needs to be computed dynamically in real time. We employ the labeling algorithm by Luboschik et al. [20], which satisfies our needs. The algorithm allows the placement of so-called conflict particles in those parts of the labeling space where labels must not appear. We insert such particles exactly where the bezels are located. Hence, we can guarantee that labels never overlap with bezels. However, labels that are larger than the size of a single LCD panel cannot be placed anymore. To mitigate this problem, we split labels where possible. Once the label positions have been calculated, we use scalable vector fonts for the text rendering. Given the number of pixels of LHRDs, we are theoretically able to visualize the entire data. However, in that case, it might be difficult for viewers to recognize the multitude of graph elements. Moreover, as in our example, interesting findings might be revealed on different levels of a graph hierarchy. Therefore, it is important to enable the user to explore the data interactively.
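As an illustration of how the bezel regions can be supplied to the labeling step, the sketch below enumerates the bezel areas of a tiled wall as rectangles that could then be filled with conflict particles. The parameters and the interface to the labeling algorithm are our own assumptions, not the actual API of the algorithm by Luboschik et al.

```python
def bezel_rectangles(cols, rows, panel_w, panel_h, bezel):
    """Enumerate the bezel regions of a tiled LCD wall as rectangles (x, y, w, h).

    These rectangles can be filled with conflict particles so that the labeling
    never places a label across a bezel. All parameter names are illustrative.
    """
    rects = []
    total_h = rows * (panel_h + bezel) - bezel
    total_w = cols * (panel_w + bezel) - bezel
    for c in range(1, cols):                           # vertical bezels between columns
        x = c * (panel_w + bezel) - bezel
        rects.append((x, 0, bezel, total_h))
    for r in range(1, rows):                           # horizontal bezels between rows
        y = r * (panel_h + bezel) - bezel
        rects.append((0, y, total_w, bezel))
    return rects
```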
3.3 Interaction
When visually exploring a graph hierarchy, the adjustment of the level of abstraction is an essential interaction. We realize this interaction based on head tracking. By using head tracking we obtain information about the user’s head position and orientation (6 degrees of freedom) in front of the display wall. As mentioned in Section 2.2, care must be taken that the interaction be robust against small head motions. We implemented two alternative methods to utilize the head tracking information: (1) the zone technique and (2) the lens technique. Both techniques allow the user to change the level of abstraction of the displayed graph hierarchy by moving in front of the LHRD. Transitions from one level to another are animated to retain the user’s mental map. The zone technique corresponds to a level-wise global adjustment, whereas the lens technique realizes a local adjustment in those parts of the visualization that the user is currently looking at. This input offers a hands-free interaction method, where the hands can be used for finer motor control tasks (e.g., interactive manipulation). The zone technique supports intuitive level switching based on head tracking, where only the user’s distance is considered. For the zone technique, the
interaction space in front of the wall is divided into parallel zones, one zone for each level. The interaction zone for the root of the graph hierarchy is farthest away from the display; the zone for the deepest leaves of the graph hierarchy is closest to the display. As the user navigates physically (i.e., steps forward or backward), we switch the level of the graph hierarchy. The closer the user moves toward the display, the more detail is displayed. In order to avoid sudden unintentional switches, additional threshold regions are inserted between two adjacent zones (see Figure 2, left). Thus the user has to cross the threshold region entirely to trigger a switch. The advantage of the zone technique is the natural change of the level of detail by forward and backward movement. However, the user changes the level of detail globally for the entire graph. This reveals edges within the same level, but edges across different levels remain hidden (e.g., a relation between a node and the children of another node). The lens technique makes it possible to keep the overall context while providing detailed information for a user-selected part of the data. The selection is done by an approximation of the gaze direction. Head tracking delivers information about the head position and orientation. Consequently, we create a viewing cone from the user to the display screen to approximate the user's field of view (see Figure 2, right). As visual feedback, the field of view is shown as a lens ellipse on the LHRD (see Figure 3). As with the zone technique, the distance determines the currently displayed level of abstraction. By steering the lens with small head movements, the user is able to scan the graph and get insight into detailed information in different parts of the data. Nodes that enter the lens are dynamically expanded to reveal the next lower level of the graph hierarchy (i.e., more detail). When a node exits the lens, it is collapsed to return to the original level of detail.

Fig. 2. Illustration of zone technique (left) and lens technique (right)
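A minimal sketch of the distance-to-level mapping with threshold regions used by the zone technique might look as follows; the zone layout, parameter names, and hysteresis rule are illustrative assumptions rather than the exact behavior of our implementation.

```python
def zone_level(distance, zone_width, margin, num_levels, current_level):
    """Map the user's distance from the display wall to a hierarchy level.

    The space in front of the wall is divided into num_levels parallel zones of width
    zone_width; the zone closest to the display maps to the deepest level, the farthest
    zone to the root. A switch is only triggered once the user has moved at least
    `margin` beyond the boundary of the current zone, i.e. has crossed the threshold
    region entirely (hysteresis against unintentional switches).
    """
    lo = current_level * zone_width - margin
    hi = (current_level + 1) * zone_width + margin
    if lo <= distance < hi:
        return current_level                       # still inside current zone + threshold
    new_level = int(distance // zone_width)
    return max(0, min(num_levels - 1, new_level))  # clamp to valid hierarchy levels
```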
Fig. 3. Exploring a graph with the lens technique
To keep the amount of displayed information understandable, the lens size is adjusted to the user's distance from the LHRD. When the user steps closer to the display, the lens size decreases. This naturally matches the smaller field of view that the user covers when standing close to the LHRD. The lens technique offers an interesting focus+context interaction, where the overall context is preserved while the user accesses details by moving the head. However, unintentional head movements can cause frequent expand/collapse operations, which cause a kind of disturbing flicker. Therefore, we enabled the Kalman filter provided by the tracking software to reduce the natural head tremor. Both of our interaction techniques require that the labels be adjusted depending on the user distance (and level of abstraction) in order to improve readability. When the user is close to the display we can decrease the font size to free display space, which allows the algorithm to place more labels (see Figure 3). In contrast, when the user is far away from the display, small labels are unreadable. In such a setting, we traverse up the graph hierarchy and pick aggregated labels, which are rendered using a larger font face. We use the common "Arial" font in a range from 12pt to 32pt.
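For illustration, the lens placement from head tracking can be approximated by intersecting the viewing cone with the display plane, as in the following sketch. The circular lens shape and all names are simplifying assumptions; the prototype uses a lens ellipse and the tracker's full 6-degrees-of-freedom data.

```python
import numpy as np

def lens_on_display(head_pos, view_dir, half_angle):
    """Approximate the viewing-cone lens on the display plane z = 0 (a sketch).

    head_pos   : 3D head position from the tracker, with the display plane at z = 0
    view_dir   : vector approximating the gaze direction (from head orientation)
    half_angle : half opening angle of the viewing cone in radians
    Returns the lens centre on the display and its radius, or None if no intersection.
    """
    d = np.asarray(view_dir, dtype=float)
    d = d / np.linalg.norm(d)
    if abs(d[2]) < 1e-6:
        return None                              # gaze is parallel to the display
    t = -head_pos[2] / d[2]                      # ray/plane intersection parameter
    centre = np.asarray(head_pos, dtype=float) + t * d
    distance = abs(head_pos[2])
    radius = distance * np.tan(half_angle)       # lens shrinks as the user steps closer
    return centre[:2], radius
```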
4 Prototype and Preliminary User Feedback
We implemented a prototype to study the visual exploration technique on our LHRD. Next we describe the technical details and report on preliminary user feedback.

4.1 Technical Details
We use a flat tiled LCD wall consisting of 24 DELL 2709W displays, where each tile has a resolution of 1920 × 1200 pixels, with a total resolution of 11520 ×
4800 (55 million pixels). The interaction space in front of the display wall (see Figure 4) has an area of approximately 3.7 m × 2.0 m and a height of about 3.5 m. The displays are connected to a cluster of six render nodes (slaves) and one additional head node (master). For the rendering, we utilize the graphics framework CGLX3 and the font rendering library FTGL4. Our prototype uses an infrared tracking system (Naturalpoint OptiTrack) with 12 cameras, which are arranged semicircularly around the interaction space. The user wears a baseball cap with reflective markers attached to it. Tracking these markers enables us to determine the position and the orientation of the user's head in the interaction space.

Fig. 4. Tiled LCD wall and interaction space indicated with floor marks

4.2 User Feedback
Using the aforementioned prototype, we collected preliminary user feedback. We asked computer science students (one female and seven male students) to test the application with the described interaction techniques for about ten minutes. The users explored the graph (see Section 3.1) by walking around in the interaction space. They could switch between the zone technique and the lens technique by flipping the baseball cap during the trial. In the interview afterwards the
3 http://vis.ucsd.edu/~cglx/
4 http://sourceforge.net/projects/ftgl/
participants were asked about readability of labels from any distance, ease-of-use, and their preferred technique. All participants indicated that the head tracking was easy to use and that the labels were readable from any distance. The participants reported that the zone technique was easier to use because there were no unintentional level switches during the interaction. This indicates to us that the threshold regions for the zone technique are effective. Although the zone technique was easier to control, the subjects found it less interactive and they felt that there was too much information in their peripheral visual field. The lens was named as an eye-catcher and as being more interactive than the zone technique. Providing detailed information inside the lens while maintaining the context outside the lens helped users to cope with the amount of information. On the other hand, the lens technique required a head calibration for every user to enable a reliable interaction behavior. Moreover, even after careful calibration, the users experienced unintended interaction caused by small head motions. This feedback suggests to us that we require improved filter methods to mitigate the effects of natural head tremor.
5 Summary and Future Work
We have introduced an approach to support the exploration of graphs on large high-resolution displays. The visualization is dynamically adjusted as the user moves in front of the display. Based on head tracking input, the labeling of the visual representation is recomputed to ensure label readability. Moreover, we implemented two interaction techniques based on physical navigation. Both the zone technique and the lens technique utilize the user’s position to determine the displayed level of abstraction. Additionally, the lens technique considers an approximation of the viewing direction to provide more details for a selected part of the data while preserving the overall context. Preliminary user feedback indicates that users can easily learn and apply the developed techniques. Our prototype is capable of showing two thousand nodes and labels at interactive frame rates. We understand our work as an initial step to be followed by further research on visualization on LHRDs. The next step is to integrate a very large data set (e.g., a metabolic pathway from systems biology) into the LHRD environment and to perform a detailed user study. There are a number of general questions to be answered as well as specific issues to be addressed. In terms of the visualization, we have to improve our algorithmic solutions for avoiding overlap of important visual information with bezels. We already achieved quite good results for the labeling and are confident that similar solutions can be found for clusters of nodes and edges of graphs. Further investigations are necessary to find general solutions for other visualization techniques. This also has to include the enhancement of existing methods to better utilize the available pixels. The large display space and the interaction space in front of the display are potentially useful for collaborative work. We have experimented with a setting
506
A. Lehmann et al.
where two persons collaborate. In such a setting, we have to consider enhanced visual feedback of the participating users and conflicts in the determination of the level of detail to be shown – globally as well as locally. However, this work is still in a very early stage. In future projects, we will investigate ways to enable users to explore graphs with physical navigation in a multi-user scenario. More specifically, with regard to our prototype, we plan to consider further interaction tasks to be accomplished in the LHRD environment. For instance, the user should be able to freeze selected nodes for the purpose of visual comparison. We are also thinking of manipulation techniques for graphs to allow users to edit the data. It is also conceivable that visualization parameters (e.g., font size) are interactively customizable within limits. We conjecture that the so far unused hands of the user are most suitable for these finer interaction tasks. Acknowledgments. This work was supported by EFRE fond of the European Community and a grant of the German National Research Foundation (DFG), Graduate School 1424 MuSAMA. We gratefully acknowledge implementation contributions by Thomas Gertz. We are also grateful for the valuable comments by the anonymous reviewers.
References

1. Cockburn, A., Karlson, A., Bederson, B.B.: A review of overview+detail, zooming, and focus+context interfaces. ACM Computing Surveys 41, 2:1–2:31 (2009)
2. Ni, T., Schmidt, G., Staadt, O., Livingston, M., Ball, R., May, R.: A survey of large high-resolution display technologies, techniques, and applications. In: Virtual Reality Conference 2006, pp. 223–236 (2006)
3. Ball, R., North, C., Bowman, D.A.: Move to improve: promoting physical navigation to increase user performance with large displays. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI). ACM, New York (2007)
4. Lee, B., Plaisant, C., Parr, C.S., Fekete, J.D., Henry, N.: Task taxonomy for graph visualization. In: Proceedings of the 2006 AVI Workshop on Beyond Time and Errors: Novel Evaluation Methods for Information Visualization, BELIV 2006, pp. 1–5. ACM, New York (2006)
5. Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: A survey. IEEE Transactions on Visualization and Computer Graphics 6 (2000)
6. Elmqvist, N., Fekete, J.D.: Hierarchical aggregation for information visualization: Overview, techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graphics 16 (2010)
7. van Ham, F., van Wijk, J.J.: Interactive visualization of small world graphs. In: Proceedings of the IEEE Symposium on Information Visualization (InfoVis), pp. 199–206. IEEE Computer Society, Los Alamitos (2004)
8. Tominski, C., Abello, J., Schumann, H.: CGV – an interactive graph visualization system. Computers & Graphics 33 (2009)
9. Moscovich, T., Chevalier, F., Henry, N., Pietriga, E., Fekete, J.: Topology-aware navigation in large networks. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI). ACM, New York (2009)
10. Keim, D.A., Schneidewind, J., Sips, M.: Scalable pixel based visual data exploration. In: Lévy, P.P., Le Grand, B., Poulet, F., Soto, M., Darago, L., Toubiana, L., Vibert, J.-F. (eds.) VIEW 2006. LNCS, vol. 4370, pp. 12–24. Springer, Heidelberg (2007)
11. Booker, J., Buennemeyer, T., Sabri, A.J., North, C.: High-resolution displays enhancing geo-temporal data visualizations. In: Proceedings of the ACM Southeast Regional Conference, pp. 443–448 (2007)
12. Vogel, D., Balakrishnan, R.: Interactive public ambient displays: transitioning from implicit to explicit, public to personal, interaction with multiple users. In: Proceedings of the 17th Annual ACM Symposium on User Interface Software and Technology, UIST 2004, pp. 137–146. ACM, New York (2004)
13. Ashdown, M., Oka, K., Sato, Y.: Combining head tracking and mouse input for a gui on multiple monitors. In: CHI 2005 Extended Abstracts on Human Factors in Computing Systems, CHI EA 2005, pp. 1188–1191. ACM, New York (2005)
14. Ball, R., North, C.: Effects of tiled high-resolution display on basic visualization and navigation tasks. In: CHI 2005 Extended Abstracts on Human Factors in Computing Systems, CHI 2005, pp. 1196–1199. ACM, New York (2005)
15. Peck, S.M., North, C., Bowman, D.: A multiscale interaction technique for large, high-resolution displays. In: Proceedings of the 2009 IEEE Symposium on 3D User Interfaces, 3DUI 2009, pp. 31–38. IEEE Computer Society, Washington, DC, USA (2009)
16. Khan, T.K.: A survey of interaction techniques and devices for large high resolution displays. In: Middel, A., Scheler, I., Hagen, H. (eds.) Visualization of Large and Unstructured Data Sets - Applications in Geospatial Planning, Modeling and Engineering (IRTG 1131 Workshop). Open Access Series in Informatics (OASIcs), vol. 19, pp. 27–35. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl (2011)
17. Ball, R., Dellanoce, M., Ni, T., Quek, F., North, C.: Applying embodied interaction and usability engineering to visualization on large displays. In: ACM British HCI - Workshop on Visualization & Interaction, pp. 57–65 (2006)
18. Battista, G.D., Eades, P., Tamassia, R., Tollis, I.G.: Graph Drawing: Algorithms for the Visualization of Graphs. Prentice-Hall, Englewood Cliffs (1999)
19. Bruls, M., Huizing, K., van Wijk, J.J.: Squarified treemaps. In: Proceedings of the Joint Eurographics - IEEE TCVG Symposium on Visualization (VisSym), Eurographics Association (2000)
20. Luboschik, M., Schumann, H., Cords, H.: Particle-based labeling: Fast point-feature labeling without obscuring other visual features. IEEE Transactions on Visualization and Computer Graphics 14 (2008)
An Extensible Interactive 3D Visualization Framework for N-Dimensional Datasets Used in Heterogeneous Software Display Environments
Nathaniel Rossol 1, Irene Cheng 1, John Berezowski 2, and Iqbal Jamal 3
1 Computing Science Department, University of Alberta
2 Alberta Agriculture and Rural Development, Government of Alberta
3 AQL Management Consulting Inc.
Abstract. Although many automated techniques exist to mine large N-dimensional databases, understanding the results is nontrivial. Data visualization can provide perceptual insights leading to the understanding of the results as well as the raw data itself. A particular application domain where the use of high-dimensional interactive data visualization has proven useful is the exploratory analysis of disease spread through populations, especially in the case of livestock epidemics. However, designing effective visualization tools for domain practitioners presents many challenges that have not been resolved by traditional interactive high-dimensional data visualization frameworks. To address these issues, we introduce a novel visualization system developed in conjunction with a livestock health surveillance network for interactive 3D visualization of high-dimensional data. Among the key features of the system is an XML framework for deployment of any high-dimensional data visualization tool to multiple heterogeneous display environments, including 3D stereoscopic displays and mobile devices.
1 Introduction
Many real-world problems involve processing multiple large-scale databases where the cross-stream data may not have obvious correlated patterns due to complexity and high dimensionality. In many cases, the datasets are collected for specific purposes by different institutions, making cross-references difficult. One example of where such issues occur is the field of epidemiology, which involves the study of disease spread through human populations, and the field of veterinarian epidemiology, which studies disease spread in livestock or animal populations, where epidemics can threaten public health if not discovered before reaching consumer markets. In this field, the health of an individual can be influenced by multiple factors such as nearby population density, abnormal weather, and even geography, all of which are typically gathered by separate institutions and tools. For example, local weather data is typically collected through nearby weather observatories, whereas cattle disease data is collected from farms and summarized at the municipal level. Observatories are located at strategic locations and do not have a one-to-one relationship with municipalities. Therefore, trying to understand the association between weather data and
cattle disease data can be tricky. Technical knowledge combined with professional experience is often required in order to analyze and understand the collected data. Moreover, in order to discover associations amongst the different factors, data analysts often include as many disparate data dimensions as possible, which unavoidably leads to large, complex, and highly dimensional datasets to be analyzed. The two general approaches used to explore such large-scale and highly complex data collections are Data Mining algorithms, which extract useful knowledge automatically, and Data Visualization tools, with which human experts identify useful information directly. Data mining has been proven to be effective and is commonly adopted in data exploration for larger datasets. Data mining analyzes data to uncover patterns and correlations, but is often limited in the kinds of data associations it can detect. If an obvious pattern is present in the dataset, but the data mining algorithm has not explicitly been designed to detect that specific pattern, then it will simply not be found, even if it is potentially quite important or useful to the end user(s). It is also not clear how to intuitively apply the professional expertise of domain experts to the data mining algorithms. Data visualization offers a natural alternative whereby intuitive visualization tools can be used by experts in the field to visually data-mine the dataset through different interactive data-views, extracting meaningful trends and patterns directly in accordance with their prior knowledge and experience. Through collaboration with a provincial government animal veterinarian surveillance program, an extensible, interactive 3D visualization system was designed and implemented to allow veterinarian epidemiologist practitioners to directly apply their expertise to extract useful patterns and trends from their large, highly dimensional animal disease surveillance datasets over a variety of heterogeneous software display environments. Development of the tool for this domain presented unique design and technical challenges beyond what is offered by traditional and existing high-dimensional visualization tools. In particular, the user requirement that the system be capable of deployment to a wide range of heterogeneous software display environments, whilst simultaneously offering an intuitive common interface that can be extended with additional data dimension visualization features in the future, necessitated the development of a novel XML-based framework for dynamically adapting the visualization interface to the features and limitations of its current display environment. The framework uses a combination of automatic hardware feature detection, expert knowledge, and developer-defined dependencies to automatically adapt the visualization interface to the current display environment the system is running on. In order to validate the effectiveness of the proposed system, the visualization tool and framework as a whole were evaluated by epidemiologist practitioners on a large, real-world, highly dimensional disease surveillance dataset. The results of the evaluation indicate that the system is effective in allowing epidemiologist practitioners to identify previously unknown patterns and trends; several new preliminary discoveries were already made as a result of using the system.
The contributions of this paper include a design framework for effective, highly dimensional visualization systems for use by epidemiologist practitioners and an XML-based framework for deploying extensible, highly dimensional visualization systems to a large number of heterogeneous software display environments. Although
the implemented system was created for use with livestock disease data in food production chains, it can also be used in a much wider range of data visualization applications, such as human disease spread or even ecological studies. The XML-based framework subcomponent of the system is highly flexible and can effectively be used in the design of essentially any extensible high-dimensional visualization framework built for deployment on heterogeneous software display environments. The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 introduces the proposed visualization framework, describes the design and technical challenges that were overcome, presents the user evaluation, and proposes an XML-based framework for deployment of extensible high-dimensional visualization systems onto heterogeneous software display environments. Finally, the conclusion and plans for future work are presented in Section 4.
2 Related Work
Interactive visualization tools have become attractive to users because of their graphical presentations, which often have higher expressive power than simple text, images, and videos alone. Properly structured graphics can convey the underlying associations [1] between otherwise seemingly independent elements, and can perceptually attract more attention from viewers [2, 3]. Visualization tools can highlight possible associations between the displayed data and alert the viewers, if necessary, for more in-depth follow-up examinations with statistical analysis or focused data mining techniques. The practice of using visualizations to assist analytical scientific reasoning is commonly referred to as "Visual Analytics" [4] and has recently generated a great deal of interest in both the research community and industry. For example, in the United States, the recently launched National Visualization and Analytics Center (NVAC) makes use of modern visual analytics techniques on highly dimensional datasets to identify events that are important to homeland security [5]. In order to convert highly dimensional datasets (i.e. datasets with a large number of attributes per entity) into a form that can be more easily viewed for visualization purposes, several common techniques exist. One common approach involves preprocessing the data through data mining, clustering, etc., and then visualizing the results of the preprocessing step rather than the raw data itself [6, 7]. For example, in the field of veterinarian epidemiology, Markov Monte Carlo models have been effectively used for modeling disease spread [8], and network representations, graph partitioning, reorderable matrix representations, and flow maps have all proven to be useful tools for application-specific analyses [9, 10, 11, 12]. While these approaches aid comprehension by breaking down large raw datasets into a more meaningful and structured form, the issue remains that the data pre-processing step might have removed or obfuscated patterns in the raw data that would otherwise have been of great interest or importance to the domain expert using the system. Another approach to dealing with highly dimensional data is to arrange it into a hierarchical structure. Systems such as n-Vision [13] split the highly dimensional data space into a group of simpler, lower-dimensional subspaces. These subspaces are then arranged in a hierarchy to facilitate comprehension, although the ease of understanding in
such visualizations naturally depends quite closely on the nature of the data and the partitioning used to generate the subspace hierarchy. A less complex alternative approach is to project the visualization into a lower-dimensional subspace. While this requires the end-user to create several custom data-views for their analytical investigations, no patterns or correlations are inadvertently lost through any pre-processing steps. Thus, the design used in this paper follows this second approach. Another key concern with multidimensional visualization systems for use in visual analytics is the desire for portability across the wide range of heterogeneous computer systems and display environments, including stereoscopic displays and mobile displays. Large scientific visualization systems such as GeoVISTA [14] address the issue of heterogeneous software display environments by implementing a common API and offering multi-dimensional visualization through a large array of separate applications within the framework. As a result, the system can be used in any display environment so long as it implements the GeoVISTA Java API specification. However, a major issue with this approach is that it requires the investigators using the system to potentially learn a large number of distinct software tools in order to analyze their high-dimensional datasets. This can be a problem for domain experts, who may not have the technical background that the various tools require. One of the more effective means for visualization of highly dimensional geospatial temporal data, such as the type encountered in disease surveillance, is a single unified visualization interface with interactive options that allow users to dynamically select data dimensions for comparison and render visualizations in real time, as is the approach used by GeoTime [15]. GeoTime is a commercial analytics tool designed for processing temporal geospatial information and has a high degree of customizability. This tool takes in sequential event data and displays it over a spatial map, with the vertical dimension indicating progression of time. While GeoTime allows for ease of use by practitioners through its unified interface, it does so at the cost of restricting itself to only standard desktop application display environments.
3 Design and Implementation of the Proposed Framework
3.1 System Design Requirements
As previously described, data visualization for the N-dimensional datasets used in epidemiology poses new challenges not handled by traditional N-dimensional data visualization approaches. Design decisions were therefore guided by working closely with veterinarian epidemiologist practitioners. In order to help minimize training time and facilitate efficient adoption of the tool, the system was designed with a consistent, unified user interface across all platforms. Fig. 1 shows an overview of the primary user interface of our proposed NDataVis System (N-dimensional Data Visualization System). The pane on the right side of the interface is the primary visualization pane and renders in real-time. This allows the user to animate their data and view how correlations or patterns change or emerge over time. Users can also interactively adjust their zoom and view angle at any time to assist them in performing their visual analysis in accordance with their knowledge and experience.
Fig. 1. Main user interface for NDataVis with the configuration panel shown on the left and the visualization pane on the right
Beyond animating 3D data along a time series in the 3 spatial directions, the system must be capable of visualization through an extensible list of visual data encodings including colour, size, iconography (i.e. the use of icons or special shapes to convey numerical meaning), etc. The data configuration panel on the left side of the interface is used primarily to bind data dimensions to visual encodings. The user first selects the visual encoding they wish to configure, and then chooses which data dimension to bind to that encoding. The user can then customize the visualization further by altering the default scaling of any data dimension. The visualization pane updates constantly as these changes are made, even during animation of the data, to help satisfy the domain requirement that the system be interactive and real-time. New visual encodings can be added to the system as it develops over time through an extensible, novel XML framework that intelligently handles which visualization encodings to make available in different software display environments. Due to the wide variety of displays and visual analysis tools available to the epidemiologist practitioners, including desktop computers, web browsers, stereoscopic projectors, and onsite mobile devices, the desire exists to offer the visualization tool to as many of these heterogeneous software display environments as possible. To allow for this, the system is Java-based and uses the cross-platform, open-source Java OpenGL API for its 3D graphics. Fig. 1 shows an example of the system used in a stand-alone desktop environment with a standard monitor, whereas Fig. 2 shows the system used in a multi-display 3D-stereoscopic projection environment. Data for the visualization is loaded into a single combined data table prior to being read. This can be accomplished by any short data-aggregation script, although more advanced data management tools could also be used. NDataVis currently uses the Konstanz Information Miner [16] to aggregate and format the data before loading.
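The binding described above, where each visual encoding holds a reference to a data dimension plus a user-adjustable scale factor, can be summarized in a few lines. NDataVis itself is Java-based, so the Python sketch below is purely illustrative; the class, encoding, and dimension names are hypothetical and not taken from the actual system.

```python
from dataclasses import dataclass

@dataclass
class EncodingBinding:
    """Binds one visual encoding (e.g. 'colour', 'size') to a data dimension."""
    dimension: str        # name of the bound data column
    scale: float = 1.0    # user-adjustable scaling applied to the raw value

def apply_bindings(record, bindings):
    """Map one data record (a dict of dimension -> value) to visual attributes."""
    return {encoding: record[b.dimension] * b.scale
            for encoding, b in bindings.items()}

# Hypothetical configuration: size tracks herd density, colour tracks case count.
bindings = {
    "size":   EncodingBinding(dimension="herd_density", scale=0.5),
    "colour": EncodingBinding(dimension="case_count",   scale=1.0),
}
print(apply_bindings({"herd_density": 12.0, "case_count": 3}, bindings))
# -> {'size': 6.0, 'colour': 3.0}
```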
Fig. 2. NDataVis in-use in a multi-display 3D stereoscopic display environment. In this particular visualization, 3D iconography has been selected and bound to a data dimension to help highlight changes in values of an attribute over time.
3.2 XML Framework for Deployment to Heterogeneous Software Display Environments
Designing and implementing an extensible and unified N-dimensional visualization system that is capable of being deployed to a wide range of heterogeneous software display environments poses several new technical and design challenges. When considering standard desktop displays, multi-view stereoscopic displays, and mobile devices, it is clear that features available in some environments are not available in others. Adapting software applications to the specific abilities and limitations of their software display environments is commonly referred to as Adaptive Content Delivery [17]. In traditional adaptive content delivery, the adaptive application typically uses automated means to measure its display and interaction capabilities and then adjusts the content presented to the user accordingly. While useful for our framework, automatic adaption alone is not sufficient for the types of situations that must be handled in N-dimensional visualization systems. Consider, for example, the visualization encoding "size", whereby the size of a data point in the visualization grows and shrinks in accordance with the value of the data dimension bound to it. This visualization setup presents a problem when it is used with the standard desktop monitor 3D visualization interface because ambiguity now exists between the "size" encoding and the Z-axis spatial encoding (i.e. "depth"). That is, it is no longer completely clear if a given data point is actually of a smaller size, or just simply further away from the user's viewpoint. However, when this visualization configuration is used in a multi-display stereoscopic environment, there is no ambiguity because the display environment now conveys stereo-depth information so that users can discern the distance of data points separately from their size. Naturally, an adaption rule disabling the use of the "size" visual encoding on standard displays would need to be manually defined by the developer. Additional restrictions can also
be placed on data visualization encodings due to special needs or preferences of the end-user. For example, the end-user of a system may suffer from colour-blindness and would therefore like to prevent the visualization system from using coloured data points in its visualizations. In this case, the adaption rule would be manually defined by the end-user performing the actual visual analysis. In order to support the above-mentioned system adaptability, we have developed an extensible XML-based framework for defining adaption rules for N-dimensional visualization systems deployed to heterogeneous software display environments. The basic XML structure of this framework defines, for each display environment, three levels of adaption rules:
- System: System-level adaption rules are automatically determined by the features and limitations of the software environment. For instance, a newly implemented data visual encoding module might use advanced vertex shader functions for complex animation of the data points, but these shader functions are not supported on all graphics hardware platforms. After the visualization system is launched for the first time, it will perform an automatic test to determine whether the current hardware supports the necessary features. If this test fails, the system will add an adaption rule to the "system" level rule set that disallows this visual encoding from being used in the present display environment.
- End-user: Adaption rules are defined by the end-user to disable or allow visual encodings in accordance with user preferences (e.g. colour-blindness, or perhaps even a personal preference not to use animated iconography).
- Developer: It is the responsibility of the developer of a new visual data encoding feature update to write adaption rules that allow the visualization tool to use only the visualization features that are appropriate for their display environments.
To ensure maximum system flexibility, higher-level rules can overwrite rules at a lower level. The priority rankings for rule sets define "system" as the highest level,
“end_user” as the second highest, and “developer” as the lowest level. This allows end users to customize their system beyond the defaults set by the developers without enabling visualization dimensions that were disabled due to hardware constraints. A final technical challenge involved in handling the incorporation of new visualization extensions, is to automatically and intelligently determine whether to allow or disallow the use of a new visual data encoding based on the existing rules. For example, consider the situation where a developer has completed a new update that extends the visualization system by adding a new visual display option that maps changes in the corresponding data attributes to the animation of a coloured texture pattern mapped to the 3D data point objects (i.e. higher values of the data attribute result in faster looping of the texture animation to indicate increased severity, and slower speeds to indicate lower values). There could be a variety of reasons why this visual encoding would not be applicable to certain environments. A user with colourblindness or related disability might have disabled the use of colour as a visual encoding, or another user may have disabled animated iconography as a visual encoding due to personal preference. Other display environments might have texture mapping disabled as a visual encoding due to limits on storage space or processing limitations. In order to automatically and intelligently determine whether or not to allow the use of the new visual data encoding by default based on other existing adaption rules, the proposed framework makes use of a developer-defined dependency graph, whereby the developer defines which existing visual encoding the new extension is dependent on. In the case of the aforementioned example which uses an animated colour texture to visualize changes in a data attribute, part of a possible example dependency graph is outlined in Fig. 3.
Fig. 3. Part of a sample dependency graph for a set of 6 visual encoding options. An arrow from one node to another indicates that the former is dependent on the latter.
In the sample system outlined in Fig. 3, if the texture pattern encoding is disabled due to texture mapping not being supported by the hardware, then the system-level disallow rule that defines this will be copied and added as a system-level disallow rule for the Animated Colour Texture visual encoding. In the event of disallow rules at multiple levels, only the highest level disallow rule will be applied. Using this XML-based framework approach, any N-dimensional visualization system that is deployed to multiple heterogeneous software display environments can automatically and intelligently adapt to the environment without sacrificing the ability of end-users to customize their systems as needed.
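The rule resolution described in this section, with "system" rules overriding "end_user" rules overriding "developer" rules and disallow rules propagated along the developer-defined dependency graph, can be sketched compactly. The actual framework stores its adaption rules in XML and is implemented in Java; the Python sketch below only illustrates the resolution logic, and the encoding names and rule representation are hypothetical.

```python
# Priority of the three rule levels: higher number wins.
PRIORITY = {"developer": 0, "end_user": 1, "system": 2}

# Hypothetical adaption rules per encoding: list of (level, allowed?) entries.
rules = {
    "texture_pattern":  [("system", False)],    # texture mapping unsupported here
    "colour":           [("end_user", False)],  # user prefers no colour encoding
    "animated_texture": [("developer", True)],
}

# Hypothetical developer-defined dependency graph: encoding -> encodings it depends on.
depends_on = {"animated_texture": ["texture_pattern", "colour", "animated_iconography"]}

def propagate(encoding):
    """Copy disallow rules from an encoding's dependencies onto the encoding itself."""
    for dep in depends_on.get(encoding, []):
        for level, allowed in rules.get(dep, []):
            if not allowed:
                rules.setdefault(encoding, []).append((level, False))

def is_allowed(encoding):
    """Resolve the final decision: the highest-priority rule applies; default is allowed."""
    applicable = rules.get(encoding, [])
    if not applicable:
        return True
    level, allowed = max(applicable, key=lambda r: PRIORITY[r[0]])
    return allowed

propagate("animated_texture")
print(is_allowed("animated_texture"))   # False: the inherited system-level disallow wins
```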
3.3 System Evaluation and Validation
In order to test the efficacy of the implemented system, user trials were conducted involving a group of epidemiologist practitioners using one year of province-wide veterinarian disease reports, totaling 81,942 separate disease report entries with 65 recorded attributes per report. The software display environments used included a standard desktop monitor setup and a multi-display polarized 3D stereoscopic projector setup. Using the visualization system in conjunction with other visualization tools developed in the collaboration, the epidemiologist practitioners were already able to make important preliminary discoveries that had not been accomplished using traditional means. Some of these new discoveries included:
- Several long-standing ideas regarding the interaction between seasonal disease trends, particularly in the spring and autumn months, were confirmed and validated by the visualization, and hints of several new correlations were discovered.
- The visualization system also clearly highlighted several possible shortcomings in the data collection level in certain provincial regions, possibly indicating a need to carefully re-evaluate some analyses performed in those areas.
- Beyond the discoveries made through the planned use of the tool, an unexpected additional use was quickly discovered: the system proved to be an effective way to compare various disease alert prediction algorithms by mapping them simultaneously within the same visualization space and directly comparing them to determine the relative strengths and shortcomings of each on real-world data through visual analysis. These comparisons enabled more effective and informed decision-making regarding which detection and prediction algorithms were most effective in specific roles and situations.
As a result of the effectiveness of the visualization system shown in the user trials, certain aspects of the system are now in the process of being adapted for full commercialization and use in day-to-day disease surveillance by the collaborating epidemiologists' organization.
4 Conclusion
In this paper, we presented the design and implementation of a framework for visual analysis of highly dimensional disease surveillance data. A key subcomponent of the system is a flexible XML framework that allows N-dimensional visualization systems to be deployed in an extensible way to a wide range of heterogeneous software display environments and to automatically and intelligently adapt their visualization feature set accordingly. This paper also presented the results of a user evaluation in which the framework proved effective in assisting veterinarian epidemiologists in analyzing highly dimensional disease report data in useful, novel, and meaningful ways. Plans for future work include adapting the tool for use in additional areas of disease surveillance and other types of medical data analysis.
References
[1] Compieta, P., Di-Martino, S., Bertolotto, M., Ferrucci, F., Kechadi, T.: Exploratory Spatio-Temporal Data Mining and Visualization. J. Vis. Lang. Comput. 18(3), 255–279 (2007)
[2] Wood, J., Dykes, J., Slingsby, A., Clarke, K.: Interactive Visual Exploration of a Large Spatio-temporal Dataset: Reflections on a Geovisualization Mashup. IEEE Transactions on Visualization and Computer Graphics 13(6), 1176–1183 (2007)
[3] Livnat, Y., Agutter, J., Moon, S., Foresti, S.: Visual correlation for situational awareness. In: IEEE Symposium on Information Visualization, pp. 95–102 (2005)
[4] Lawton, G.: Users Take a Close Look at Visual Analytics. IEEE Computer Magazine 42(2), 19–22 (2009)
[5] Thomas, J.J., Cook, K.A.: Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE CS Press, Los Alamitos (2005)
[6] Ferreira de Oliveira, M.C., Levkowitz, H.: From Visual Data Exploration to Visual Data Mining: A Survey. IEEE Transactions on Visualization and Computer Graphics 9(3), 378–394 (2003)
[7] Gross, M.H., Sprenger, T.C., Finger, J.: Visualizing Information on a Sphere. In: Proc. IEEE Information Visualization 1997, pp. 11–16 (1997)
[8] Jorgensen, E.: Calibration of a Monte Carlo Simulation Model of Disease Spread in Slaughter Pig Units. Computers and Electronics in Agriculture 25(3), 245–259 (2000)
[9] Guo, D.: Visual Analytics of Spatial Interaction Patterns for Pandemic Decision Support. International Journal of Geographical Information Science 28(8), 859–877 (2007)
[10] Ghoniem, M., Fekete, J.D., Castagliola, P.: On the Readability of Graphs Using Node-Link and Matrix-Based Representations: A Controlled Experiment and Statistical Analysis. Information Visualization 4(2), 114–135 (2005)
[11] Heath, F.M., Vernon, M.C., Webb, C.R.: Construction of Networks with Intrinsic Temporal Structure from UK Cattle Movement Data. BMC Veterinary Research 4 (2008)
[12] Bailey-Kellogg, C., Ramakrishnan, N., Marathe, M.V.: Mining and Visualizing Spatial Interaction Patterns for Pandemic Response. ACM SIGKDD Explorations Newsletter, 80–82 (2006)
[13] Beshers, C.G., Feiner, S.K.: Visualizing n-Dimensional Virtual Worlds within n-Vision. ACM SIGGRAPH Computer Graphics 24(2), 37–38 (1990)
[14] Gehegan, M., Hardisty, F., Demsar, U., Takatsuka, M.: GeoVISTA Studio: Reusability by Design. In: Hall, G.B., Leahy, M.G. (eds.) Advances in Geographic Information Science: Open Source Approaches in Spatial Data Handling, vol. 2, pp. 201–220 (2008)
[15] Proulx, P., Chien, L., Harper, R., Schroh, D., Kapler, T., Jonker, D., Wright, W.: nSpace and GeoTime: A VAST 2006 Case Study. IEEE Computer Graphics and Applications 27(5), 46–56 (2007)
[16] KNIME, http://www.knime.org/
[17] Ma, W.Y., Bedner, I., Chang, G., Kuchinsky, A., Zhang, H.: Framework for Adaptive Content Delivery in Heterogeneous Network Environments. In: Proc. SPIE, vol. 3969(86) (1999)
Improving Collaborative Visualization of Structural Biology
Aaron Bryden, George N. Phillips Jr., Yoram Griguer, Jordan Moxon, and Michael Gleicher
University of Wisconsin-Madison
Abstract. Structural biology is the study of how molecular shape, chemistry and physics connect to biological function. This work is inherently multidisciplinary and co-located group discussions are a key part of the work as participants need to refer to and study visualizations of the molecule's shape and properties. In this paper, we present the design and initial assessment of CollabMOL, a collaborative molecular visualization tool specifically designed to support small to medium-sized groups working with a large stereo display. We present a task analysis for co-located collaborative work in structural biology in which we find shortcomings in existing practice as well as key requirements of an appropriate solution. We then present our design of this solution and an observation-based user study to validate its effectiveness. Our design incorporates large stereo display support, commodity input devices and displays, and an extension to an existing molecular visualization tool.
1 Introduction
Structural biology is the study of how molecular shape and physics connect to biological function. The work is inherently multi-disciplinary, as it includes both those who understand molecular geometry, and others who are interested in how proteins are used within biological systems. Co-located discussions are a key part of the work as participants need to refer to and study visualizations of the molecules' shape and properties. Our goal is to develop systems to support this collaborative visualization that are inexpensive and effective. Collaborative molecular visualization is enabled by the availability of large, stereo displays. Such displays are becoming increasingly practical as consumer devices emerge. While current molecular visualization tools support a variety of displays, they are designed to support single-user tasks, and are not necessarily well-suited for collaborative work. Our premise is that group collaboration is best supported by software specifically designed to address the type of work being done. In this paper, we describe our experience developing a collaborative molecular visualization tool. We performed a task analysis to understand domain scientists' needs by identifying shortcomings in existing practice and key requirements. We used this task analysis to design a system that addresses these shortcomings and key requirements with low-cost consumer displays and input devices. This design is implemented
in a prototype, which we use to perform a user study in order to evaluate its effectiveness.
Contributions: Our primary contribution is to provide an analysis of the task of co-located collaborative molecular visualization, and the design for a system based on this analysis using commodity hardware. The specific elements of the solution may not be very novel. However, by basing the design on an understanding of the task, we were able to tailor the solution and better manage the tradeoffs in system functionality, cost, and usability. Together, the elements create a system that demonstrates that a task-informed design can produce an effective collaborative system. Our work also contributes a case study of how the design process can be applied to co-located collaborative visualization, and how tools designed this way can better serve user needs than systems not specifically designed to support it.
2 Related Work
Several areas of related work influenced our design process including molecular visualization, co-located collaboration and 3D interaction methods.
2.1 Molecular Visualization
Structural biologists and their collaborators often use molecular visualization to gain a better understanding of the shape and other properties of the molecules that they work with. They utilize a variety of abstract representations of these molecules and visualization options in order to best understand the molecule they are examining. Molecular visualization has a rich toolbox of visualization and analysis techniques. O'Donoghue et al. give an excellent overview of these techniques [1]. Some examples include ribbon representations [2], solvent excluded surfaces [3], ambient occlusion shading [4], and abstracted molecular surfaces [5]. There are several popular molecular visualization packages including VMD [6], PyMOL (http://www.pymol.org), and Chimera [7]. Because these tools provide a wide variety of visualization options and styles as well as analysis tools, scientists who work with them often invest a great deal of time and effort into gaining proficiency in their tool of choice. This allows them to quickly manipulate the visualization and perform the analysis necessary to make structure-function connections. Figure 1 demonstrates an example of the kind of visualization that they might create. This impressive depth of both visualization and analysis present in molecular visualization tools influences our design of an appropriate collaborative system.
2.2 Co-located Collaboration
Co-located collaboration refers to collaborative work that takes place in the same physical location. Previous work has explored a variety of display and interaction paradigms for co-located collaboration including immersive environments, table
Fig. 1. This protein (PDB: 1AKE) is rendered using a ribbons representation to show the secondary structure, a transparent molecular surface to show the shape of the binding cavity, a spheres representation to show the location of the ligand, and a sticks representation to show the location of residues near the ligand.
top displays, large tiled displays, and multiple co-located displays. Other tools for co-located collaboration in the domain of molecular visualization include multiple immersive applications [8,9] and a multi-display collaborative adaptation of JMOL [10]. Our work deals specifically with the design of a single-display co-located collaborative system for a specific application. Our experience and that of our colleagues have shown us that co-located discussions are common within the domain of structural biology. These and other existing collaborative applications help to inform our design process.
2.3 Bimanual Interaction in 3D Interfaces
Bimanual interaction has been shown to be an effective metaphor for interacting with 3D objects [11,12]. The non-dominant hand is used for view manipulation, which requires less precision, while the dominant hand remains free for pointing and selection tasks, which require more precision. View manipulation is used not only to orient the view towards relevant objects but also to gain better depth perception via motion parallax. Decoupling these tasks avoids constant switching between actions, facilitates improved perception of the object being explored, and increases task performance [12].
3 Task Analysis
In order for our collaborators to do their work it is necessary for them to engage in group discussions of protein structure. They have found that these discussions are more effective when facilitated by a large, vertical, stereo display, as shown in Figure 2. These discussions typically revolve around specific properties of small pieces or areas on a protein molecule and are usually relevant to some other domain of study such as how the protein interacts with other molecules in a larger system. In order for these discussions to be effective they need to leverage
the multitude of visualization and analysis options present in desktop molecular visualization software as well as their expertise in using the software. In this task analysis we identified shortcomings of existing practice, key tasks, and functional requirements. In this section we discuss the key tasks we observed, the shortcomings of existing practice, and the functional requirements of the system.
3.1 Methods
We observed several collaborative discussions over a period of approximately six months. During these observations we noted how the existing system is used in a collaborative setting and how users spend most of their time. We noted roadblocks to discussion and sources of confusion. We primarily used fly-on-the-wall observation, but if users encountered difficulties we asked them questions after they finished the session. In order to uncover other functional requirements of the system, we discussed our plans with our collaborators.
3.2 Key Tasks
Our observations revealed that while many different visualization tasks are used, collaborative work in this area is strongly dominated by two key tasks: viewpoint control and pointing. Because each participant may prefer to tell their own stories, moving between viewpoints is common, and considerable time is spent recreating previous views. In discussions, participants often use pointing gestures to refer to places on molecules. These gestures are done either with their hands or with the mouse pointer. Beyond the key tasks of viewpoint control and pointing, we observed selection of subparts of the molecule, switching between a set of selections, and changing the visualization style of selections as important tasks that occurred frequently.
Fig. 2. A collaborative discussion we observed in our task analysis. The participants had a hard time understanding exactly what was being pointed at due to the parallax issue associated with stereo viewing.
3.3 Shortcomings of Existing Practice
Through our observations we identified several shortcomings of existing practice that acted as roadblocks to effective group discussions. While the conversations often involved many people, only a single person could "drive" (e.g. hold the mouse). Switching drivers was time-consuming because the driver needed to sit at the console. Participants were forced to "back-seat drive" by describing desired viewpoint changes to the driver, but this was rarely satisfying. Additionally, it was often difficult for the person using the mouse to change the view effectively or move the pointer to the appropriate position because of the configuration of the space. These problems were compounded by the fact that the mouse pointer was only displayed on one of the two stereo viewpoints. Because of the limitations of the mouse pointer, it was common for participants to attempt to point out parts of the molecule they were studying with their hands. While in some domains this would work quite well, the use of stereo viewing caused a great deal of confusion when participants used their hands for pointing (due to parallax). Figure 2 shows an example of this problem. Because there was no explicit support for switching between multiple viewpoints and selection sets, participants spent considerable time recreating previous configurations. The system's interface presented users with reasonable access to an immense set of operations. However, it did not necessarily make the operations most common in discussion readily available.
3.4 Functional Requirements
In addition to potential areas for improvement our collaborators had key requirements for the system. These were: full compatibility with PyMOL including access to all functions provided by PyMOL during a collaborative session and the ability to load PyMOL session files; and the system must be comfortably usable by groups of 3-6 people.
4 Design
After performing the task analysis we designed a collaborative molecular visualization system around the shortcomings, key tasks, and functional requirements we observed. This section presents the design decisions we made and the rationale behind those decisions.
4.1 Design Process
In order to make our system as accessible as possible to small groups of domain scientists we wanted to use primarily inexpensive, consumer level hardware for our system. We used this constraint as the basis for our design decisions. Within the framework of this constraint we made a set of design choices that addressed the requirements found in our task analysis. The primary design
choices we made were the input methods and hardware to use and the types of displays to support. We based these decisions on how we could best support our requirements given the constraint of using low cost consumer level hardware.
4.2 Input Methods
The collaborative interaction with the molecular visualization is the area where we felt we could make the largest improvement over existing practices.
Choice of Input Hardware. Our input hardware must be able to support the key tasks we identified in a fashion equivalent to or better than using a mouse and keyboard in a single-user context, and it must be readily available and inexpensive in the consumer marketplace. We examined a variety of input strategies including multiple mice and keyboards, individual displays, and video game controllers. As we examined these, we realized that the input device type with the most prior use in group settings at the consumer level is video game controllers, due to their portability and ease of use in a single-display group setting. One potential pitfall with using video game controllers is that they might not be as flexible as a keyboard and mouse or an individual touch screen setup. Because of this, it was important to determine whether or not they could support our key tasks of view manipulation and pointing as well as a few other specific tasks. We found that video game controllers were excellent at supporting the core interaction metaphors of pointing and view selection and that the other key operations were limited enough that they could be mapped to buttons. We chose to abstract our control scheme in order to be able to experiment with both Wiimote-style controllers and dual-stick controllers, because both these controller styles are able to support view manipulation and pointing as core interaction metaphors and they have enough buttons to support the other important tasks. This abstraction also allows us to rapidly support a new game controller or change what control maps to what task.
Mapping of Functionality to Input Hardware. Our system must support the key tasks in a way that is easy to learn and consistent. We chose to use bimanual input to separate view control and pointing/selection tasks and to map other important operations to buttons on the controllers. Using the dominant hand for precision pointing and selection and the non-dominant hand for viewpoint control is a well-established input metaphor [11,12] and increases task performance and depth perception. This input metaphor maps well to video game controllers and is intuitive. Other important but less frequent operations are mapped to buttons. For operations that exist in multiple contexts, such as selection and deletion, we used consistent mappings. We used symmetric pairs of buttons such as d-pads for symmetric operations such as navigating a list of selections or changing the visualization style of a selection.
Cursor and Pointing. Users must be able to clearly communicate which sub-portion of the molecule they are referring to and effectively select residues, atoms, and chains. We found attempting to point at a particular part of the molecule
Fig. 3. A) Each user controls their own pointer, which is used for both pointing and selection tasks. The active user is displayed at the bottom of the screen. B) Users can easily return to previously saved viewpoints by navigating saved graphical representations.
using one's hands in a single stereo wall context was often very confusing due to parallax issues. This led us to the conclusion that, rather than relying on users to physically point using their hands and arms, it was necessary to give each user a pointer inside the system. While we could have used a variety of 3D selection and manipulation techniques, as discussed in Bowman et al. [13], we chose to use a virtual pointer because this most closely resembled the existing selection metaphor of PyMOL. Figure 3 shows the pointer.
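The control-scheme abstraction described above, with one fixed set of abstract actions that each physical controller maps onto, can be pictured as a simple lookup table. The sketch below is a hypothetical Python illustration rather than CollabMOL's actual code; the control and action names are invented.

```python
# Abstract actions the visualization understands, independent of the controller.
ACTIONS = {"rotate_view", "pan_view", "zoom_view", "move_pointer",
           "select", "save_view", "load_view"}

# Hypothetical per-controller mappings from physical controls to abstract actions.
DUAL_STICK = {
    "left_stick":    "rotate_view",    # non-dominant hand: view control
    "left_trigger":  "pan_view",
    "left_bumper":   "zoom_view",
    "right_stick":   "move_pointer",   # dominant hand: pointing
    "right_trigger": "select",
    "dpad_up":       "save_view",
    "dpad_down":     "load_view",
}

WIIMOTE = {
    "nunchuk_stick": "rotate_view",
    "ir_pointer":    "move_pointer",
    "b_button":      "select",
    "dpad_up":       "save_view",
    "dpad_down":     "load_view",
}

# All mapped controls must resolve to known abstract actions.
assert set(DUAL_STICK.values()) <= ACTIONS and set(WIIMOTE.values()) <= ACTIONS

def translate(controller_map, control, value):
    """Turn a raw controller event into an (action, value) pair, or None if unmapped."""
    action = controller_map.get(control)
    return (action, value) if action else None

print(translate(DUAL_STICK, "left_stick", (0.3, -0.7)))  # ('rotate_view', (0.3, -0.7))
```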
4.3 Display Type
Our collaborators felt strongly that stereo viewing was a necessary component of their collaborative discussions. For this reason we chose to support consumer-level projectors and 3D HDTVs with stereo viewing capabilities. In the past, this requirement would have necessitated expensive, custom hardware, but consumer 3D devices have vastly improved in recent years.
4.4 View and Selection Bookmarking
In our task analysis we discovered that participants spent a great deal of time recreating previous views or trying to compare views. We chose to make view saving and loading easily accessible via the game controller, with snapshots of views serving as a guide for selecting previously saved views. This functionality is available in PyMOL's desktop interface, but is obscure and requires typing function names using the keyboard. We felt it should be more prominent for collaborative visualization. We also observed that participants frequently used different sets of selections to make comparisons or change visualization options. For this reason we made saving and restoring selections easily accessible.
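Because CollabMOL is built on PyMOL, view bookmarking can be layered on PyMOL's scripting API, which exposes the current camera through cmd.get_view() and restores it with cmd.set_view(). The sketch below, which must run inside PyMOL, shows one way a plugin could implement this; it is an illustration under that assumption rather than the authors' implementation, and the thumbnail step is a simplification.

```python
# Runs inside PyMOL (e.g. from a plugin); 'cmd' is PyMOL's scripting interface.
from pymol import cmd

saved_views = []   # list of (view_matrix, thumbnail_path) bookmarks

def save_view_bookmark(thumbnail_path=None):
    """Store the current camera so it can be restored later from the controller UI."""
    view = cmd.get_view()              # 18-element camera description
    if thumbnail_path:
        cmd.png(thumbnail_path, width=160, height=120)  # small snapshot as a visual cue
    saved_views.append((view, thumbnail_path))
    return len(saved_views) - 1        # index usable as a bookmark id

def load_view_bookmark(index):
    """Return the camera to a previously saved viewpoint."""
    view, _ = saved_views[index]
    cmd.set_view(view)
```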
Fig. 4. The system only allows one user at a time to give input, other than moving their pointer. If the active user stops providing input for 3 seconds the system allows another user to take control.
4.5 Floor Control
Giving every user their own input device has significant advantages but it also has the potential for conflicting input. Our floor control policy attempts to prevent conflicting access without requiring explicit coordination. The model uses two states, as shown in Figure 4. In one state, anyone may take control by beginning an action such as viewpoint control. Once a user has taken control, they have exclusive control while they complete their action, and for a brief period afterwards to allow them to start a new action (to accommodate pauses). The state of floor control is shown in the display so that each user knows when they are in control or another user is in control in order to avoid confusion.
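The floor-control policy of Figure 4 amounts to a small state machine: a successful action takes the floor, further actions by the active user keep it, other users' actions fail, and the floor is released after a short idle period. A minimal Python sketch of this logic follows; the 3-second timeout comes from the figure, while the class and method names are hypothetical.

```python
import time

class FloorControl:
    """Grants one user at a time exclusive control, released after an idle timeout."""
    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.active_user = None
        self.last_action = 0.0

    def try_action(self, user):
        """Return True if 'user' may perform an action right now."""
        now = time.monotonic()
        if self.active_user is not None and now - self.last_action > self.timeout:
            self.active_user = None            # idle too long: release the floor
        if self.active_user is None:
            self.active_user = user            # floor is free: this user takes control
        if self.active_user == user:
            self.last_action = now             # active user keeps/refreshes control
            return True
        return False                           # someone else holds the floor

floor = FloorControl()
print(floor.try_action("alice"))   # True  (alice takes control)
print(floor.try_action("bob"))     # False (alice is still active)
```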
5 Implementation Details
We implemented our design as a plugin to the PyMOL molecular visualization software because of our users' need to access its functionality. Our implementation differs slightly between Wiimote and dual-stick style controllers. These differences are discussed here.
5.1 Viewpoint Control and Viewpoint Selection
For both input devices we implemented rotation using the left thumbstick because, in the bimanual input metaphor, the non-dominant hand controls the view and the dominant hand points. We made rotation the default control for the left analog stick and used the left trigger and bumper (easily pressed with the fingers of the left hand) to toggle panning and zooming of the view. In both cases we mapped saving views and loading from a set of saved views to one of the easily accessible buttons. Figure 3 shows an example of the viewpoint selection screen.
5.2 Pointing and Selection
Pointing on the Wiimote used a laser pointer metaphor: spatial tracking (using the controller's built-in camera) allowed a user to point at the screen. On the
dual-stick style controller, pointing was implemented by using the right analog stick to make relative motion (the poor support for left-handed users is a common criticism of dual-stick devices; however, their widespread adoption in the marketplace suggests this may not be significant). In both cases selection uses the trigger on the right hand. Figure 3 shows an example of several pointers being used at once. While the laser pointer metaphor seemed attractive, it proved ineffective for several reasons. There were challenges in making pointing stable and precise enough for the task. We addressed these issues through Kalman filtering [14] and snapping to nearby objects. A practical issue, namely that the Wiimote's camera does not have a sufficient field of view to work effectively with the large screen display, ultimately precluded us from deploying the Wiimote solution. Our pilot studies also suggested other issues, particularly fatigue, as the user needs to hold the pointer up to maintain its position.
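The pointer stabilization mentioned above can be thought of as a constant-velocity Kalman filter over the 2D cursor position, followed by snapping to the nearest selectable object. The Python sketch below is a generic textbook filter, not the authors' tuned implementation; the noise parameters and snap radius are arbitrary placeholders.

```python
import numpy as np

class CursorKalman:
    """Constant-velocity Kalman filter smoothing a noisy 2D cursor position."""
    def __init__(self, dt=1/60.0, process_noise=50.0, measurement_noise=4.0):
        self.x = np.zeros(4)                                    # state: [px, py, vx, vy]
        self.P = np.eye(4) * 1e3                                # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt    # constant-velocity motion model
        self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * process_noise
        self.R = np.eye(2) * measurement_noise

    def update(self, measured_xy):
        # Predict.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct with the new measurement.
        z = np.asarray(measured_xy, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                                       # smoothed cursor position

def snap(cursor_xy, targets, radius=15.0):
    """Snap the cursor to the nearest selectable object within 'radius' pixels.
    Assumes 'targets' is a non-empty list of (x, y) screen positions."""
    targets = np.asarray(targets, dtype=float)
    d = np.linalg.norm(targets - np.asarray(cursor_xy, dtype=float), axis=1)
    i = int(np.argmin(d))
    return targets[i] if d[i] <= radius else np.asarray(cursor_xy, dtype=float)
```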
6 User Study
In order to evaluate our design we performed a user study on groups of experts in the domains of structural biology and related fields. These participants were either graduate students or professors in structural biology related fields. The goal of our user study was both to validate the individual elements of our design and to test the hypotheses that a molecular visualization system designed with multiple users in mind facilitates more effective collaborative discussions than one designed for a single desktop user, and that such a system is achievable using inexpensive, consumer-level hardware. We used Xbox 360 controllers in the user study because they seemed to be the most comfortable of the dual-stick controllers, because they support multiple wireless controllers simultaneously, and because our pilot study revealed the Wiimotes were not adequate for the pointing task. Our pilot study also revealed that with small groups the main purpose of floor control should be to disallow actual simultaneous input.
6.1 Methods
We performed an observational study on small groups of structural biologists and their collaborators. We observed 4 small groups of domain experts using our system in a collaborative discussion. Each session lasted 45 minutes and used a combination of fly-on-the-wall observation and a semi-structured interview afterwards. We gave each group a short tutorial on how to use the system and then encouraged them to discuss their work with each other using the system. Participants discussed protein molecules of interest to them without our intervention. Both the controllers and the mouse and keyboard interface were available. Afterwards we performed a semi-structured interview about their impressions of the system. We used video recording to analyze how the groups used the system and to record answers to the interview.
We evaluated the individual elements of our design by whether the participants were able to complete key tasks efficiently using our interface and whether or not they felt they could complete those tasks with the same ease as with the mouse. We evaluated this both in terms of whether participants could complete key tasks immediately, by the end of the tutorial, by the end of the group discussion, or not at all, and based on the semi-structured interview. We evaluated the overall effectiveness of our system based on how the participants used the system during the group discussion of a molecule and on the reaction in the semi-structured interview. We tested whether or not participants were able to use our interface to perform key tasks or if they resorted to using the mouse, and we measured how much of the time in the discussion was spent performing tasks that we had not mapped to the controller and whether they encountered significant bottlenecks similar to those observed in the task analysis. While the depth of the PyMOL molecular visualization system means that some tasks will necessarily be performed using the mouse and keyboard, a successful collaborative system should enable the vast majority of interaction with the system over the course of a discussion through the collaborative interface.
6.2 Results
In all cases, participants were able to quickly learn to use the game controllers for view manipulation, pointing, and selection tasks. The overall response to the system was very positive, and the group discussions were able to use the controllers for the overwhelming majority of desired tasks.
View Manipulation. Twelve of the 13 participants were immediately comfortable manipulating the view, and the remaining participant was comfortable after the tutorial. Several participants felt that the view manipulation was easier than with the mouse. Two participants who were expert PyMOL users noted that while manipulating the view with the controller was easy and effective, they were able to go faster with the mouse. Throughout all the group discussions, no participant felt the need to use the mouse for view manipulation.
Pointing and Selection. All participants were immediately able to use the pointer. Eleven of the 13 participants found selection easy, and the other two participants required time to adjust to the sensitivity. Several participants expressed appreciation of the improved clarity associated with the pointer compared to using hand gestures to point at specific pieces of the molecule.
View and Selection Bookmarking. Participants were immediately able to use the view and selection bookmarking system after being shown how to use it. Over the course of the evaluation, these functions were not used very much during the group discussions. We speculate that this functionality is new to most users, as it is typically inconvenient and therefore infrequently used on the desktop. Several participants were interested in using pre-prepared selections as opposed to new selections, which we plan to add support for in the future.
Floor Control. Disallowing simultaneous input and displaying the active user proved adequate for the group sizes we used to test the system.
Overall Effectiveness. Participants only used the mouse and keyboard to achieve unusual tasks that were not mapped to the controller. The vast majority of the time spent interacting with the system was done through the controllers. In all group discussions, more than 90 percent of the time spent interacting with the visualization was done via the controllers. The bottlenecks to discussion we identified in our task analysis did not slow down the discussions. All of the participants felt that the collaborative interface was better than the desktop interface for the majority of uses in a collaborative session. Some felt that we should have included a way to toggle the display of ligands (small molecules that bind to the protein). We felt that this demonstrates that they preferred the controller to the mouse, and we plan to include this functionality in the future.
6.3 Analysis of Results
We found that a visualization system designed with multiple users in mind is more effective for group discussions than simply displaying a single-user system on a large wall. Additionally, we demonstrated that this can be achieved at a hardware cost similar to that of a medium-end home entertainment system, as we were able to use consumer-level displays and inexpensive video game controllers as input devices. Our design succeeded in all areas. In the area of bookmarking, we discovered improvements that will be incorporated into the next iteration of the design.
Acknowledgements. We would like to thank Tom Grim and the members of the Phillips laboratory for their contributions to this work. This work was supported in part by NSF awards IIS-0946598 and CMMI-0941013 and National Library of Medicine grant R01-LM008796.
Involve Me and I Will Understand! – Abstract Data Visualization in Immersive Environments René Rosenbaum, Jeremy Bottleson, Zhuiguang Liu, and Bernd Hamann Institute for Data Analysis and Visualization (IDAV), Department of Computer Science, University of California, Davis, CA 95616, U.S.A.
Abstract. Literature concerning the visualization of abstract data in immersive environments is sparse. This publication is intended to (1) stimulate the application of abstract data visualization in such environments and to (2) introduce novel concepts involving the user as an active part of the interactive exploratory visualization process. To motivate discussion, requirements for the visualization of abstract data are reviewed and then related to the properties of immersive environments in order to show their potential for data visualization. This leads to the introduction of a novel concept for immersive visualization based on the involvement of the viewer in the data display. The usefulness of the concept is shown by two examples demonstrating that immersive environments are far more than tools to create visually appealing data representations.
1
Introduction
Visualization has been quite successful for data analysis, but limitations still exist. One limiting factor is the widely used two-dimensional (2D) display and interaction technology. Although we live and interact in a three-dimensional (3D) environment, most data visualizations use only two dimensions and neglect depth. This is often reasonable, as the display of 3D data on 2D desktop screens imposes many drawbacks, such as occlusion, cumbersome interaction, and missing depth cues. First attempts to overcome these problems in data visualization were made by presenting data in immersive environments (IEs). IEs are able to provide a synthetic 3D display and interaction space, which is rendered from the first-person viewpoint [1]. Due to the use of stereoscopic vision and motion parallax, IEs are able to mimic our natural 3D viewing environment and thus can provide a high level of physical immersion (see Figure 1, left). Early IEs relied on costly hardware, but with the recent boost in 3D display hardware, the technology has become much more affordable. Research on visualization in IEs mainly focuses on spatial data. Not much is known about abstract data. This might be due to the fact that the "natural" representation of available spatial components is often considered to be the
The author gratefully acknowledges the support of Deutsche Forschungsgemeinschaft (DFG) for funding this research (#RO3755/1-1)
Fig. 1. The visualization of spatial data in immersive environments has a long tradition (left). Visualization of abstract data in such environments is usually limited to three-dimensional adaptations of two-dimensional data displays (right).
main reason for its application. Although not necessarily containing spatial components, abstract data must also be spatially arranged in order to be displayed. Thus, abstract data visualization can benefit from IEs [1]. We would like to review and stimulate discussion on the visualization of abstract data in immersive environments - immersive Abstract Data Visualization (iAV). We also introduce a novel concept for iAV not available for non-immersive viewing technology. After reviewing existing research in this field (Section 2), the main requirements of the visualization of abstract data are discussed (Section 3) and used to show that the properties of IEs can help to advance data representation and interactive analysis (Section 4). To illustrate the largely undiscovered potential of iAV, a novel concept considering the user as an active part of the visualization instead of a passive viewer is introduced (Section 5). Its implementation is shown for the two traditional visualization techniques, the scatter plot and the parallel coordinates plot, each focusing on a different aspect of the concept. We conclude that iAV will not replace common desktop-based data exploration, but will enrich the state of related visualization technology with novel means for visual representation and new ways to interact with the data (Section 6).
2
Work Related to Immersive Abstract Data Visualization
Compared to the immersive visualization of spatial data, often referred to as scientific data, not much is known about iAV. Existing approaches can mainly be subdivided into two groups - strategies with and strategies without the virtual world metaphor. Approaches applying the virtual world metaphor use visuals that mimic principles and behavior of the real world. Most strategies create a "virtual world" from the data to facilitate conveyance of data aspects and to simplify navigation. Others enrich a virtual world with infographics [2]. The concept introduced in
this paper does not take advantage of a virtual world metaphor and thus is significantly different from this approach. Visualizations for IEs not using the virtual world metaphor are pure data displays. They take advantage of the provided depth cues and thus of an enriched and more natural 3D data representation. Most of the techniques are founded on developments for desktop environments, such as the techniques shown in Figure 1, right. Recent research [1] has shown that this provides a much better data representation and supports conveyance of properties of the data. Most common are traditional scatter plots, which can be found in many varieties [3,4,1], often paired with an appropriate glyph-based representation to show more than three data dimensions in a single display. However, adaptations of other data displays, such as line charts [3] or the parallel coordinates plot [4], have also been proposed. Novel visualization strategies specifically developed for immersive environments are rare. The few existing approaches have mainly been proposed for single application areas and data types that have specific needs, such as software [5] or documents [6]. These solutions are strongly domain- and problem-specific. The novel concept introduced in this paper is broadly applicable and involves the user in the representation. Thus, it opens up a whole new view of iAV.
3
Main Requirements for Abstract Data Visualization
The visualization of abstract data is related to the field of information visualization. Card et al. [7] identified an appropriate (1) visual representation and (2) means for interaction as the key requirements for meaningful solutions in this domain. We will discuss to what degree these aspects can be implemented by 2D and 3D representations. Visual representation. In order to amplify cognition [8], properties of the data are mapped onto appropriate perceptual attributes, such as position, orientation, size, shape, or color. Position is probably the most frequently used attribute, as it relies on the strong ability of the human visual system to determine distances and relate objects to each other. This ability is strongest in 2D representations. Thus, they have a long and successful tradition, especially for low-dimensional data sources, and can reduce visual overload and the complexity of comparison and relation tasks. However, when data becomes more complex, e.g., has higher dimensionality, their limits quickly become apparent. 3D representations have two main advantages compared to their 2D counterparts: (1) there is an additional spatial dimension available for value encoding, and (2) they help shift the viewing process from being a cognitive task to being a perception task [9]. This enables much faster processing of the contents [10] and makes them more natural and appealing to us. Thus, encoding data into a third spatial dimension is often superior to other means. The ability of the human visual system to determine distances, however, is less accurate in 3D space. In order to display a data representation on a common screen, it must be projected into the 2D screen space. This process causes occlusion issues and
visual clutter for 3D graphics. Due to this, such representations have been labeled weak-3D [11]. Interaction. Most of the established interaction devices are designed for 2D displays and representations. Widely accepted solutions for interacting with 3D representations on 2D screens are not available, making data exploration in such set-ups difficult or even impossible. As meaningful interaction is also imperative to solve the inherent occlusion problem and to improve the viewer's understanding of 3D content, there has always been a strong controversy about whether 3D representations are meaningful in the visualization of abstract data. Given their advantages for visual data representation, one might conclude that the lack of appropriate means for interaction is probably the main reason why 3D information displays are often neglected.
4
Benefits of Immersive Environments for Abstract Data Visualization
In this section, we show that the realistic three-dimensional data representation and intuitive means for interaction inherently provided by IEs are able to meet the requirements for abstract data visualization, thus allowing full advantageous use of 3D graphics. IEs can also provide a solution for the often related problem of visualizing large data volumes. 4.1
Realistic Three-Dimensional Representation
IEs create the impression of a user being "present" in the visualization, resulting in a data representation that is natural to the viewer and requires less cognitive strain. This leads to amplified cognition as well as increased user acceptance [12] and performance in tasks typical for information exploration [13]. First developments in iAV revealed that IEs are well-suited for tasks requiring a spatial understanding [1] or a mental model of the data and its representation [12]. As shown for simple data or more sophisticated cluster displays as well as tree and graph visualizations, task completion times can often be significantly reduced [14,12] or are generally lower [15] compared to those of 2D representations. Keys for this success are the provided depth cues [12] and immersion [15]. However, it has been reported that such a strong-3D display [11] does not always provide better results than a weak-3D display. This especially applies to less cluttered data representations. 4.2
Intuitive Means for Interaction
IEs are closer to natural interaction than many other forms of computer systems and inherently overcome the problems imposed by interacting with 3D data. They support most tasks relevant to abstract data exploration [13]. Crucial for their success is the natural and effortless ability to change the viewing perspective by head-based rendering (HBR), providing effective means in solving occlusion
problems and adjusting the detail level of the data. Panning and zooming may be used to examine different parts of the data representation. As stated in [15], this characteristic leads to better task completion times, higher usefulness ratings, and less disorientation compared to other forms of interaction. As the viewer is immersed in the environment and can almost “touch” it, more complex forms of interaction such as selection may also be provided in an intuitive and natural way to the viewer. The respective implementation, however, would strongly depend on the given application context and available hardware, such as magic wands or data gloves. 4.3
Visualization of Large Data Volumes
Because data must often be displayed in detail and in many different views at a time, the available screen space becomes one of the limiting factors in the visualization of large data volumes. Clusters of computer screens and display walls have recently become very common in pushing the existing boundaries. Immersive environments probably represent the ultimate solution for this trend. They provide a field of view and resolution that shift the constraining factor from the hardware used to the abilities of the human visual system. The virtually unlimited screen space also significantly reduces viewing interactions. The supported natural means for zooming and panning help to overcome the general problem of visual overload. 4.4
Drawbacks
Despite the many advantages provided by IEs, it has been reported that initial learning effort may increase task completion times, especially for users unfamiliar with IEs [14]. However, such difficulties are usually quickly overcome [16]. There are also cases where IE usage does not provide a significant advantage compared to non-immersive display technology [11]. As IEs only advance the means for data representation and interaction, this especially applies to solutions that strongly rely on other stages of the visualization process, such as filtering. In poorly designed representations, disorientation or cybersickness may appear [16]. One major drawback of IEs, often neglected in the related literature, is their reliance on direct interaction with the data. The resulting higher level of physical activity compared to common desktop-based data analysis can lead to fatigue, especially for complex and long exploration tasks.
5
Novel Strategies for Immersive Visualization of Abstract Data
5.1
The Concept
This section introduces a novel concept for iAV that adds new display and interaction strategies on top of the general advantages of IEs for data visualization. "Tell me and I'll forget. Show me and I may remember. Involve me and I'll understand." - Confucius.
Fig. 2. Immersive Scatter Plots without (left) and with an immersed user looking at one of the visible three-dimensional data clusters (center, right)
The concept is founded on the ability of IEs to let the viewer immerse themselves in the data and aims to gain more insight with less effort. Immersion is used to let the user be an active part of the visualization instead of a passive viewer, opening up a novel display and interaction paradigm. The foundation of the proposed concept is the exceptional support of location changes and spatial understanding in IEs, allowing viewers to easily position and relate themselves to the data representation. Considering the viewer as part of the visualization promises to provide a better understanding of the data and intuitive means for interaction. Dependent on the role of the viewer within such a truly immersive visualization, two different approaches are introduced: the user is considered as (1) part of the data that can interact with other data points, or (2) part of the layout that serves as a reference scale or a means to change the layout. We are aware of the fact that the proposed concept might require a new understanding of data representation. Involving the user, however, is the next logical step in the development of more sophisticated visualization techniques for IEs. The implementation of the concept is demonstrated by the extension of two widely used visualization techniques – the scatter plot and the parallel coordinates plot. Each implementation is explained focusing on the novel means for visual representation and interaction. The techniques have been tested in the UC Davis KeckCAVES virtual reality environment consisting of a four-walled CAVE system. Interaction is accomplished by HBR and a handheld wand with multiple buttons.
Fig. 3. Size and color of data points within iSP change depending on the user position, introducing means to explore a six-dimensional data space (left, center). Unlocking the user position allows the viewer to freely explore interesting set-ups (right; the orange point represents the last user position).
5.2
Immersive Scatter Plot (iSP)
Background and main idea: In common scatter plots, dimensions must be neglected when data of higher dimensionality is to be visualized, leading to information loss. By taking advantage of immersion and treating the position of the immersed viewer as an individual data point in data space, an immersive Scatter Plot (iSP) is able to represent characteristics of higher-dimensional data. Six-dimensional (6D) data is used in this example. Representation: An iSP without an immersed viewer is identical to an ordinary 3D scatter plot (see Figure 2, left). In order to represent data of higher dimensionality, we take advantage of the "worlds within worlds" metaphor [17] and overlay the displayed primary coordinate system, the master plot, with a secondary coordinate system, the navigation plot. Both systems share the same origin and alignment of the dimension axes. The master plot represents three dimensions of the data arbitrarily selected by the author via interactive menus. As within the traditional scatter plot, these dimensions determine the respective positions of the data points in the plot. The navigation plot is formed by three other dimensions of the data and serves only for navigation purposes. The corresponding data points are not shown. Both plots are linked by an immersed viewer that is placed inside the visualization. As both plots are superimposed, the position of the viewer determines a point in the 6D data space. The distance of this point with regard to all other points is then encoded in their size and color (see Figure 3). Distances are calculated in 6D space using the Euclidean distance metric. Data points that are close to the user position are larger and brighter. Interaction: The viewer mainly interacts with the plot by constantly changing position and analyzing the changes that this imposes on the visual encoding of the data points. This makes it possible to evaluate properties, e.g., data clusters, that are observed in the master plot for their behavior in 6D space. If the cluster points show identical or similar encodings during position changes, the cluster property holds for the 6D data space (see Figure 3, center). Different groups of cluster points, identifiable by their respective size and color, represent different data clusters in 6D space (see Figure 2, center and right). For a structured exploration of the data space two strategies are proposed: (1) single and (2) multiple axes browsing. In single axis browsing, the user changes position with regard to a single axis, giving feedback about the properties of the data with regard to the associated dimension only (see Figure 3, left). In multiple axes browsing, the user position is varied along a certain trajectory in the plot, providing feedback on higher-dimensional correlations (see Figure 3, center). As it is not always desired to track the user position, we implemented the option to disable this tracking. This keeps the point representation constant, allowing for its further exploration at different angles or distances (see Figure 3, right). Although not mandatory for immersion, we also provide an interactive wand making it possible to pan, zoom, or rotate the visualization in presentation space.
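A minimal sketch of this distance-based encoding is given below, assuming NumPy arrays and an arbitrary split of the six data dimensions into master and navigation axes; the function name, scaling constants, and array layout are illustrative assumptions, not details of the KeckCAVES implementation.

```python
import numpy as np

def encode_isp_points(points_6d, user_master_pos, user_nav_pos,
                      min_size=2.0, max_size=20.0):
    """Map 6D Euclidean distance from the viewer to point size and brightness.

    points_6d       : (N, 6) array; columns 0-2 are the master-plot dimensions,
                      columns 3-5 the superimposed navigation-plot dimensions.
    user_master_pos : viewer position read off the master-plot axes (3 values).
    user_nav_pos    : the same physical position read off the navigation axes.
    """
    user_6d = np.concatenate([user_master_pos, user_nav_pos])
    dists = np.linalg.norm(points_6d - user_6d, axis=1)   # distances in 6D space
    span = dists.max() - dists.min() + 1e-9
    t = 1.0 - (dists - dists.min()) / span                # nearest points -> t close to 1
    sizes = min_size + t * (max_size - min_size)          # nearest points are largest
    brightness = t                                        # and brightest (0 = dim, 1 = bright)
    return sizes, brightness
```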
Fig. 4. Immersive Parallel Coordinates consider the user as a central axis that is connected to a specified number of main axes (left, center). In nearest-axes mode, connections change according to user position (right).
Properties: The iSP approach allows exploring a 6D data set with a common 3D scatter plot. Our tests have shown that it is simple to "mentally" link both plots, although it might take some initial training effort to understand the principle and to memorize the dimension mapping. As both coordinate systems are superimposed, only a single 3D position must be determined to extrapolate the associated 6D point. Position changes allow for quick and reliable feedback about higher-dimensional data features. The immersion of the viewer and the resulting close proximity to the displayed data points might be a drawback for certain views, as it limits the number of points visible within a view. Occlusion of data points can also be an issue. Both problems, however, only matter when the mental model of the data is lost. They can be overcome by unlocking the user position from the representation. Further, dimension axes that are overlaid cannot be considered independently. This can be solved by superimposing axes with similar characteristics. 5.3
Immersive Parallel Coordinates (iPC)
Background and main idea: There are different 3D variations of the parallel coordinates plot. Some, such as those described in [18], use the concept of an importance axis that is placed in the center of the representation and connected to all other axes. This can lead to visual clutter for high-dimensional data. User-driven (de)selection of relevant connections may help to overcome this problem, but such interactions are difficult when 2D display technology is used. Immersive Parallel Coordinates (iPC) adopt the approach of a central axis and demonstrate the second concept of iAV – the user serving as part of the layout. In iPC the central axis is labeled the user axis and is represented by the viewer. Representation: Similar to iSP, the visualization of iPC without an immersed viewer does not significantly differ from its standard counterpart. This especially applies to the main axes, which are arranged on a 2D plane. The user axis is shown as soon as the viewer enters the visualization (Figure 4, left and center) and is displayed in front of the viewer, stretching from the user's head to feet. It is connected to a specified number of main axes.
Fig. 5. The fixed-axes mode in iPC allows revealing patterns in the data by scaling and shearing (left, center). Unlocking the user from the presentation makes it possible to analyze the data from unconstrained viewing angles (right).
Interaction: iPC provide intuitive means for interacting with the representation, again mostly via position changes. We propose two different interaction modes: (1) nearest-axes and (2) fixed-axes mode. In nearest-axes mode, the user axis connects only to a specified number of axes that are closest to the current user position (Figure 4, right). In fixed-axes mode, the connected axes are kept fixed. Changes in user position result in different projections of the data (Figure 5, left and center). Our tests revealed that during visual analysis the nearest-axes mode is useful to find interesting dimensions in the data. The fixed-axes mode is more suited for a detailed analysis of axis pairs and configurations. The fact that the user axis is aligned with the user creates a viewing angle to the data that inherently prioritizes lines closer to the eyes of the viewer. To allow for independent views of the data, we provide means similar to those proposed for iSP to disable user tracking (Figure 5, right). We also support wand interaction, allowing for different viewing transformations not supported by HBR. Properties: iPC allow the viewers to immerse themselves in the layout and to play an active role in data exploration. iPC result in a truly viewer-centered representation. The visualization ensures that the important user axis is always the axis closest to the viewer and thus is perceived in higher detail than any other axis. Changing the user position in nearest-axes mode is a natural manner to explore many axis configurations and to browse the visualization dimension-wise. The fixed-axes mode is a means to intuitively distort the representation of the projected data lines by scaling and shearing. In addition to the advanced 3D representation in IEs, this can help to reveal patterns that are not visible in a planar and static representation. As the applied distortion is determined by the respective user position, it can be changed effortlessly and is simply understood. We did not observe any disorientation or cybersickness for our implementations of iPC and iSP.
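As a small illustration of the nearest-axes mode, the sketch below shows how the connected axes could be chosen from the tracked position; the array shapes and the projection of the head position onto the axis plane are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def nearest_axes(axis_positions, user_position, k=3):
    """Return the indices of the k main axes closest to the viewer.

    axis_positions : (M, 2) array of axis base points on the display plane.
    user_position  : (2,) array, the tracked head position projected onto that plane.
    """
    distances = np.linalg.norm(axis_positions - user_position, axis=1)
    return np.argsort(distances)[:k]

# Fixed-axes mode simply freezes the set computed once at lock time:
# connected = nearest_axes(axis_positions, position_at_lock, k=3)
```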
6
Conclusion and Future Work
We showed that immersive environments can provide many benefits for the visualization of abstract data. Compared to the common "flat" data display, immersive environments support a highly realistic representation of the three spatial
dimensions and natural means to interact in that space. We also introduced a novel concept for the immersive visualization of abstract data involving the user as part of the data or layout. Its implementation has been demonstrated by the adaptation of the scatter and parallel coordinates plot. Both showed the potential of the approach, but represent first attempts in this direction only. In future research we will evaluate the novel concept by comparisons to established 2D visualization techniques and comprehensive user tests. We will also be concerned with novel visualization strategies exclusively designed for immersive environments. First developments for cluster visualization based on transient spheres that can be entered in order to provide context for the associated cluster or its sub-clusters are promising. Such visualizations have the potential to lead to a completely novel class of data displays taking full advantage of the benefits provided by immersive environments.
References 1. Bowman, D., Raja, D., Lucas, J., Datey, A.: Exploring the benefits of immersion for information visualization. In: Proceedings of HCI International (2005) 2. Lau, H., Chan, L., Wong, R.: A VR-based visualization framework for effective information perception and cognition. In: Proceedings of Human System Interactions, pp. 203–208 (2008) 3. Lamm, S.E., Reed, D.A., Scullin, W.H.: Real-Time geographic visualization of world wide web traffic. In: Proceedings of International World Wide Web Conference, vol. 28, pp. 1457–1468 (1996) 4. Carlo, W.D.: Exploring multi-dimensional remote sensing data with a virtual reality system. Geographical and Environmental Modelling 4, 7–20 (2000) 5. Maletic, J.I., Leigh, J., Marcus, A., Dunlap, G.: Visualizing Object-Oriented software in virtual reality. In: Proceedings of the International Workshop on Program Comprehension, vol. 26, pp. 26–35 (2001) 6. Benford, S., Snowdon, D., Greenhalgh, C., Ingram, R., Knox, I., Brown, C.: VRVIBE: a virtual environment for co-operative information retrieval. Computer Graphics Forum 14, 349–360 (1995) 7. Card, S.K., Shneiderman, B.: Readings in Information Visualization - Using Vision to Think. Morgan Kaufmann, San Francisco (1999) 8. Fekete, J., Wijk, J.J., Stasko, J.T., North, C.: The value of information visualization. In: Kerren, A., Stasko, J.T., Fekete, J.-D., North, C. (eds.) Information Visualization. LNCS, vol. 4950, pp. 1–18. Springer, Heidelberg (2008) 9. Knight, C., Munro, M.: Comprehension with[in] virtual environment visualisations. In: Proceedings of the IEEE 7th International Workshop on Program Comprehension, pp. 4–11 (1999) 10. Santos, R.D., Russo, C., Santos, D., Gros, P., Abel, P., Loisel, D., Trichaud, N.: Mapping information onto 3D virtual worlds. In: Proceedings of International Conference on Information Visualization, pp. 379–386 (2000) 11. Kjellin, A., Pettersson, L.W., Seipel, S., Lind, M.: Different levels of 3D: an evaluation of visualized discrete spatiotemporal data in space-time cubes. Information Visualization 9, 152–164 (2009) 12. Ware, C., Franck, G.: Evaluating stereo and motion cues for visualizing information nets in three dimensions. ACM Trans. Graph. 15, 121–140 (1996)
13. Crossley, M., Davies, N.J., Taylor-Hendry, R.J., McGrath, A.J.: Three-dimensional internet developments. BT Technology Journal 15 (1997) 14. Arns, L., Cruz-Neira, C., Cook, D.: The benefits of statistical visualization in an immersive environment. In: Proceedings of the IEEE Virtual Reality Conference, pp. 88–95. IEEE Computer Society, Los Alamitos (1999) 15. Bowman, D., Lucas, J., North, C., Raja, D.: Exploring the benefits of immersion in abstract information visualization. In: Proceedings of International Immersive Projection Technology Workshop (2004) 16. Knight, C., Munro, M.: Mindless visualisations. In: Proceedings of ERCIM Workshop "User Interfaces for All" (2000) 17. Feiner, S., Beshers, C.: Worlds within worlds: Metaphors for exploring n-dimensional virtual worlds. In: Proceedings of the Symposium on User Interface Software and Technology, pp. 76–83 (1990) 18. Johansson, J., Forsell, C., Lind, M., Cooper, M.: Perceiving patterns in parallel coordinates: determining thresholds for identification of relationships. Information Visualization 7, 152–162 (2008)
Automated Fish Taxonomy Using Evolution-COnstructed Features Kirt Lillywhite and Dah-Jye Lee Department of Computer and Electrical Engineering, Brigham Young University, Provo UT, 84604, USA
Abstract. Assessing the population, size, and taxonomy of fish is important in order to manage fish populations, regulate fisheries, and evaluate the impact of man-made structures such as dams. Automating this process saves valuable resources of time, money, and manpower. Current methods for automatic fish monitoring rely on a human expert to design the features necessary for classifying fish into a taxonomy. This paper describes a method using Evolution-COnstructed (ECO) features to automatically find features that can be used to classify fish into a taxonomy. ECO features are obtained automatically and thus can quickly be adapted to new environments and fauna. The effectiveness of ECO features is shown on a dataset of four fish species where, using five-fold cross validation, an average classification rate of 99.4% is achieved. Keywords: Object Detection, Feature Construction, Self-Tuned, Adaboost, Fish Taxonomy.
1
Introduction
Assessing the population, size, and taxonomy of fish is important in order to manage fish populations, regulate fisheries, and evaluate the impact of man-made structures such as dams. Fish surveys are currently done using catch-and-release methods, diving, and fish tagging, all of which are costly and time-consuming. Automating this process can save valuable resources, increase the quantity of data available, and improve accuracy over current methods that are error-prone. Similar research has been published in this area for monitoring and measuring fish. Zion et al. use an underwater sorting machine to sort three species of fish in a fishery pond [1]. They use features that they designed by hand from fish silhouettes and a Bayes classifier to determine the species. Rova et al. use deformable template matching and a support vector machine to differentiate between two very similarly shaped fish [2]. Lee et al. compared contour segments between fish and a database to identify four target species [3]. Chambah et al. use hand-selected shape, color, texture, and motion features and a Bayes classifier to identify fish in an aquarium [4]. Cadieux et al. use silhouettes and again a set of hand-selected features with a combination of a Bayes classifier, learning vector quantization, and a neural network to classify fish [5].
Previous research methods for automated fish identification and taxonomy have depended on a human expert to design the features the identification algorithm uses. Adapting to other environments with a different fauna is difficult, time-consuming, and costly. In this paper, our previous work on Evolution-COnstructed (ECO) features [6] is used to automatically find features that are then used by AdaBoost to classify the different fish species. Using ECO features allows the system to easily adapt to new fauna and circumstances. In Section 3.1 the dataset used in this paper is briefly introduced. In Section 2 ECO features are explained. The experimental results are given in Section 3, and finally our conclusions are given in Section 4.
2
ECO Features
2.1
What Is an ECO Feature?
An ECO feature, as defined in Equation 1, is a series of image transforms, where the transforms and associated parameters are determined by a genetic algorithm. In Equation 1, V is the ECO feature output vector, n is the number of transforms the feature is composed of, Ti is the transformation at step i, Vi is the intermediate value at step i, φi is the transformation parameter vector at step i, and I(x1, y1, x2, y2) is a subregion of the original image, I, indicated by the pixel range x1, y1, x2, y2.

V = Tn(Vn−1, φn)
Vn−1 = Tn−1(Vn−2, φn−1)
...
V1 = T1(I(x1, y1, x2, y2), φ1)        (1)

Almost any transformation is possible, but we are mostly interested in those transforms that can be found in a typical image processing library. Table 1 lists the set of image transforms used, ψ, along with the number of parameters associated with each transform. The values that these parameters can take on for a given transform Ti make up the set ξTi. The number of transforms used to initially create an ECO feature, n, varies from 2 to 8 transforms. With the datasets that were used in testing it was found that the average ECO feature contained 3.7 transforms with a standard deviation of 1.7 transforms. The range of 2 to 8 transforms allowed the search to focus where it was more likely to find results. We wanted ECO features to be able to be long and complicated if that yielded good results, but found through experimentation that longer ECO features were less likely to yield good results. Figure 1 shows examples of two ECO features. The transformations of a feature are applied to a subregion of the image, which can range from a 1 × 1 pixel area to the whole image. An example of a subregion can be seen in Figure 5. Rather than making any assumptions about what the salient regions of the image are and defining a criterion for their selection, the genetic algorithm is used to search for the subregion parameters x1, y1, x2, y2.
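To make the chain-of-transforms idea concrete, the following hedged sketch applies a candidate ECO feature to an image subregion using OpenCV; the particular transforms, their parameters, and the gray_image variable are illustrative assumptions, not genes evolved in the paper.

```python
import cv2
import numpy as np

def apply_eco_feature(image, region, transforms):
    """Apply one ECO feature, i.e. a chain of parameterized transforms Ti(phi_i),
    to the subregion (x1, y1, x2, y2) selected by the genetic algorithm."""
    x1, y1, x2, y2 = region
    v = image[y1:y2, x1:x2].astype(np.float32)
    for transform, params in transforms:
        v = transform(v, *params)
    return v.flatten()          # output vector V fed to the weak classifier

# Example chain: a Sobel gradient followed by a median blur (parameter values are guesses).
chain = [
    (lambda img, dx, dy, k: cv2.Sobel(img, cv2.CV_32F, dx, dy, ksize=k), (1, 0, 3)),
    (lambda img, k: cv2.medianBlur(img, k), (5,)),
]
# gray_image is assumed to be a single-channel image loaded elsewhere:
# V = apply_eco_feature(gray_image, (12, 25, 34, 90), chain)
```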
Table 1. A list of image transforms available to the genetic algorithm for composing ECO features and the number of parameters |φ| the genetic algorithm must set for each transform (given in parentheses):
Gabor filter (6), Morphological Dilate (1), Histogram (1), Gradient (1), Convert (0), Median Blur (1), Integral Image (1), Rank Transform (0), Resize (1), Sobel operator (4), Gaussian Blur (1), Hough Lines (2), Normalize (3), Log (0), Canny Edge (4), Square Root (0), Census Transform (0), Pixel Statistics (2), Morphological Erode (1), Adaptive Thresholding (3), Hough Circles (2), Histogram Equalization (0), Laplacian Edge Detection (1), Distance Transform (2), Difference of Gaussians (2), Discrete Fourier Transform (1), Harris Corner Strength (3)
Fig. 1. Two example ECO features. The first example shows an ECO feature where the transforms are applied to the subregion where x1 = 12, y1 = 25, x2 = 34, and y2 = 90 from Equation 1. The values below the transforms are the parameter vectors φi also from Equation 1.
In this way the saliency of a subregion is not determined by the subregion itself, but by its ability, after being operated on by the transforms, to help classify objects. Subregions allow both global and local information to be captured. Local features are those located at a single point or small region of an image, whereas global features cover a large region or the entire image [7]. The use of subregions allows each ECO feature to specialize at identifying different aspects of the target object. 2.2
Constructing ECO Features
ECO features are constructed using a standard genetic algorithm (GA) [8]. GAs, in general, are used for optimization and for searching large spaces efficiently. They start with a population of creatures, representing possible solutions, which then undergo simulated evolution. Each creature is made up of genes, which are the parameters of that particular solution. A fitness score, which is designed specifically for the problem, is computed for each creature and indicates how good the solution is. At each generation, creatures are probabilistically selected from the population to continue on to the next generation. Creatures with higher
fitness scores are more likely to be selected. Other creatures are made through crossover, which combines the genes of two creatures to form one. Finally, the genes of each creature in the population can be mutated according to a mutation rate, which effectively creates a slightly different solution. The algorithm then ends at some predefined number of generations or when some criterion is satisfied. In our algorithm, GA creatures represent ECO features. Genes are the elements of an ECO feature, which include the subregion (x1, y1, x2, y2), the transforms (T1, T2, ..., Tn), and the parameters for each transform, φi. The number of genes that make up a creature is not fixed, since the number of transforms can vary and each transform has a different number of parameters. Initially, the genetic algorithm randomly generates a population of ECO features and verifies that each ECO feature consists of a valid ordering of transforms. In order to assign a fitness score to each ECO feature, a weak classifier is associated with it. A single perceptron is used as the weak classifier, as defined in Equation 2. The perceptron maps the ECO feature input vector V to a binary classification, α, through a weight vector W and a bias term b.

α = 1 if W · V + b > 0, and α = 0 otherwise        (2)

Training the perceptron generates the weight vector W according to Equation 3. Training images are processed according to Equation 1 and the output vector V is the input to the perceptron. The error, δ, is found by subtracting the perceptron output, α, from the actual image classification, β. The perceptron weights are updated according to this error and a learning rate λ.

δ = β − α
W[i] = W[i] + λ · δ · V[i]        (3)
A fitness score, s, is computed using Equation 4, which reflects how well the perceptron classifies a holding set. In Equation 4, p is a penalty, tp is the number of true positives, fn is the number of false negatives, tn is the number of true negatives, and fp is the number of false positives. The fitness score is an integer in the range [0, 1000].

s = (tp · 500)/(tp + fn) + (tn · 500)/(p · fp + tn)        (4)
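A compact sketch of the weak classifier and fitness computation described by Equations 2-4 follows; the learning rate, epoch count, penalty value, and the bias update (Equation 3 only specifies the weight update) are assumptions made for illustration.

```python
import numpy as np

def train_perceptron(V_train, labels, lr=0.1, epochs=10):
    """Weak classifier of Equations 2-3; labels beta are 0 or 1."""
    W = np.zeros(V_train.shape[1])
    b = 0.0                                   # bias update below is an assumption
    for _ in range(epochs):
        for V, beta in zip(V_train, labels):
            alpha = 1 if np.dot(W, V) + b > 0 else 0   # Equation 2
            delta = beta - alpha                        # Equation 3
            W += lr * delta * V
            b += lr * delta
    return W, b

def fitness(W, b, V_hold, labels, penalty=2.0):
    """Fitness score of Equation 4 on the holding set (integer in [0, 1000])."""
    preds = [1 if np.dot(W, V) + b > 0 else 0 for V in V_hold]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    # max(..., 1) guards against empty classes in small holding sets
    return int(tp * 500 / max(tp + fn, 1) + tn * 500 / max(penalty * fp + tn, 1))
```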
After a fitness score has been obtained for every creature, a portion of the population is selected to continue to the next generation. A tournament selection method is used to select which creatures move to the next generation. A tournament selector takes N creatures at random, and the creature with the best fitness score continues to the next generation. By adjusting N, the ability of creatures with lower fitness scores to move to the next generation can be tuned. Higher values of N prohibit creatures with low fitness scores from moving on to the next
Algorithm 1. FindingFeatures
for size of population do
    Randomly create creature:
        x1 ∈ [0, image width − 2], y1 ∈ [0, image height − 2],
        x2 ∈ [x1 + 1, image width − 1], y2 ∈ [y1 + 1, image height − 1],
        T1(φ1) ∈ [ψ], ..., Tn(φn) ∈ [ψ], φ1 ∈ [ξT1], ..., φn ∈ [ξTn]
end for
for number of generations do
    for every creature do
        for every training image do
            Process image with feature transformations
            Train creature's perceptron
        end for
        for every holding set image do
            Process image with feature transformations
            Use perceptron output to update fitness score
        end for
        Assign fitness score to the creature
        Save creature if fitness score > threshold
    end for
    Select creatures that make it to next generation
    Create new creatures using crossover
    Apply mutations to the population
end for
generation. Currently, N is set to 2, which allows weaker creatures to move on. After selection has taken place, the rest of the population is composed of new creatures created through crossover, as shown in Figure 2. Through the process of crossover it is possible for ECO features to have a transform length, n, longer than 8, which is the cap placed on gene length when they are being created. Once the next generation is filled, each of the parameters in the creatures can be mutated, also shown in Figure 2. This whole process of finding features is summarized in Algorithm 1. 2.3
Training AdaBoost
After the genetic algorithm has completed finding good ECO features, AdaBoost is used to combine the weak classifiers into a stronger classifier. Algorithm 2 outlines how AdaBoost is trained. X represents the maximum number of weak classifiers allowed in the final model. The normalization factor in Algorithm 2 is set so that the sum of the error over all the training images is equal to 1.0. After training, the resulting AdaBoost model consists of a list of perceptrons and coefficients that indicate how much to trust each perceptron. The coefficient for each perceptron, ρ, is calculated using Equation 5, where δw is the error of the perceptron over the training images.

ρ = (1/2) · ln((1 − δw) / δw)        (5)
Fig. 2. Examples of crossover and mutation
Algorithm 2. TrainAdaBoost
Set of training images M
for every training image, m do
    Initialize δM[m] = 1/|M|
end for
for x = 0 to X do
    for every perceptron, w, do
        for every training image, m do
            if wrongly classified then
                δw += δM[m]
            end if
        end for
    end for
    Select perceptron with minimum error, Ω
    if δw[Ω] >= 50% then
        BREAK
    end if
    Calculate coefficient of perceptron using Equation 5
    for every training image, m do
        c = 1 if classified correctly by Ω, c = −1 else
        δM[m] = (δM[m] · e^(−ρ·c)) / Normalization Factor
    end for
end for
2.4
Using the AdaBoost Model
Figure 3 shows an example of classifying an image with an AdaBoost model containing three ECO features. The figure shows each feature operating on its own subregion of the image (see Equation 1). Also, as the subregions pass through the transforms, the intermediate results may vary in size from one step to the next. Each
Fig. 3. ECO features and their corresponding perceptrons are combined using Adaboost to classify an image as object or non-object
feature is accompanied by its trained perceptron. The output of each perceptron is combined according to Equation 6, where X is the number of perceptrons in the AdaBoost model, ρx is the coefficient for perceptron x (see Equation 5), αx is the output of perceptron x (see Equation 2), τ is a threshold value, and c is the final classification given by the AdaBoost model. The threshold τ can be adjusted to vary the tradeoff between false positives and false negatives.

c = 1 if Σ(x = 1..X) ρx · αx > τ, and c = 0 otherwise        (6)
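A short sketch of how a trained model could be evaluated at detection time, combining the Equation 5 coefficients with the weighted vote of Equation 6, is given below; it reuses the hypothetical apply_eco_feature helper sketched in Section 2.1 and is not the paper's implementation.

```python
import numpy as np

def coefficient(error):
    """Equation 5: trust placed in a weak classifier with weighted training error delta_w."""
    return 0.5 * np.log((1.0 - error) / error)

def adaboost_classify(model, image, tau=0.0):
    """Equation 6: weighted vote over the selected perceptrons.

    model : list of (rho, W, b, region, transforms) tuples, one per ECO feature.
    """
    score = 0.0
    for rho, W, b, region, transforms in model:
        V = apply_eco_feature(image, region, transforms)   # sketched earlier
        alpha = 1 if np.dot(W, V) + b > 0 else 0           # weak classifier output
        score += rho * alpha
    return 1 if score > tau else 0
```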
3
Results
3.1
Fish Dataset
The fish images used are from field-study images taken by our university's biology department. The fish were captured, photographed, and released. There are four fish species represented in the dataset: Yellowstone cutthroat, cottid, speckled dace, and whitefish. Samples of each species from the dataset are shown in Figure 4. In the dataset there are 246 Yellowstone cutthroat, 121 cottids, 140 speckled dace, and 174 whitefish. The raw images were pre-processed in order to make a dataset appropriate for object recognition. Each image was rotated so that the head of the fish was on the right side of the image. Then, each image in the fish dataset was cropped and resized to a standard 161 × 46 pixels. No color information is used because in many fish species recognition applications, color is either not present or not reliable due to water opaqueness and the inability to control lighting conditions. The ECO features then have to key into shape information in order to distinguish one species from another.
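A hedged sketch of this kind of preprocessing is shown next; the paper does not give its implementation, and the rotation and cropping (which were guided by the photographs themselves) are reduced here to an optional horizontal flip.

```python
import cv2

def preprocess_fish(image_bgr, head_on_left=False):
    """Convert a raw field photograph to the 161x46 grayscale format described above."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)   # color information is discarded
    if head_on_left:
        gray = cv2.flip(gray, 1)                         # put the head on the right side
    return cv2.resize(gray, (161, 46))                   # standard detection-window size
```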
Fig. 4. Examples of the four fish species. From the top row to the bottom row the species are Yellowstone cutthroat, cottid, speckled dace, and whitefish.
3.2
ECO Feature Visualization
Figure 5 shows a visualization of an ECO feature used to identify a whitefish from other fish species. Box 1 in Figure 5 shows an original whitefish image and indicates the subregion being operated on. Boxes 2, 3, and 4 show the average output over all whitefish images after each transform, Vi. The transforms are two Gabor filters with different parameters followed by a median blur. Boxes 5 and 6 are the weights of the perceptron, with box 5 showing the magnitude of the negative weights and box 6 the magnitude of the positive weights. After the median blur (box 4), there are a few bright spots in the image that are predominant in the whitefish images. The brightest spots appear on the adipose fin (the small fin behind the main dorsal fin on the back of the fish), the front area right before the dorsal fin, and a section of the body before the tail. In box 6, the perceptron shows that it is paying attention to several aspects of the image, but particular attention to the adipose fin, as indicated by the bright spot in that region. Box 5 shows the negative weights from the perceptron and indicates aspects that should not appear if the image is that of a whitefish. The figure indicates that the ECO feature is cuing into several aspects of the fish's shape, but especially the adipose fin. 3.3
Cross Validation
Five-fold cross validation was performed to test the ability of ECO features to distinguish each fish species. Each image in the dataset of a chosen species was treated as a positive example and all the other species made up the negative examples. Using five-fold cross validation, one fold is treated as the test set and
Fig. 5. Visualization of an ECO feature used to identify a whitefish from other fish species. Box 1 shows an original whitefish image and indicates the subregion selected by the genetic algorithm on which to operate. Boxes 2, 3, and 4 show the average output over all whitefish images after each transform, Vi. The transforms are two Gabor filters with different parameters followed by a median blur. Boxes 5 and 6 are the weights of the perceptron, with box 5 showing the magnitude of the negative weights and box 6 the magnitude of the positive weights.
Table 2. The classification accuracy (%) when doing five-fold cross validation (one fold for testing and four for training) for each species. One species is treated as the positive examples and the other species form the negative examples.
Species         Fold 1  Fold 2  Fold 3  Fold 4  Fold 5
Y. Cutthroat     99.3    99.3    100     100     100
Cottid           100     100     100     99.3    98.4
Speckled Dace    99.3    99.3    100     100     99.3
Whitefish        97.8    100     98.6    98.6    99.2
the remaining four folds are used for training the ECO features. For each fold and species, 10 to 16 ECO features are found before error rates rise above 50% during AdaBoost training. Once the ECO features are found, the images in the current fold are used to compute a classification accuracy. The results are given in Table 2, with the average classification accuracy being 99.4% and a standard deviation of 0.64%. This shows that the ECO features performed very well in discriminating between the four fish taxonomies. It took approximately three minutes to find the ECO features for a single fold during cross validation. Finding the ECO features is an off-line process and only needs to be run as the location and fauna change, as new species are added, or some other phenomenon changes the environment. This makes using ECO features very adaptable and easy to use, without sacrificing the accuracy of the system.
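The evaluation protocol can be sketched as follows; train_fn and classify_fn stand in for the ECO feature search and AdaBoost classifier of Section 2 and are placeholders, not functions from the paper.

```python
import numpy as np

def five_fold_accuracy(images, species, target, train_fn, classify_fn, seed=0):
    """One-vs-rest five-fold cross validation as reported in Table 2."""
    y = np.array([1 if s == target else 0 for s in species])
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(images)), 5)
    accuracies = []
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        model = train_fn([images[i] for i in train], y[train])
        correct = sum(classify_fn(model, images[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return accuracies
```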
4
Conclusion
ECO features provide an effective way to perform fish taxonomy classification. Our experiments demonstrate that ECO features obtain an average of 99.4%
classification accuracy with a standard deviation of 0.64% on our proposed dataset of four distinct fish species. No human expert was needed to design the features used to discriminate between fish taxonomies. If new fish are added to the environment, no human expert is needed to design a new feature set. If a fish with a radically different shape is introduced, new ECO features can be found automatically rather than waiting for an expert to make the necessary changes. The time needed to set up the system to recognize the fish taxonomies is minimal. Despite the advantages of an automated recognition system, accuracy is not sacrificed.
References 1. Zion, B., Alchanatis, V., Ostrovsky, V., Barki, A., Karplus, I.: Real-time underwater sorting of edible fish species. Computers and Electronics in Agriculture 56, 34–45 (2007) 2. Rova, A., Mori, G., Dill, L.: One fish, two fish, butterfish, trumpeter: Recognizing fish in underwater video. In: IAPR Conference on Machine Vision Applications (2007) 3. Lee, D., Archibald, J., Schoenberger, R., Dennis, A., Shiozawa, D.: Contour Matching for Fish Species Recognition and Migration Monitoring. Applications of Computational Intelligence in Biology, 183–207 (2008) 4. Chambah, M., Semani, D., Renouf, A., Courtellemont, P., Rizzi, A.: Underwater color constancy: enhancement of automatic live fish recognition. In: Color Imaging IX: Processing, Hardcopy, and Applications, vol. 5293, pp. 157–168 (2004) 5. Cadieux, S., Michaud, F., Lalonde, F.: Intelligent system for automated fish sorting and counting. In: International Conference on Intelligent Robots and Systems, vol. 2, pp. 1279–1284 (2000) 6. Lillywhite, K., Tippetts, B., Lee, D.J.: Self-tuned evolution-constructed features for general object recognition. Pattern Recognition (in press, corrected proof, 2011) 7. Roth, P., Winter, M.: Survey of appearance-based methods for object recognition. Technical report, Institute for Computer Graphics and Vision, Graz University of Technology (2008) 8. Mitchell, M.: An introduction to genetic algorithms. The MIT press, Cambridge (1998)
A Monocular Human Detection System Based on EOH and Oriented LBP Features Yingdong Ma, Xiankai Chen, Liu Jin, and George Chen Center for Digital Media Computing, Shenzhen Institutes of Advanced Technology, Shenzhen, China
Abstract. This work introduces a fast pedestrian detection method that detects humans based on the boosted cascade approach using an on-board monocular camera. The Edge Orientation Histogram (EOH) feature and a novel Oriented Local Binary Patterns (Oriented LBP) feature are used in this system. The combination of these features captures salient features of humans and, together with a rejection cascade, achieves an efficient and accurate pedestrian detection system. A temporal coherence condition is employed to reject false positives from the detection results. For a video sequence with a resolution of 320 × 240 pixels, experimental results demonstrate that the proposed approach runs at about 16 frames/second, while maintaining a detection rate similar to existing methods. Keywords: Human detection, edge orientation histogram, local binary patterns, histograms of oriented gradients, AdaBoost.
1
Introduction
Finding people in image sequences is a key step for a wide range of applications, including intelligent vehicles, video surveillance, and mobile robotics. In recent years, techniques for efficiently and accurately detecting standing and walking pedestrians have received continuously increasing interest. However, humans have been proven to be difficult objects to detect, especially in highly cluttered urban environments and using moving on-board cameras. The high variance in postures, clothing, occlusions, and illumination conditions presents some common challenges in pedestrian detection. Various powerful human detection methods have been developed, but most of these methods are designed for detecting people in static images. These methods focus on finding powerful features [1, 2] or using combinations of various classifiers [3, 4] to obtain a high detection rate. Although some approaches make use of motion features to detect humans from video sequences, they have some limitations, e.g., using a fixed camera [5] or utilizing complicated features [6]. However, for applications such as on-line human detection for robotics and automotive safety, both efficiency and accuracy are important issues that should be considered carefully. In this work, an efficient and accurate pedestrian detection framework is introduced. The novelty of this approach is the introduction of a new feature set to better model the class of pedestrians under on-board human detection scenarios. It combines the silhouette-based EOH feature and a local texture descriptor (Oriented
LBP). An AdaBoost classifier is employed to construct the weak learners for each stage of the cascade. A temporal coherence condition is employed to reject false positives from the detection results. The aim of the proposed system is to achieve accurate human detection while remaining efficient for applications that require fast human detection and tracking. The remainder of this paper is organized as follows. Section II briefly reviews some recent work on human detection in static and moving images. Section III describes the proposed features and Section IV provides a description of the object validation and object tracking system. The implementation and comparative results are presented in Section V. Finally, Section VI summarizes the results.
2
Previous Work
There is a rich literature on human detection, and we focus here on widely used recent techniques. Systematic overviews of related work in this area can be found in [7] and [8]. Various human detection methods can be divided into two leading approaches. One approach uses a single sliding window, whereas the other uses a parts-based detector. The former approaches, which employ a full-body detector to analyze a single detection window, are presently the most widely used method in human detection due to their good performance. The latter category works better in the case of partial occlusion between pedestrians. There are two main directions for the sliding-window-based method: developing more discriminative features to improve the system detection rate, or employing powerful training algorithms to learn better classifiers. Some widely used features include the Haar wavelet [9], HOG [1], shapelet [2], edge orientation histogram (EOH) [10], edgelet [11], region covariance [12], and LBP [13]. Researchers also explore various combinations of these features to achieve good results. In [14], multiple features including the HOG feature, the Haar wavelet feature, and the Oriented Histograms of Flow (HOF) feature are combined in an MPLBoost classifier. Hu et al. [15] build a feature pool using the gradient feature, the edgelet feature, and the Haar wavelet feature for their multiple-feature based pedestrian detection system. Different powerful classifiers have been introduced to perform pedestrian detection. Some successful pedestrian classifiers, which employ statistical learning techniques to map from features to the likelihood of a pedestrian being present, usually use some variant of boosting [16, 17], some type of support vector machine [18], or various combinations of these classifiers [19]. Single-sliding-window-based methods may fail to detect partially occluded pedestrians in crowded urban scenes. Some parts-based approaches have been proposed to deal with the occlusion problem. In these approaches, each part is detected separately and a human is detected if some or all of its parts are found. Wu and Nevatia [20] use the edgelet feature to train four part detectors separately with a boosting method. Responses of the part detectors are combined to form a joint likelihood model for pedestrian detection. In [21], a depth-based candidate selection method is employed to find possible pedestrians, and candidates are decomposed into six sub-regions that are easier to learn by individual SVM classifiers.
To detect pedestrians from video sequences, motion descriptors are also widely used in video-based person detectors. Viola et al. [5] build a detector for static-camera video surveillance applications by applying extended Haar wavelets over two consecutive frames of a video sequence to obtain motion and appearance information. In order to use motion for human detection from moving cameras, motion features derived from optic flow, such as histograms of flow (HOF), are introduced in [22] and have shown success in recent pedestrian detection systems [6, 14].
3 Features and Classifiers
Feature extraction is the first step in most sliding-window-based human detection algorithms. The performance of a pedestrian detection system often relies on the extracted features. Since the proposed system aims at on-line human detection in video sequences recorded from an on-board camera, efficient and accurate feature extraction plays a very important role in this approach. Although high-dimensional shape features, e.g., the HOG feature, combined with a linear SVM classifier achieve great success in image-based human detection applications, the people detectors in these approaches are usually not fast enough. Instead of using complex high-dimensional shape features, we introduce a discriminative local texture feature, Oriented LBP, which can be extracted quickly. An AdaBoost algorithm is then employed to train a human detector using the proposed feature and the EOH feature in an efficient cascaded structure. In the following we describe the features used in our system.
3.1 Feature Pool
In order to choose appropriate features for our human detection approach, we evaluate several popular features, including HOG, HOF, region covariance, LBP, and color co-occurrence histograms (CCH) [6]. Among these features, HOG, region covariance, and CCH have proven to be highly discriminative and suitable for human detection. The price is heavier computation. For example, CCH tracks the number of pairs of certain color pixels that occur at a specific distance. For a fixed distance interval and n_c quantized representative colors, there are n_c(n_c − 1)/2 possible unique, non-ordered color pairs with corresponding bins in the CCH. That is, in the case of n_c = 128, the CCH has 8128 dimensions. In addition, motion features are not included in the proposed system because the global motion caused by a moving camera cannot be eliminated efficiently. The changing background also generates a large optical flow variance. Gradient-Based Shape Feature. The HOG feature (histograms of oriented gradients), proposed by Dalal and Triggs [1], is one of the most successful features in pedestrian detection applications. HOG features encode high-frequency gradient information. Each 64×128 detection window is divided into 8×8-pixel cells, and each group of 2×2 cells constitutes a block, with a stride of 8 pixels in both the horizontal and vertical directions. Each cell is described by a 9-bin histogram of oriented gradients, whereas each block contains the 36-D concatenated vector of all its cells, which is
normalized to L2 unit length. A detection window is represented by 7×15 blocks, giving a 3780-D feature vector per detection window. Although dense HOG features achieve good results in pedestrian detection, this shape descriptor is too complex to evaluate, resulting in a slow detection speed. Processing a 320×240 scale-space image still requires about 140 ms on a personal computer with a 3.0 GHz CPU and 2 GB of memory. To speed up feature extraction, we choose the EOH feature proposed in [10]. Similar to HOG, EOH is also a gradient-orientation-based feature. The difference between HOG and EOH is that the HOG feature represents the gradient-orientation information of a detection window with one high-dimensional feature vector, while EOH uses several 1-D features, each characterizing one orientation of a single block at a time. Hence, the EOH feature can be integrated into a boosted cascade algorithm for efficient weak learner selection. Calculation of the EOH feature begins by performing edge detection in the image. The gradient magnitude G(x, y) and gradient orientation θ(x, y) at point (x, y) in image I are computed as

G(x, y) = ( G_x(x, y)^2 + G_y(x, y)^2 )^(1/2),   (1)

θ(x, y) = arctan( G_y(x, y) / G_x(x, y) ),   (2)

where G_x(x, y) and G_y(x, y) are the gradients at the point (x, y), which can be found by using the Sobel masks:

G_x(x, y) = Sobel_x ∗ I(x, y),   (3)

G_y(x, y) = Sobel_y ∗ I(x, y).   (4)

The gradient orientation is evenly divided into K bins, and the gradient orientation histogram E_k(B) in each bin k of block B is obtained by

E_k(B) = Σ_{(x, y) ∈ B} ψ_k(x, y),   (5)

ψ_k(x, y) = G(x, y) if θ(x, y) ∈ bin_k, and 0 otherwise.   (6)

A set of K EOH features of a single block is defined as the ratio of the bin value of a single orientation to the sum of all bin values:

EOH_k(B) = E_k(B) / ( Σ_{i=1..K} E_i(B) + ε ),   (7)
where ε is a small value that avoids the denominator being zero. Viola and Jones introduced the integral-image in [23] for fast evaluation of rectangular features. The integral-image method can be also used for fast EOH features computation. Oriented Local Binary Patterns. The EOH feature can be seen as an oriented gradient based human shape descriptor, while LBP feature serves as a local texture
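To make Eqs. (1)-(7) concrete, the following sketch computes the K EOH features of a single block with NumPy and SciPy. It is only an illustrative reference implementation under our own assumptions (no integral-image acceleration, unsigned orientations over 0° to 180°, and a simple bin assignment); it is not the authors' code.

```python
import numpy as np
from scipy import ndimage

def eoh_features(image, block, K=9, eps=1e-3):
    """Compute the K EOH features (Eq. 7) of one block.

    image : 2-D grayscale array
    block : (y0, y1, x0, x1) block boundaries in pixels (assumed layout)
    K     : number of orientation bins over 0..180 degrees
    eps   : small constant avoiding a zero denominator
    """
    # Sobel gradients (Eqs. 3 and 4)
    gx = ndimage.sobel(image.astype(float), axis=1)
    gy = ndimage.sobel(image.astype(float), axis=0)

    # gradient magnitude and unsigned orientation (Eqs. 1 and 2)
    mag = np.hypot(gx, gy)
    theta = np.rad2deg(np.arctan2(gy, gx)) % 180.0

    y0, y1, x0, x1 = block
    m = mag[y0:y1, x0:x1].ravel()
    t = theta[y0:y1, x0:x1].ravel()

    # orientation histogram E_k(B): sum of magnitudes per bin (Eqs. 5 and 6)
    bins = np.minimum((t / (180.0 / K)).astype(int), K - 1)
    E = np.bincount(bins, weights=m, minlength=K)

    # EOH_k(B): one bin relative to the sum of all bins (Eq. 7)
    return E / (E.sum() + eps)
```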
Oriented Local Binary Patterns. The EOH feature can be seen as an oriented-gradient-based human shape descriptor, while the LBP feature serves as a local texture descriptor. As a discriminative local descriptor, LBP was originally introduced in [24] and has shown great success in human detection applications [13]. The LBP feature has several advantages: it can filter out noisy backgrounds using the concept of uniform patterns [13], and it is computationally efficient. To calculate the LBP feature, the detection window is divided into blocks and a histogram is computed over each block according to the intensity differences between a center pixel and its neighbors. The neighbors whose intensities are larger than the center pixel's are marked as "1", otherwise they are marked as "0". The histograms of the LBP patterns from all blocks are then concatenated to describe the texture of the detection window. For a 64×128 detection window with 32 non-overlapping 16×16 blocks, the LBP feature is a 1888-D vector.
Fig. 1. Example of Oriented LBP feature computation (from left to right): original image, LBP image, pixel orientation and pixel magnitude, and orientation histograms of a cell (a 4×4 cell is shown for simplicity; the actual cell size is 8×8)
Fig. 2. Computing the pixel orientation and magnitude of the Oriented LBP feature. In this example we use a threshold of 20. See Section 5.2 for more details.
In this work we introduce a lower-dimensional variant of LBP, namely the Oriented LBP (see Fig. 1). Concatenating cell-structured LBP features leads to a long and less meaningful feature vector. We propose to combine two ideas from [1] and [12] to compute our Oriented LBP feature. First, we define the arch of a pixel as all continuous "1" bits of its neighbors. Then, the orientation θ(x, y) and magnitude m(x, y) of a pixel are defined as its arch's principal direction and the number of "1" bits in its arch, respectively. See Fig. 2 for an illustration. The pixel orientation is evenly divided into N bins over 0° to 360°. Then, the orientation histogram F_{C,n} in each orientation bin n of cell C is obtained by summing all the pixel magnitudes whose orientations belong to bin n in C:

F_{C,n} = Σ_{(x, y) ∈ C, θ(x, y) ∈ bin_n} m(x, y).   (8)
In our implementation, N is 8 for LBP_{8,1}. In this way, a 64×128 detection window with 32 non-overlapping 16×16 blocks has a 1024-D (4×8×32) Oriented LBP feature vector. Similar to the EOH feature, we calculate a set of N Oriented LBP features for each block as:

O_LBP_{B,n} = F_{B,n} / ( Σ_{i=1..N} F_{B,i} + ε ).   (9)
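The following sketch illustrates the Oriented LBP computation of Eqs. (8) and (9) for a single cell: each pixel's arch is found as the longest circular run of "1" bits among its 8 neighbors, its length gives the magnitude m(x, y), and the direction of its center gives the orientation θ(x, y). The neighbor ordering, the thresholded comparison, and the handling of all-zero/all-one patterns are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# 8 neighbour offsets (dy, dx); neighbour k is assumed to sit at angle 45*k degrees
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def oriented_lbp_cell(image, cell, N=8, thresh=20, eps=1e-3):
    """Oriented LBP features of one cell (sketch of Eqs. 8 and 9)."""
    y0, y1, x0, x1 = cell
    F = np.zeros(N)
    for y in range(max(y0, 1), min(y1, image.shape[0] - 1)):
        for x in range(max(x0, 1), min(x1, image.shape[1] - 1)):
            c = float(image[y, x])
            # "1" where the neighbour is brighter than the centre by more than thresh
            bits = [1 if float(image[y + dy, x + dx]) - c > thresh else 0
                    for dy, dx in OFFSETS]
            if sum(bits) in (0, 8):        # no arch, or full circle: skip (assumption)
                continue
            # longest circular run of "1" bits = the arch
            doubled = bits + bits
            best_len, best_start, run, start = 0, 0, 0, 0
            for i, b in enumerate(doubled):
                if b:
                    if run == 0:
                        start = i
                    run += 1
                    if best_len < run <= 8:
                        best_len, best_start = run, start
                else:
                    run = 0
            m = best_len                                     # magnitude: bits in the arch
            mid = (best_start + (best_len - 1) / 2.0) % 8    # arch centre index
            theta = (mid * 45.0) % 360.0                     # principal direction
            n = int(theta // (360.0 / N)) % N
            F[n] += m                                        # Eq. (8)
    return F / (F.sum() + eps)                               # Eq. (9)
```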
3.2 Classifier
Most recent pedestrian detection approaches choose some variant of boosting or some type of support vector machine as the classifier. In this section, we briefly review these typical learning algorithms. Support Vector Machines. SVMs learn the hyperplane that optimally separates pedestrians from the background in a high-dimensional feature space. Their main advantage is that the data used in SVMs can be of any type, e.g., scalar values or multi-dimensional feature vectors. However, pedestrian detection systems based on SVM algorithms have a higher computational load. Some approaches have been developed to speed up SVM-based detection. Zhu et al. [25] employed linear SVMs in a boosted cascade by choosing an SVM as the weak learner in each learning stage. AdaBoost. Boosting algorithms optimize the final decision-making process by learning a set of weak classifiers combined through their weighted votes. In each round, a weak classifier is chosen to minimize the training error. The weighted sum of all weak classifiers forms the final strong classifier. Boosting algorithms have several advantages. Firstly, the detection speed is faster when they are combined with a cascade structure, in which most non-pedestrian areas can be discarded in early stages. Secondly, different classifiers can be chosen as weak learners in a boosting algorithm, e.g., an SVM classifier [25]. Finally, compared to SVM algorithms, there are few parameters that need to be tuned in boosting algorithms. In this work, we use Real AdaBoost to learn classifiers since efficiency is an important requirement of our system. Instead of a Boolean prediction, the Real AdaBoost algorithm returns confidence-rated weak classifiers. Thus, it has a better discriminative character than the binary-valued AdaBoost algorithm [26].
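As an illustration of what "confidence-rated" means in practice, the sketch below shows a generic Real AdaBoost-style weak learner built on one scalar feature: the feature range is split into bins and each bin outputs h(x) = 0.5·ln(W+/W−), so the sign carries the decision and the magnitude the confidence. This is a textbook-style sketch, not the exact learner used in the proposed system.

```python
import numpy as np

def fit_real_stump(feature_values, labels, weights, n_bins=16, eps=1e-6):
    """Confidence-rated weak learner on one scalar feature (Real AdaBoost style).

    labels are +1 / -1; returns bin edges and per-bin real-valued outputs
    h(x) = 0.5 * ln(W+ / W-).
    """
    edges = np.quantile(feature_values, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, feature_values) - 1, 0, n_bins - 1)
    w_pos = np.bincount(bins, weights=weights * (labels > 0), minlength=n_bins)
    w_neg = np.bincount(bins, weights=weights * (labels < 0), minlength=n_bins)
    h = 0.5 * np.log((w_pos + eps) / (w_neg + eps))
    return edges, h

def predict_real_stump(edges, h, feature_values):
    bins = np.clip(np.searchsorted(edges, feature_values) - 1, 0, len(h) - 1)
    return h[bins]
```

In a full training loop, the sample weights would then be updated as w_i ← w_i · exp(−y_i h(x_i)) and re-normalized before the next weak learner is selected.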
4 Detection Validation
The output of the pedestrian detection step is a set of independent bounding boxes showing possible locations of human objects. Due to the cluttered background and the limited number of positive/negative samples, the detection might produce some false alarms. In order to recover from these problems, we employ a detection validation step. Candidates detected using the Real AdaBoost algorithm are validated according to a multi-frame temporal coherence condition. Before that, a confidence measure of each candidate is computed:
conf(x) = Σ_{j=1..m} α_j h_j(x) / Σ_{j=1..m} α_j,   (10)
where α_j is the voting weight of weak classifier h_j(x), and m and n are the number of weak classifiers and the number of detected objects, respectively. Candidates whose confidence measures are lower than a threshold τ are discarded. In practice, we found that setting τ ∈ [0.65, 0.75] provides good results. We define the temporal coherence condition as follows. When an object is first detected, a Kalman filter is initialized to start a three-frame validation track. The Kalman filter predicts the object's location in the next frame. A new detection in a subsequent frame is assigned to this track if it coherently overlaps with the tracker prediction. In practice we set the overlap rate to 0.7. Only candidates meeting this condition in three consecutive frames are considered stable pedestrian objects and are marked as positive.
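A simplified version of this validation logic is sketched below; the Kalman prediction is abstracted as a predicted bounding box per frame, and the overlap rate is taken as intersection-over-union, which is an assumption on our part.

```python
def overlap_ratio(a, b):
    """Intersection-over-union of two boxes (x, y, w, h); a simple stand-in
    for the paper's overlap rate."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def validate_track(predicted_boxes, detections_per_frame, min_overlap=0.7):
    """Accept a candidate only if a coherent detection is found in each of
    three consecutive frames (predicted_boxes come from the Kalman filter)."""
    for pred, dets in zip(predicted_boxes[:3], detections_per_frame[:3]):
        if not any(overlap_ratio(pred, d) >= min_overlap for d in dets):
            return False
    return True
```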
5 Experimental Results
The proposed system is implemented on a personal computer with a 3.0 GHz CPU and 2 GB of memory running the Windows XP operating system and the OpenCV libraries. For 320×240-pixel images, the implementation of the proposed system runs at about 16 frames/second. We created several training and test video sequences containing thousands of positive (pedestrian) and negative (non-pedestrian) samples. The video sequences are recorded using an on-board monocular camera at a resolution of 640×480 pixels in busy pedestrian zones. The well-known INRIA person dataset is employed to evaluate the performance of the EOH-Oriented LBP feature. We also compare the detection results of the proposed system and several common pedestrian detection methods on both the INRIA dataset and our recorded video streams.
5.1 Dataset and Implementation
The training set of the INRIA person dataset has 2416 positive (human) images of 64×128 resolution and 1218 negative (non-human) images. A set of 1076 positive examples of 64×128 resolution is also collected from our two training video sequences. The negative training set, with about two million negative images, is collected by randomly sampling sub-images from the INRIA negative training images and from one training video sequence without any human objects in the scene. In our implementation, the edge orientation is evenly divided into nine bins over 0° to 180° for the EOH feature, and the pixel orientation of the Oriented LBP feature is divided into 8 bins over 0° to 360°. In a 64×128 detection window, the feature pool consists of 62532 EOH features and 55584 Oriented LBP features.
5.2 Oriented LBP Parameters
The performance of an LBP descriptor is affected by two parameters: the intensity difference threshold and the orientation binning. The LBP feature is sensitive to the way in which intensity differences between a center pixel and its neighbors are computed. We test LBP features computed using various intensity difference thresholds
ranging from 0 to 80. In practice, using a threshold of 20 gives the best results. Setting the threshold too low introduces much noise in the LBP image, while a high threshold loses too much detail. Fig. 3 illustrates a set of LBP images of two training examples, and a performance comparison of LBP features using different thresholds is plotted in Fig. 4(a).
Fig. 3. Two training images (left column) and their LBP images computed by using various intensity difference thresholds including 0, 10, 20, 40, and 80 (from left to right)
The number of orientation bins is another important parameter for good performance. In our implementation, the pixel orientation is evenly divided into eight bins over 0° to 360° to compute the LBP feature. We evaluate the impact of different orientation bin numbers ranging from 4 to 9 bins over 0° to 360°. As Fig. 4(b) shows, increasing the number of orientation bins from 4 to 8 improves detection performance, but increasing the number of bins to 9 makes little difference.
Fig. 4. Performance comparison of various threshold values and different orientation bin numbers: (a) performance comparison of LBP features using different thresholds, (b) impact of various orientation bin numbers
5.3 Performance of Different Features with Real AdaBoost Algorithm
We compare five methods with a cascade structure, and the results are shown in Fig. 5(a). The five methods employ Haar wavelets, the EOH feature, the LBP feature, the HOG feature,
and the proposed combination of EOH and Oriented LBP features in the Real AdaBoost algorithm, respectively. The minimum detection rate and the maximum false positive rate of each AdaBoost stage are set to 99.95% and 50%. By treating each bin of the HOG feature as an individual feature, we implement HOG-AdaBoost pedestrian detection based on the INRIA dataset. All of the HOG parameter settings follow those suggested in [1]. As shown in Fig. 5(a), the HOG feature achieves a higher detection rate than the Haar wavelet, EOH, and LBP features. This confirms that gradient-orientation-based features are better suited for human detection. The performance of the Haar wavelet feature is worse than that of the other features, which reflects the fact that the intensity pattern of the human face is simpler than that of the human body. Hence, the Haar wavelet feature is more suitable for human face detection applications. Fig. 5(a) also shows that the combination of EOH and Oriented LBP features always outperforms the others on the INRIA dataset. This demonstrates that both the gradient-orientation-based feature and the local-intensity-based feature provide useful information for human detection and contribute to the higher detection rate.
Fig. 5. Performance comparison of different features and different classifiers: (a) detection performance of different features with linear SVM classifier, (b) performance comparison between linear SVM and AdaBoost classifiers

Table 1. Processing speed of different features
Features                 Haar    HOG    LBP    EOH    EOH-Oriented LBP
Processing speed (fps)   19.0    11.8   21.4   17.6   15.8
5.4 Comparison of SVM and AdaBoost Classifier
SVM and AdaBoost are the two most popular classifiers and are widely used in various pedestrian detection systems. We evaluate the performance of these classifiers using Haar wavelet, HOG, and LBP features on the INRIA dataset. Since the EOH feature is similar to the HOG feature, we use the HOG feature to evaluate the performance of the EOH feature in an SVM classifier. The implementations of the HOG and LBP features in the SVM algorithms follow those suggested in [27].
Fig. 6. Some pedestrian detection examples at equal error rate, Top row: detection results of the proposed EOH and Oriented LBP feature from INRIA dataset, Middle row: detection results of the HOG-SVM algorithm from INRIA dataset, Bottom row: detection results of the proposed approach from our video sequences
From Fig. 5(b) we observe that SVMs achieve a higher detection rate than AdaBoost algorithms using the same features. However, the AdaBoost algorithms reduce the computational load, which leads to a more efficient human detection approach. For video sequences with a resolution of 320×240 pixels, the detection time of the AdaBoost algorithm using the EOH and Oriented LBP features is about 63.15 ms per image, roughly 30% faster than that of the HOG feature in a linear SVM classifier (84.47 ms per image). The processing speed of various features on the same video stream using an AdaBoost classifier is shown in Table 1.
5.5 Detection Results with EOH-Oriented LBP Feature
Fig. 6 shows some pedestrian detection results on testing images from the INRIA dataset and our recorded video sequences. The top row shows the detection results obtained with the augmented system, while the examples illustrated in the middle row are based on the HOG-SVM algorithm. Some pedestrian detection results of the proposed system on our video sequences are given in the bottom row. As shown in Fig. 6, the proposed approach achieves a lower false positive rate on most testing images. On the other hand, a detection validation step is useful for most up-to-date human detection approaches, since false positives are almost inevitable in real-time pedestrian detection using classifiers based on low-level features.
6 Conclusions
In this work a new pedestrian detection system is introduced, which aims at extracting and tracking human objects from video streams with high efficiency and accuracy.
We demonstrate that the proposed human detection algorithm outperforms some up-to-date approaches while maintaining a fast detection speed. This is achieved by integrating the EOH feature with an Oriented LBP feature. The EOH feature extracts gradient-orientation-based shape information, whereas the Oriented LBP feature records local-intensity-based information. In this way, a set of highly discriminative but less complex features is obtained for each detection window. To recover from false positives, a detection validation method based on a temporal coherence condition is employed to reject possible false alarms.
References 1. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, vol. 1, pp. 886–893 (2005) 2. Sabzmeydani, P., Mori, G.: Detecting Pedestrians by Learning Shapelet Features. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1–8 (2007) 3. Paisitkriangkrai, S., Shen, C., Zhang, J.: Fast Pedestrian Detection Using a Cascade of Boosted Covariance Features. IEEE Transactions on Circuits and Systems for Video Technology 18(8), 1140–1151 (2008) 4. Li, S.Z., Zhang, Z.: FloatBoost Learning and Statistical Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9), 1112–1123 (2004) 5. Viola, P., Jones, M.J., Snow, D.: Detecting Pedestrians Using Patterns of Motion and Appearance. In: IEEE International Conference on Computer Vision, ICCV (2003) 6. Walk, S., Majer, N., Schindler, K., Schiele, B.: New Features and Insights for Pedestrian Detection. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1030–1037 (2010) 7. Geronimo, D., Lopez, A.M., Sappa, A.D., Graf, T.: Survey of Pedestrian Detection for Advanced Driver Assistance Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(7), 1239–1258 (2010) 8. Enzweiler, M., Gavrila, D.M.: Monocular Pedestrian Detection: Survey and Experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(12), 2179–2195 (2009) 9. Papageorgiou, C., Poggio, T.: A trainable system for object detection. International Journal of Computer Vision 38(1), 15–33 (2000) 10. Levi, K., Weiss, Y.: Learning Object Detection from a Small Number of Examples: the Importance of Good Features. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 53–60 (2004) 11. Wu, B., Nevatia, R.: Detection and Segmentation of Multiple, Partially Occluded Objects by Grouping, Merging, Assigning Part Detection Responses. International Journal of Compute Vision 82, 185–204 (2009) 12. Paisitkriangkrai, S., Shen, C., Zhang, J.: Fast Pedestrian Detection Using a Cascade of Boosted Covariance Features. IEEE Transactions on Circuits and Systems for Video Technology 18(8), 1140–1151 (2008) 13. Mu, Y., Yan, S., Liu, Y., Huang, T., Zhou, B.: Discriminative Local Binary Patterns for Human Detection in Personal Album. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1–8 (2008)
14. Wojek, C., Walk, S., Schiele, B.: Multi-Cue Onboard Pedestrian Detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, pp. 794–801 (2009) 15. Hu, B., Wang, S., Ding, X.: Multi Features Combination for Pedestrian Detection. Journal of Multimedia 5(1), 79–84 (2010) 16. Chen, Y., Chen, C.: Fast Human Detection Using a Novel Boosted Cascading Structure With Meta Stages. IEEE Transactions on Image Processing 17(8), 1452–1464 (2008) 17. Kim, T., Cipolla, R.: MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features. In: Neural Information Processing Systems Foundation, NIPS (2008) 18. Lin, Z., Davis, L.S.: A pose-invariant descriptor for human detection and segmentation. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 423–436. Springer, Heidelberg (2008) 19. Cao, X., Qiao, H., Keane, J.: A Low-Cost Pedestrian-Detection System With a Single Optical Camera. IEEE Transactions on Intelligent Transportation Systems 9(1), 58–67 (2008) 20. Wu, B., Nevatia, R.: Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In: IEEE International Conference on Computer Vision, ICCV, pp. 90–97 (2005) 21. Alonso, I.P., Llorca, D.F., Sotelo, M.A., Bergasa, L.M., Revenga de Toro, P., Nuevo, J., Ocana, M., Garrido, M.A.G.: Combination of Feature Extraction Methods for SVM Pedestrian Detection. IEEE Transactions on Intelligent Transportation Systems 8(2), 292– 307 (2007) 22. Dalal, N., Triggs, B., Schmid, C.: Human Detection Using Oriented Histograms of Flow and Appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part II. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006) 23. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, pp. 511–518 (2001) 24. Ojala, T., Pietikäainen, M., Harwood, D.: A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29(1), 51–59 (1996) 25. Zhu, Q., Avidan, S., Yeh, M., Cheng, K.: Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR (2006) 26. Wu, B., Ai, H., Huang, C., Lao, S.: Fast Rotation Invariant Multi-View Face Detection Based on Real AdaBoost. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 79–84 (2004) 27. Wang, X., Han, T., Yan, S.: An HOG-LBP Human Detector with Partial Occlusion Handling. In: IEEE International Conference on Computer Vision, ICCV (2009)
Using the Shadow as a Single Feature for Real-Time Monocular Vehicle Pose Determination Dennis Rosebrock, Markus Rilk, Jens Spehr, and Friedrich M. Wahl Institut für Robotik und Prozessinformatik, Technische Universität Braunschweig Mühlenpfordstraße 23, 38106 Braunschweig, Germany {d.rosebrock,m.rilk,j.spehr,f.wahl}@tu-bs.de http://www.robotik-bs.de
Abstract. In this work we propose a way to detect vehicles in monocular camera images and determine their position and orientation on the ground plane relative to the camera. The camera does not need to be stationary which allows the method to be used in mobile applications. Its results can therefore serve as an input to advanced driver assistance systems (ADAS). The single feature used is the shadow beneath the vehicles. We implemented a real-time applicable method to detect these shadows under strongly varying conditions and determine the corresponding vehicle pose. Finally we evaluate our results by comparing them to ground truth data. Keywords: driver assistance, shadow detection, vehicle pose.
1 Introduction
A prerequisite for advanced driver assistance systems (ADAS) is extensive knowledge about the environment of a vehicle. This includes static obstacles (buildings, plants) as well as dynamic objects (pedestrians, bicyclists). Of special interest are other vehicles, as they potentially pose the greatest threat to one's own vehicle and its passengers. ADAS like adaptive cruise control (ACC) could benefit from this information, and it can also be used to determine the time to collision. The problem of vehicle detection in images from moving monocular cameras has been solved in many different ways in various works. Most of them consider well-structured environments, such as highways, which ensures that other cars can only be seen from a specific angle. The detection algorithms are adapted to this by searching for visual cues which characterize the occurring views of vehicles. Popular cues are symmetry [15], edges [3] or texture [2]. Please consult Sun's review [13] for further elaborations. A well-known feature of vehicles is the shadow cast by them. Many works try to dispose of the shadows as they decrease the detection performance of their algorithms. Only by Johansson et al. in [6] are the shadows actually used as a
Fig. 1. Detection results of our approach in a real-world scene
feature to enhance the pose estimation process. A common method for shadow segmentation with stationary cameras is to first create an image background model, then identify the foreground in the current image and subsequently segment it into shadow and non-shadow regions. Owing to the prerequisite of a stationary camera these works focus on surveillance applications. Prati et al. provide a comparative study in [12]. In this work we are proposing a method to determine the pose of vehicles from a single moving camera. The only feature used is the dark and diffuse shadow beneath the object, the umbra, which is visible whenever daylight is present. It was first described as a “sign pattern” for obstacles by Mori et al. in [10] and used for vehicle detection in various works, e.g. [14], [7] and [8]. Almost all works cited in this paper (except [6]) focus on finding other cars in a camera image, but do not try to determine the exact pose of those vehicles relative to the camera. All of them are lacking a quantitative evaluation of their methods. To do that, the challenge of obtaining ground truth data has to be overcome. There are works which gather 3d information about vehicles, e.g. the frontal distance in [4] or the vehicle pose in [11] by using stereo cameras. The work by Barrois et al. [1] as well uses stereo cameras and is the only one to our knowledge which actually evaluates the accuracy of the proposed method by comparing the results to ground truth data. The main contribution of this work is a method to determine the 3d pose of vehicles in real time, solely based on their shadows and by utilizing a single camera. Furthermore, the quality of the results is thoroughly tested. Section 2 introduces the shadow as a feature for vehicle detection. In Section 3, a method is presented which detects dark regions in camera images. These will be analyzed by the procedure described in Section 4 to determine the vehicle pose. Section 5 depicts an evaluation setup which was used to obtain the quality assessment shown in Section 6, where we compare our results to ground truth data. Section 7 concludes this work.
2 Shadows as Dark Regions
We are using shadows as a single cue for vehicle detection. Independent of the lighting conditions, shadows beneath obstacles belong to the darkest regions in
The Shadow as a Single Feature for Monocular Vehicle Pose Determination
565
an image. This covers all daylight scenarios, as there will always be a diffuse shadow, which is darker than the surrounding area, even darker than strong shadows caused by direct sunlight. The main problem that remains to be solved is the identification of a suitable color threshold which separates the darkest regions from their surroundings. Due to strongly varying lighting conditions, which can be induced by weather changes or environment variations (e.g. when the camera enters tunnels or forests), the use of a static threshold is not appropriate. One way of solving this problem is presented by Tzomakas et al. in [14]. They use horizontal edges caused by shadows to refine the vehicle position in the image. The shadow regions are identified by comparing them to a Gaussian distribution, which models the color of road pixels. As we want to determine the pose (including angle) of other vehicles, our scenario is much more general since shadow edges can not be assumed to have a constant angle in all images. Furthermore we are trying to detect the umbra, which in most cases does not have strong edges but a smooth transition to non-shadowed or penumbra regions. Therefore we take a different approach. We use binarized images, but apply multiple thresholds instead of a single static one. Each binarization result is analyzed subsequently and the gathered information is joined to obtain hypotheses of shadow regions.
3 Vehicle Detection
For vehicle detection based on shadows we make use of a method which was first described by Mallot et al. in [9], called inverse perspective mapping (IPM). This requires the knowledge about the 6d pose (translation + orientation) of
Fig. 2. Detection of dark regions. (a) Scenario with two model cars. (b) IPM image of the model scenario with camera position and exemplary viewing ray. (c) Binarization result for a threshold of color value 32 (after converting image to grayscale). (d) Closeup of the contour of the dark region, non-occluded pixels are marked. (e) Detection result after executing Algorithm I. (f) Evaluation setup with world coordinate system w and vehicle side coordinate systems v1 and v2 .
the ground plane relative to the camera. The IPM can be considered as a back-projection of the camera image to the ground plane, generating an image which is free of any perspective distortions for regions which actually lie on the ground plane. All other regions, though, will appear strongly distorted. For an example see Fig. 2b. We use the IPM image because the cues we are looking for, the shadows, by definition lie on the ground plane. Furthermore, the structure of the projected image introduces additional interpretation potential, e.g., line of sight, which can be exploited to make our method more robust and reduce the computational load. To detect vehicles, we search the IPM image for dark regions, which is done by executing Algorithm 1 (Fig. 3a). It iterates through the binarization thresholds and analyzes all pixels with color values that lie between the current and the last threshold. The analysis determines whether or not a pixel is occluded by a darker pixel. It is said to be occluded if the viewing ray towards it starting from pcam intersects a formerly marked dark pixel before reaching the pixel itself (cf. Fig. 2d). Occluded pixels are neglected for further processing, as by definition only the darkest pixels can belong to a shadow region. See Fig. 2e for an exemplary result of the detection algorithm. Algorithm 1 involves two time-consuming steps: (1) the identification of all pixels with color values within a specified range, and (2) the determination whether a pixel is occluded by a darker one. Because one of our main goals is real-time applicability, we take several measures to speed up the calculation. The first step is optimized by making use of a variation of the bucket sort algorithm, called counting sort. It utilizes a color histogram of the n pixels of an image to quickly sort them by ascending color value. The time complexity for a number of k colors is O(n + k). As a result we get the sorted list of all pixel indices and an array of positions which mark the beginning of a new color in this list. This allows easy access to all pixels with a desired color value. For a detailed description see e.g. [5]. The processing time of the second step above is reduced by executing the following algorithms: Algorithm 2 (Fig. 3b) generates a look-up table Ilut which will be used to quickly identify pixels in the projected image Iproj that are occluded by already accepted shadow region candidates. The look-up table has the same dimensions as Iproj. Each element is a list of those neighboring pixels which lie behind the corresponding pixel in Iproj, when seen from the position of the camera. Algorithm 3 (Fig. 3c) recursively marks all pixels behind a given one as occluded. This information is stored in the two-dimensional occlusion map IoMap. Rejecting occluded pixels in this way is the main reason for the gain in speed when processing all binarized images. It can be shown (but will not be in this work due to space restrictions) that it reduces the complexity class from O(n^(3/2)) to O(n) for an image with n pixels.
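The counting-sort step can be sketched as follows: a color histogram and its prefix sums give, for every gray value, the start position of that value in the sorted index list, so all pixels with values in (t − ts, t] are one contiguous slice of that list. This is an illustrative NumPy version, not the authors' implementation.

```python
import numpy as np

def counting_sort_pixels(gray_ipm, n_colors=256):
    """Sort the pixel indices of an 8-bit image by ascending gray value.

    Returns
    -------
    sorted_idx : flat pixel indices ordered by gray value
    starts     : starts[c] is the position in sorted_idx where color c begins
    """
    values = gray_ipm.ravel()
    hist = np.bincount(values, minlength=n_colors)        # color histogram, O(n)
    starts = np.concatenate(([0], np.cumsum(hist)))       # first slot of each color
    sorted_idx = np.empty(values.size, dtype=np.intp)
    cursor = starts[:-1].copy()                           # next free slot per color
    for idx, v in enumerate(values):                      # place pixels, O(n)
        sorted_idx[cursor[v]] = idx                       # (plain loop for clarity)
        cursor[v] += 1
    return sorted_idx, starts
```

With this structure, the pixels needed in one iteration of Algorithm 1, i.e., those with values in (t − ts, t], are simply sorted_idx[starts[t − ts + 1] : starts[t + 1]].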
for binarization threshold t = t0 to tmax, t += ts do
  p_t = { p | t − ts < Iproj(p) ≤ t }
  for all p ∈ p_t do
    if IoMap(p) = false then
      accept p as potential shadow pixel
      propagateOcclusions(p)
    end if
  end for
end for
(a) Alg. 1. Detection of dark regions

for all pixels of the IPM image do
  p = current pixel
  while p ≠ pcam do
    pnext = next pixel towards the camera
    if p ∉ occlusion list of Ilut(pnext) then
      add p to the occlusion list at Ilut(pnext)
    end if
    p = pnext
  end while
end for
(b) Alg. 2. Generate occlusion look-up table

function propagateOcclusions(p)
  if IoMap(p) = true then
    return
  end if
  for all pocc ∈ occlusion list Ilut(p) do
    propagateOcclusions(pocc)
    IoMap(pocc) = true
  end for
end function
(c) Alg. 3. Update occlusion map
Fig. 3. Algorithms 1, 2 and 3 (see text for explanations)
3.1 Region Segmentation
After processing all threshold images, we get an image like the one in Fig. 2e, where only our shadow pixel candidates from the originally projected camera image are marked. It reveals that the detection results can be noisy and insufficiently connected. This is why a segmentation procedure is necessary, which assigns dark pixels to coherent regions. To filter the noise we delete all pixels which have no immediate neighbor. Then we apply a dilation procedure with a rectangular kernel to close small gaps within supposedly connected regions. Subsequently, the image is labeled to identify pixel groups that potentially represent shadows. The resulting image can be seen in Fig. 4a. For simplicity, these groups will be called "shadow regions" from now on. Only one requirement has to be met by these regions to be analyzed for pose determination: their area has to exceed a certain size in the IPM image. In our experiments this size was arbitrarily set to an area corresponding to w_v^2/8, with w_v being a crude approximation of the vehicle width.
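The segmentation chain (isolated-pixel removal, dilation, labeling, area test) maps directly onto standard morphology routines; the sketch below uses scipy.ndimage, with the kernel size chosen arbitrarily and the w_v^2/8 area test as described above.

```python
import numpy as np
from scipy import ndimage

def segment_shadow_regions(dark_mask, vehicle_width_px):
    """dark_mask: boolean image of accepted dark pixels in the IPM image."""
    # 1. remove isolated pixels (no 8-connected neighbour)
    neighbours = ndimage.convolve(dark_mask.astype(int), np.ones((3, 3)),
                                  mode='constant') - dark_mask
    mask = dark_mask & (neighbours > 0)

    # 2. close small gaps with a rectangular dilation kernel (size assumed)
    mask = ndimage.binary_dilation(mask, structure=np.ones((3, 5)))

    # 3. label connected pixel groups ("shadow regions")
    labels, n_regions = ndimage.label(mask)

    # 4. keep only regions larger than w_v^2 / 8
    min_area = vehicle_width_px ** 2 / 8.0
    areas = ndimage.sum(mask, labels, index=np.arange(1, n_regions + 1))
    keep = [i + 1 for i, a in enumerate(areas) if a >= min_area]
    return labels, keep
```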
4 Pose Determination
As mentioned in the introduction, we aim at the detection and pose estimation of vehicles. Because the outlines of cars are mostly rectangular, their shadows can be assumed to be approximately rectangular as well. Their boundaries should therefore appear as straight lines in the IPM image. In practice this assumption is hardly ever met by the regions obtained by the process described in Section 3. Nevertheless it helps us to create robust algorithms for pose estimation. We assume that there are only two ways the car can appear in a camera image: 1. Only one side of it can be seen (front, rear or one of the sides), resulting in a linear umbra shadow region.
Fig. 4. (a) the segmentation result after filtering noise, applying dilation and labeling. Two vehicle candidate regions are identified (region 1 and 2). 3a, 3b and 3c do not meet the size requirement and are therefore ignored. (b) the results of the pose determination process for region 1 and 2 of (a). The positions are marked with c1 and c2 , the corresponding angles with ϕ1 and ϕ2 .
2. If seen from an angle where two sides of the vehicle shadow are visible, an L-shape appears in the projected image, consisting of a long and a short line that are perpendicular. The linearity assumption leads to the problem of line fitting for each region. A simple least-squares approach to fit lines to the shadow regions cannot be applied, as it is always possible that the region forms an L-shape instead of a straight line. Furthermore, the shadow pixels tend to be strongly influenced by noise, which results in a broad distribution of the pixels (see Fig. 4). This is the reason why line detection by a standard method such as the Hough transform or a RANSAC-based procedure fails to give good results. We therefore developed another method which works as follows: every two pixels which lie inside the same region are considered as a pair. If their distance is bigger than a lower threshold d_min and all pixels passed by the connecting line also lie in the same region, the pair is accepted and the angle of the line is determined. The angles cover the domain from 0 to π and are analyzed using an angular histogram h(ϕ) consisting of b bins with a width of w_b = π/b each. The construction of h(ϕ) can be described by the following equation:
h(ϕ) = Σ_{i=0..n_p} Ψ( ϕ − w_b/2 ≤ α_i < ϕ + w_b/2 ),   with   Ψ(s) = 1 for s = true, 0 for s = false,

with n_p being the number of pairs and α_i the angle defined by pair i. To facilitate the identification of maxima, each histogram is convolved with a Gaussian kernel of width 10. The unfiltered and filtered results for regions 1 and 2 of Fig. 4 can be seen in Fig. 5. Doing the angle analysis for every possible pixel pair is time consuming and, as experiments showed, not necessary. Instead, we sample the umbra pixels by randomly choosing a predefined number n_p of pairs. For our experiments we arbitrarily chose n_p = 5000, which significantly decreased the processing time but hardly affected the pose determination performance.
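The pair sampling and histogram construction can be sketched as follows. For brevity the check that the connecting line stays inside the region is omitted, the number of bins is a free parameter, and the Gaussian smoothing uses a generic 1-D filter, so this is an approximation of the described procedure rather than a faithful re-implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def angular_histogram(points, n_pairs=5000, n_bins=36, d_min=5.0, rng=None):
    """points: (K, 2) array of (x, y) pixel coordinates of one shadow region."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(0, len(points), n_pairs)
    j = rng.integers(0, len(points), n_pairs)
    d = points[j] - points[i]
    dist = np.hypot(d[:, 0], d[:, 1])
    ok = dist > d_min                                 # reject pairs that are too close
    alpha = np.arctan2(d[ok, 1], d[ok, 0]) % np.pi    # pair angles in [0, pi)
    h, edges = np.histogram(alpha, bins=n_bins, range=(0.0, np.pi))
    h_filtered = gaussian_filter1d(h.astype(float), sigma=2.0, mode='wrap')
    return h, h_filtered, edges
```

The main angle of the region is then the bin center of the global maximum of the filtered histogram, as described next.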
Fig. 5. Angular histograms of regions 1 and 2 of Fig. 4 (a). The lines with markers are the results of filtering the histograms (lines without markers) with a Gaussian kernel. Values are normalized to 1.
The global maximum of the angular histogram defines the main angle of the shadow region. The region’s position is determined by calculating the mean position of all pixels that contributed to this maximum.
5 Evaluation Setup
Our goal is to develop a method which uses a single camera mounted on a moving vehicle to detect other vehicles e.g. to facilitate the recognition of potentially dangerous situations. Therefore not just the detection of vehicles in images is of importance, but the determination of their exact pose relative to the own vehicle. We know that our method delivers qualitatively good results as can be seen in Fig. 1. But we want to measure the accuracy of our results to get a decent assessment of their quality. As it is difficult to get ground truth data with several vehicles moving relatively to each other, we chose an evaluation setup which enables us to thoroughly analyze our approach. We attached chessboard patterns to a car model and used them to exactly determine the vehicle pose (x, y, ϕ) on the ground plane. The ground plane was determined using a chessboard pattern lying on it, which also defines our reference coordinate system. ϕ is the angle between the x axis of the reference system and the main vehicle axis (front to back). (x, y) is the center position of the vehicle side on the ground plane. As described in Section 4, we do not directly determine the vehicle position, but the position of one of the sides. Therefore the positioning accuracy is measured relative to these points. A visualization of the setup can be found in Fig. 2f. During all experiments, the automatic exposure and gain adjustment of the camera was turned on which enabled us to capture images under strongly varying lighting conditions. The camera delivered 8bit grayscale images with a resolution of 1024x768 pixels and we used lenses with a focal length of 4mm and 12mm. Our model car is 25cm long and 9.6cm wide and was positioned at a distance of ≈ 1.1m from the camera. Most images were captured at the same late afternoon under varying conditions concerning intensity and direction of sunlight. Our detection algorithm as described in Section 3 delivers multiple but few (1 − 4) hypotheses for each image. For this evaluation, we do not explicitly verify each hypothesis but pick the one which lies closest to the ground truth. Results
where no hypothesis was found, the position distance was greater than 5 cm, or the angular distance was bigger than 22.5° were regarded as false negatives. All other results count as true positives. This enables us to determine a detection rate s = N_tp / (N_tp + N_fn), with N_tp and N_fn being the number of true positives and false negatives, respectively.
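The true-positive criteria and the detection rate defined above can be written compactly as, e.g.:

```python
def detection_rate(results, max_pos_err_cm=5.0, max_ang_err_deg=22.5):
    """results: list of (pos_err_cm, ang_err_deg), or None when no hypothesis
    was found; follows the true-positive criteria described above."""
    n_tp = sum(1 for r in results
               if r is not None
               and r[0] <= max_pos_err_cm
               and r[1] <= max_ang_err_deg)
    n_fn = len(results) - n_tp
    return n_tp / (n_tp + n_fn)
```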
6 Results
We started the evaluation by tuning our parameters to the conditions which best meet our assumptions. That is: only a diffuse shadow is visible which lies exactly beneath the vehicle, (see Fig. 6, first row). The threshold boundaries in Algorithm 1 were set to t0 = 1 and tmax = 100 and the step size ts to 3. This comfortably covers all our examples but still requires 33 binarization results to be analyzed for every camera image. We measured a detection rate of 97.0%, the position was determined with an average error of dp = (0.635, 1.662)cm and a standard deviation of σdp = (1.157, 1.611)cm. The average angular error lies at 4.877◦ with a standard deviation of σdϕ = 4.397◦. Concerning processing time, we achieved our goal of real-time applicability. On a standard PC with a 2.83Ghz Intel Core2 Duo CPU and 3GB of RAM the average time using a single core was 56.9ms which enables us to work at the camera frame rate of 15fps. The results are summarized in Table 1. Furthermore we tested our method with the same parameter set on various images which were captured under conditions that differed at times severely from
Fig. 6. Some detection results with diffuse shadows (first row), shadows from other objects (second row) and very strong penumbras (third row). The white lines visualize the ground truth with the white rectangle being the corresponding position. Detection results are shown as red lines, with the circle representing the obtained position.
Table 1. Detection performances for Ni test images, measured in detection rate s, average position error dp, standard deviation of position error σdp, average angular error dϕ, standard deviation of angular error σdϕ, and average processing time T

lighting    diffuse           varying conditions   all
Ni          369               1993                 2362
s           0.970             0.861                0.878
dp / cm     (0.545, 0.743)    (0.654, 1.853)       (0.635, 1.662)
σdp / cm    (0.914, 1.509)    (1.201, 1.565)       (1.157, 1.611)
dϕ / °      2.899             5.290                4.877
σdϕ / °     3.513             4.451                4.397
T / ms      56.9              55.3                 55.6
the optimal case. The violations are mainly induced by direct sunlight, either causing the vehicle shadow to lie (partly) inside a shadow cast by another obstacle (Fig. 6, second row) or creating a very strong and stretched penumbra (Fig. 6, third row). This forced us to introduce a preprocessing step, which selects the darkest 20% of the pixels within a shadow region to separate the umbra pixels from the penumbra pixels. Only those were used for pose determination. Still, we are able to robustly determine the vehicle pose with a slightly reduced precision and detection rate (see Table 1, columns 2 and 3). In general the results verify the applicability of our method. The offset can be explained by considering the location of our ground truth position, which lies at the vehicle side. As the dark shadow center usually lies beneath the vehicle, a certain distance is likely to occur. This distance depends on various aspects, such as lighting conditions, position of the wheels and height of the camera. We did not try to adjust our method to these circumstances and accepted the results as they were. Some of the effects could be diminished, though, by exploiting additional information such as camera height or direction of the sunlight. The positioning deviations are mainly induced by varying conditions and violations of our assumptions of the umbra lying beneath the vehicle and being clearly darker than the penumbra. Using a camera with a higher color resolution could make our method more robust, as smaller color changes could be detected.
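The darkest-20% preprocessing step can be expressed as a simple percentile cut, for example (function name and interface are ours):

```python
import numpy as np

def select_umbra_pixels(gray_values, fraction=0.20):
    """Keep only the darkest fraction of a shadow region's pixels
    (gray_values: 1-D array of the region's gray levels); illustrative only."""
    cutoff = np.quantile(gray_values, fraction)
    return gray_values <= cutoff   # boolean mask of retained (umbra) pixels
```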
7 Conclusion
In this work we proposed a method to detect vehicles in monocular camera images and determine their pose relative to the camera. We explained how shadow regions can be detected, even under strongly varying lighting conditions, and how they can be used to accurately determine the vehicle pose. The results were compared to ground truth data, proving the applicability of our method. To summarize: the vehicle shadow can be used as a single feature to robustly determine the pose of a vehicle. Real-time conditions are met due to various optimizations such as using counting sort, an occlusion map and pixel sampling for line determination.
In future works we are planning on implementing a tracking procedure, which is likely to improve the detection quality, especially with respect to standard deviation. Then as well other aspects will be covered, e.g. how multiple objects and occlusions can be handled and how an online adaption of the street pose can be realized.
References 1. Barrois, B., Hristova, S.: 3D pose estimation of vehicles using a stereo camera. In: 2009 IEEE Intelligent Vehicles Symposium, pp. 267–272 (June 2009) 2. Bebis, G., Miller, R.: On-road vehicle detection using Gabor filters and support vector machines. In: Proceedings of 14th International Conference on Digital Signal Processing 2002, pp. 1019–1022 (2002) 3. Broggi, A., Cerri, P.: Multi-resolution vehicle detection using artificial vision. In: 2004 IEEE Intelligent Vehicles Symposium, pp. 310–314 (2004) 4. Chang, J.Y., Cho, C.W.: Vision-Based Front Vehicle Detection and Its Distance Estimation. In: 2006 IEEE International Conference on Systems, Man and Cybernetics, pp. 2063–2068 (October 2006) 5. Cormen, T.H., Leiserson, C.E.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009) 6. Johansson, B., Wiklund, J.: Combining shadow detection and simulation for estimation of vehicle size and position. Pattern Recognition Letters 30(8), 751–759 (2009) 7. ten Kate, T., van Leewen, M.: Mid-range and distant vehicle detection with a mobile camera. In: 2004 IEEE Intelligent Vehicles Symposium, pp. 72–77 (2004) 8. Kim, S., Oh, S.: Front and rear vehicle detection and tracking in the day and night times using vision and sonar sensor fusion. In: International Conference on IEEE Intelligent Robots and Systems, IROS 2005, vol. 40, pp. 2173–2178 (June 2005) 9. Mallot, H.A., B¨ ulthoff, H.H.: Inverse perspective mapping simplifies optical flow computation and obstacle detection. Biological Cybernetics 64(3), 177–185 (1991) 10. Mori, H., Charkari, N.: Shadow and rhythm as sign patterns of obstacle detection. In: Proceedings of IEEE International Symposium on Industrial Electronics Conference, ISIE 1993, Budapest, pp. 271–277 (1993) 11. Nedevschi, S., Danescu, R.: A Sensor for Urban Driving Assistance Systems Based on Dense Stereovision. In: 2007 IEEE Intelligent Vehicles Symposium, pp. 276–283 (June 2007) 12. Prati, A., Mikic, I.: Detecting moving shadows: algorithms and evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(7), 918–923 (2003) 13. Sun, Z., Bebis, G.: On-road vehicle detection: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(5), 694–711 (2006) 14. Tzomakas, C., Seelen, W.V.: Vehicle Detection in Traffic Scenes Using Shadows. Tech. Rep. August, Institut fuer Neuroinformatik, Bochum (1998) 15. Zielke, T., Brauckmann, M.: Intensity and edge-based symmetry detection applied to car-following. In: Sandini, G. (ed.) ECCV 1992. LNCS, vol. 588, pp. 865–873. Springer, Heidelberg (1992)
Multi-class Object Layout with Unsupervised Image Classification and Object Localization
Ser-Nam Lim¹, Gianfranco Doretto², and Jens Rittscher¹
¹ Computer Vision Lab, GE Global Research, Niskayuna, NY 12309
² Dept. of CS & EE, West Virginia University, Morgantown, WV 26506
Abstract. Recognizing the presence of object classes in an image, or image classification, has become an increasingly important topic of interest. Equally important, however, is the capability to locate these object classes in the image. We consider in this paper an approach to these two related problems with the primary goal of minimizing the training requirements, so as to allow for ease of adding new object classes, as opposed to approaches that favor training a suite of object-specific classifiers. To this end, we provide the analysis of an exemplar-based approach that leverages unsupervised clustering for classification and sliding-window matching for localization. While such an exemplar-based approach by itself is brittle towards intraclass and viewpoint variations, we achieve robustness by introducing a novel Conditional Random Field model that facilitates a straightforward accept/reject decision on the localized object classes. The performance of our approach on the PASCAL Visual Object Challenge 2007 dataset demonstrates its efficacy.
1 Introduction

In recent years, the integration of the tasks of image classification and object localization has generated a great amount of interest [1,2,3,4,5,6,7,8,9,10,11]. The first task is concerned with labeling an image with tags describing the object classes depicted in it. The second task is concerned with localizing, typically by a bounding box, the objects described by such tags in the image. The rationale for combining these two tasks is that solving the first one would improve the solution to the second one and vice versa [3]. Additionally, it is of great common interest to not only know the presence of certain object classes in the image, but also their locations. Despite all the progress in image classification [12,13,14] and object detection [15,16,17,18,19], localizing and tagging objects in images are still challenging due to large intraclass variations, illumination and viewpoint changes, partial occlusions, as well as deformations. In light of these challenges, the sliding window approach (e.g., [3]) has been shown to be one of the more promising approaches towards solving such a multi-class object layout problem. The sliding window approach entails the design and use of a suite of trained binary classifiers, each for classifying a specific object class, followed by applying these classifiers to the image with a sliding window approach. Training a suite of object-specific classifiers is, however, a tedious process that typically requires many training images per class and a lengthy optimization procedure. For this reason also, adding a new object class becomes non-trivial. To this end, the
first contribution of this work comes from relaxing this requirement. That is, this work adopts unsupervised clustering to build a dictionary, from which a ranked list of object exemplars, each representing an object class, can be efficiently retrieved from a large database [20,21,22,23,24] when given a test image. By adopting an unsupervised learning approach, one immediate advantage is that the training efforts disappear, allowing for the ability to quickly add new object classes to be detected, which simply involves adding the exemplar images of a new class to the database, and running the clustering procedure. Additionally, the unsupervised clustering technique that we have adopted [22] allows for building a dictionary very quickly even with a large database. While not fully evaluated in this work, this ease of adding new object classes and dictionary construction makes the approach scalable with respect to the number of classes that can be represented. It is important to note that the use of such unsupervised clustering need not be restricted to retrieving exemplars, but in our case the retrieved object exemplars would allow one to determine an approximate object location simply by matching sliding windows with the retrieved exemplars at runtime. Given the exemplars and their approximate locations, the final object layout is obtained when the proposed exemplars in error are removed and the final locations of the remaining ones are known. Addressing this issue, which is the second contribution of the paper, is non-trivial because the use of exemplars faces the challenge of intraclass and viewpoint changes, and partial occlusion. A mitigating factor is the representation of each object class with a large number of exemplars [7,25]. We further address this by introducing a regularizing model based on a Conditional Random Field (CRF) [26,27]. At the core of it is a new PCA-based shape representation, called shape ordering that drives the CRF conditional distributions. The model can cope with occlusion, false positives, and missing features, and provides a straightforward way to declare an accept or reject decision on the localized object classes based on shape agreement. It is worthwhile to point out that even though we are fully aware of recent state-of-the-art approaches based on modeling and learning contextual object interactions [1,2,28,29], this work leaves that route for future developments, and concentrates here on designing an approach that requires minimal training and supervision (unlike work such as [3,29,30]). The third contribution of the paper is state-of-the-art performance on the PASCAL 2007 dataset [31]. This confirms that it is possible to address intraclass variations and viewpoint changes by representing a class with image exemplars, and by using a CRF to get rid of the resulting sensitivity due to the use of these exemplars, all while avoiding extensive classification and training.
2 Problem Setup

Given a test, or query, image I depicting N objects, where object o belongs to a class c_o, is located at position p_o, and has size s_o in the image, the object layout problem aims at identifying the N object classes present in the image, as well as the positions and sizes of the objects. The proposed approach makes the assumption that it is possible to capture a good chunk of the intraclass and viewpoint variability by collecting a sufficient number of
exemplar images containing tightly cropped objects of the same class. After collecting all the exemplars in a database, an unsupervised clustering method [7,21,22,23,24] can be used to build a dictionary from which exemplars similar to the image I can be retrieved. In this work, we use an implementation of the hierarchical vocabulary tree in [22]. This produces a ranked list of M exemplar images. M can be conveniently set to a large enough number, such that there is enough confidence that M ≥ N. Rather than being forced to use trained object classifiers during the localization of the objects in the image, the availability of the exemplars suggests a minimalistic approach. More precisely, one might attempt to live with an approximate localization, easily obtainable with a form of template matching. Here we rely on a multi-scale sliding window approach where at every scale and image location an image region descriptor is computed, and compared with the corresponding region descriptor of the exemplar. The scale and location corresponding to the best descriptor match constitute the approximate location and size of an object in the image I. Here we have used a fast implementation of the HOG descriptor [16]. After retrieving the exemplar images, a corresponding list of M candidate object classes ĉ_1, ..., ĉ_M is obtained, and after locating where each exemplar might fit within the image I, a list of approximate object positions p̂_1, ..., p̂_M, and sizes ŝ_1, ..., ŝ_M is available. Therefore, we have obtained a candidate object layout L̂_M = {(ĉ_o, p̂_o, ŝ_o)}_{o=1,...,M}. Obviously, if one of the classes in the test image is not represented by any of the retrieved exemplars, it will never be recovered¹. Also, the exemplars of interest may not be the first N in the proposed ranking. Needless to say that even N is not known at this point. Finally, even the location of the objects may be rather approximate. All these issues need to be addressed in order to go from the candidate object layout L̂_M to the actual layout L_N = {(c_o, p_o, s_o)}_{o=1,...,N} that is contained by the test image I. To do so, the search for L_N should be narrowed down to the point that the model in charge of estimating it is constrained enough to (a) suppress the M − N outlier object classes; (b) handle the remaining appearance and shape differences between exemplars and the objects in the image I, due to intraclass and viewpoint variations; (c) compute a refined estimation of the object positions and scales. All of the above is achieved with a new Conditional Random Field (CRF), which is described in Sec. 3. 2.1 Shape and Appearance Representations In order to estimate the layout L_N we use a representation for the test image I, and for the candidate exemplar images, that is based on SIFT features [32]. In particular, the appearance of I is given by its set of SIFT descriptors {f_i}, located at image positions {f_i} identified by the SIFT detector. Similarly, the appearance of an exemplar image representing the candidate object o is given by its SIFT descriptors {g_u^o}, located at the exemplar image positions {g_u^o}. We also introduce a set of planar transformations. More precisely, associated to every candidate object o, the layout L̂_M carries the information of the position p̂_o and size ŝ_o of
¹ Sect. 5 shows how important it is to set M large enough for good performance. Conversely, if M is too large the approach has to work harder to reject unwanted exemplars, with a reduction in overall performance.
the object in the test image. Based on these parameters, the transformation φo defines a mapping of the feature locations {guo } from the coordinate system of the exemplar image onto the coordinate system of the test image I. Given the test image I, represented by {(fi , fi )}, and given the candidate layout LM , represented by {(guo , φo (guo ))}o=1,··· ,M , the actual layout LN will be estimated in two steps: 1) Attempt to associate each of the test image features to one of the candidate layout features; 2) Based on the associations above accept or reject outlier object classes, and refine the object positions and sizes. Sec. 3 addresses step 1, whereas Sec. 4 addresses step 2.
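As a rough illustration of the approximate localization described above (multi-scale sliding windows compared against the exemplar through a region descriptor), the sketch below uses scikit-image's HOG. The scale set, stride, and cell sizes are illustrative choices of ours, not the paper's settings, and this is not the authors' fast HOG implementation.

```python
# Sliding-window, multi-scale matching of an exemplar with a HOG-like descriptor.
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def localize_exemplar(test_gray, exemplar_gray, scales=(0.5, 0.75, 1.0, 1.5), stride=16):
    ex_h, ex_w = exemplar_gray.shape
    ex_desc = hog(exemplar_gray, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    best = (np.inf, None)                                 # (distance, (x, y, w, h))
    for s in scales:
        w, h = int(ex_w * s), int(ex_h * s)
        if w < 16 or h < 16 or w > test_gray.shape[1] or h > test_gray.shape[0]:
            continue
        for y in range(0, test_gray.shape[0] - h + 1, stride):
            for x in range(0, test_gray.shape[1] - w + 1, stride):
                # Resize the window to the exemplar size so descriptors align.
                window = resize(test_gray[y:y + h, x:x + w], (ex_h, ex_w),
                                anti_aliasing=True)
                d = np.linalg.norm(hog(window, pixels_per_cell=(8, 8),
                                       cells_per_block=(2, 2)) - ex_desc)
                if d < best[0]:
                    best = (d, (x, y, w, h))
    return best[1]        # approximate bounding box, or None if nothing fits
```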
3 Conditional Random Field Model Given the feature representations of the test image and of the candidate layout, we are interested in attempting to match a test image feature f_i, located at f_i, to a feature g_u^o, located at g_u^o, in the exemplar representing the candidate object o. We indicate this matching with the variable assignment X_i = (o, u). The goal then is to estimate the quantity X ≐ (X_1, X_2, ...), given the observables Y ≐ {(f_i, f_i)} ∪ {(g_u^o, φ_o(g_u^o))}_{o=1,...,M}. Obviously, one might establish a match X_i = (o, u) by observing that the feature descriptors are similar enough (f_i similar to g_u^o), and that the feature positions are close enough (f_i close to φ_o(g_u^o)). Limiting ourselves to such a rule for the matching assignment would leave completely open the possibility that close-by test image features be assigned to close-by exemplar features that belong to different candidate objects. Since these are not our expectations, one would have to apply a form of regularization that involves, for instance, the use of a Markov Random Field (MRF) to model the pairwise matching assignments. Although certainly beneficial, using an MRF for regularization does not allow one to directly exploit the observable data to influence the pairwise assignments. A model that enables that is the Conditional Random Field [26,27] (CRF), here characterized by the following energy function

E(X, Y) = \sum_i D(X_i, Y) + \sum_i \sum_{j \in N_i} W(X_i, X_j, Y),   (1)
where D and W are the unary and pairwise energy terms. N_i indicates the neighborhood of X_i, and comprises all the X_j's such that |f_i − f_j| is less than a given threshold. 3.1 Unary Energy Term This term accounts for the cost of matching f_i at position f_i in the test image, with the feature g_u^o at position g_u^o in the exemplar representing the candidate object o. It should penalize a matching assignment between features located far apart and/or dissimilar in appearance. Therefore, the data cost is defined as

D(X_i, Y) = 1 - \exp\!\left(-\frac{|\mathbf{f}_i - \phi_o(\mathbf{g}_u^o)|^2}{2\mu^2}\right) + \nu\,\Delta(f_i, g_u^o),   (2)
Fig. 1. CRF pairwise assignments. Left image shows the exemplar images of a person head and torso (object o), and of a bicycle (object p). The objects are in the same configuration as they appear in the test image I, on the right. Such configuration is defined by the transformations φo , and φp . See also text in Sec. 3.2.
where the constant μ calibrates the spatial weighting of the first term, and ν calibrates the relative weight between the first and the second term. Δ computes the Bhattacharyya distance between f_i and g_u^o. Therefore, the first term introduces a penalty based on the distance between f_i and φ_o(g_u^o) (proximity among features). The second term introduces a penalty based on the appearance dissimilarity between f_i and g_u^o. 3.2 Pairwise Energy Term This term accounts for the cost of assigning the matches X_i = (o, u), and X_j = (p, v). This means that f_i is assigned to g_u^o, and f_j to g_v^p. Since the unary term takes care of assigning the matches based on the appearance, this term can simply focus on the shape. To better understand the meaning of the pairwise term, we consider the example depicted in Fig. 1. Let us consider two features in the test image on the right, f_i and f_j, located at f_i and f_j. f_i describes the head of a person riding a bike, and f_j describes the center of the front wheel of the bicycle. Let us also consider matching f_i and f_j with two features, g_u^o and g_v^p, from the head of a person exemplar o, and from the center of the rear wheel of a bicycle exemplar p, respectively, which are located at g_u^o and g_v^p. In such a case, the appearance similarity of the features would tend to assign a match between f_i and g_u^o, and between f_j and g_v^p. Note that this would not be a desirable outcome. To better get a geometric intuition of why this is the case, we draw a segment f_i f_j from the head to the front wheel in the test image, and a segment g_u^o g_v^p from the head to the rear wheel in the combined exemplar images on the left (we will later expand on such combination). Clearly, these two segments are not representative of homologous regions of the two compound person-bicycle objects. Therefore, they should not be matched. The pairwise energy is meant to achieve exactly this goal, through the use of the idea of shape ordering, which we now introduce. Shape Ordering. Given a point cloud on the plane, let us consider two specific points, a and b. The shape ordering of the pair (a, b) establishes an ordering of such points with respect to the cloud. Such a cloud might be representative of the shape of an imaged
object. In particular, let us consider the cloud of points {g_u^o}, given by the feature locations of the exemplar image representing object o. With respect to this cloud, the shape ordering of the pair (g_u^o, g_v^o) is a 2 × 1 matrix, computed according to the following expression

\sigma_{uv}^{o} \doteq \mathrm{sign}\!\left(V^{oT}(\mathbf{g}_v^o - \mathbf{g}_u^o)\right),   (3)

where V^o = [v_1^o, v_2^o] ∈ R^{2×2} are the principal components of the point cloud {g_u^o}, and sign is the usual function that is +1, 0, or −1 when the argument is positive, 0, or negative, respectively. To understand the meaning of σ_uv^o, let us focus on its first component, and the projections of g_u^o and g_v^o onto the first principal axis v_1^o. If the first component is +1 then g_u^o precedes g_v^o along v_1^o. We indicate this fact with the notation g_u^o ≺_1 g_v^o. Similarly, if the first component is −1 then g_u^o succeeds g_v^o along v_1^o (i.e., g_u^o ≻_1 g_v^o). Finally, if the first component is 0 then g_u^o and g_v^o are the same when projected onto v_1^o (i.e., g_u^o =_1 g_v^o). Analogous reasoning holds with respect to the second component of the shape ordering. More precisely, we have +1 ⇔ g_u^o ≺_2 g_v^o, −1 ⇔ g_u^o ≻_2 g_v^o, and 0 ⇔ g_u^o =_2 g_v^o. When a pair of feature locations (g_u^o, g_v^p) belongs to two different candidate objects, o and p, the shape ordering is defined as follows. The objects are related according to the candidate layout geometry, defined by the transformations φ_o and φ_p; therefore, the reference point cloud becomes {g_u^o} ∪ φ_o^{−1}(φ_p({g_v^p})). If V^{op} represents the principal components of such cloud, then the shape ordering is computed as σ_uv^{op} ≐ sign(V^{opT}(φ_o^{−1}(φ_p(g_v^p)) − g_u^o)). The idea of shape ordering can be applied to the feature locations of the test image as well, and it is done in this way. First, we need to identify the point cloud with respect to which we want to compute the ordering. This is done by picking a candidate object o, and by using the transformation φ_o that maps the borders of the exemplar image o into a bounding box B_o in the test image I. Therefore, the shape ordering σ_ij^o of two feature locations (f_i, f_j) is computed with respect to the point cloud given by the feature locations inside the bounding box B_o, i.e., {f_i | f_i ⊂ B_o}. The expression is given by σ_ij^o ≐ sign(U^{oT}(f_i − f_j)), where U^o are the principal components of the set {f_i | f_i ⊂ B_o}. The shape ordering of the pair of feature locations (f_i, f_j) can also be computed with respect to a point cloud induced by multiple candidate objects. The case of two objects, o and p, is of particular interest to us. In this case the reference cloud is given by the feature locations that are either inside B_o or B_p, i.e., {f_i | f_i ⊂ B_o ∪ B_p}. If U^{op} represents the principal components of such cloud, then the shape ordering is computed as σ_ij^{op} ≐ sign(U^{opT}(f_i − f_j)). Mutual Shape Ordering. Going back to the example of Fig. 1, let us consider the cloud of SIFT locations from the person exemplar {g_u^o}, and from the bicycle exemplar {g_v^p}, and further assume that their candidate positions in the test image are defined through φ_o and φ_p. The shape ordering of the pair of features (g_u^o, g_v^p), located on the head and on the center of the rear wheel, is σ_uv^{op}. At the same time, let us consider the point cloud defined by the test image feature locations inside B_o and B_p. The shape ordering of the pair (f_i, f_j), located on the head and on the center of the front wheel,
is σ_ij^{op}. If the two clouds are coming from the same type of object, and if the segments f_i f_j and g_u^o g_v^p represent homologous regions of the point cloud, we should expect similar shape orderings. The mutual shape ordering aims at measuring this similarity by computing δ(σ_ij^{op}, σ_uv^{op}), which is simply the Kronecker delta between the first, and between the second, components of σ_ij^{op} and σ_uv^{op}. Therefore, when δ(σ_ij^{op}, σ_uv^{op}) = [0, 0]^T there is maximum shape agreement, and when δ(σ_ij^{op}, σ_uv^{op}) = [1, 1]^T there is minimum agreement.
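The following sketch spells out how the shape ordering of Eq. (3) and the mutual shape ordering could be computed with a PCA of the point cloud. It is a minimal illustration under our own naming, not the authors' code; note that the returned "delta" is the component-wise disagreement indicator, so [0, 0] means full agreement as in the text above.

```python
# Shape ordering and mutual shape ordering from 2-D point clouds.
import numpy as np

def principal_axes(points):
    """Principal components (2x2 matrix, columns v1 and v2) of a 2-D cloud."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt.T                                          # columns are v1, v2

def shape_ordering(p_a, p_b, cloud):
    """sigma_{ab} = sign(V^T (p_b - p_a)) with respect to the given cloud."""
    V = principal_axes(cloud)
    return np.sign(V.T @ (p_b - p_a))                    # 2-vector in {-1, 0, +1}

def mutual_shape_ordering_disagreement(sigma_test, sigma_exemplar):
    """Component-wise complement of the Kronecker delta: [0, 0] = agreement."""
    return (sigma_test != sigma_exemplar).astype(int)
```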
Energy Term. Finally, in order to assign the matches X_i = (o, u) and X_j = (p, v) (i.e., f_i is assigned to g_u^o, and f_j to g_v^p), the pairwise energy term is defined as

W(X_i, X_j, Y) \doteq 1 - e^{-\beta\,\left|(\mathbf{f}_j - \mathbf{f}_i)^{T} U^{op}\right|\,\delta(\sigma_{ij}^{op},\, \sigma_{uv}^{op})},   (4)
where β > 0 is a scaling factor, and the notation | · | in the exponent is intended to take the absolute values of each of the components of the vector inside. Essentially, we are penalizing mismatch in the mutual shape ordering by the amount of distance that has to be made up between fi and fj on the principal axes in order to obtain shape ordering agreement. Inference on the CRF is done with the Belief Propagation approach described in [33]. Notice that the shape ordering and the mutual shape ordering are quantities that are very robust to noise (see also Fig. 2). This is a very desirable property that enables the model to handle a fair amount of variations in shape due to intraclass variations and viewpoint changes.
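For concreteness, a minimal sketch of the two energy terms is given below. The values of μ, ν and β are placeholders of ours (the paper does not report them here), and the Bhattacharyya distance assumes histogram-like descriptors normalized to sum to one.

```python
# Unary (Eq. 2) and pairwise (Eq. 4) energy terms of the CRF.
import numpy as np

def bhattacharyya(d1, d2, eps=1e-12):
    # Assumes the descriptors are normalized to sum to one.
    bc = np.sum(np.sqrt(np.clip(d1, 0, None) * np.clip(d2, 0, None)))
    return -np.log(bc + eps)

def unary_cost(f_pos, f_desc, g_pos_mapped, g_desc, mu=20.0, nu=1.0):
    """Eq. (2): spatial proximity penalty plus appearance dissimilarity penalty."""
    spatial = 1.0 - np.exp(-np.sum((f_pos - g_pos_mapped) ** 2) / (2.0 * mu ** 2))
    return spatial + nu * bhattacharyya(f_desc, g_desc)

def pairwise_cost(f_i, f_j, U_op, delta_disagreement, beta=0.05):
    """Eq. (4): penalize mutual-shape-ordering disagreement by the distance
    along the principal axes that would have to be made up."""
    proj = np.abs(U_op.T @ (f_j - f_i))                  # |(f_j - f_i)^T U^{op}|
    return 1.0 - np.exp(-beta * float(proj @ delta_disagreement))
```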
4 Final Object Layout Given the matching assignments we now wish to reject the M − N outlier objects to obtain the final layout. For a given object o, let us consider the set of matches {fi ↔ guo }, such that every fi is located inside Bo . In order to establish whether o is an outlier, from the set of matches we build two sets: one containing only the feature locations in the test image, and the other one containing only the feature locations in the exemplar object. By using the shape ordering within each set as a comparison criterion between a pair of feature locations, it is possible to comparison sort the feature locations of each set along each of the principal axis. This leads to the formation of four ordered sequences F1 , F2 , G1 , G2 . For example, with reference to Fig. 2: F1 = f7 ≺1 f5 ≺1 f8 ≺1 f6 ≺1 f2 ≺1 f4 ≺1 f3 ≺1 f1 ; F2 = f5 ≺2 f2 ≺2 f3 ≺2 f6 ≺2 f1 ≺2 f4 ≺2 f7 ≺2 f8 ; G1 = g7 ≺1 g5 ≺1 g8 ≺1 g6 ≺1 g3 ≺1 g4 ≺1 g2 ≺1 g1 ; G2 = g5 ≺2 g3 ≺2 g2 ≺2 g6 ≺2 g4 ≺2 g7 ≺2 g1 ≺2 g8 . We exploit these four sequences to perform a final rejection of the matches that do not show full shape agreement, and/or might come from background clutter. This is done by computing the longest increasing subsequence [34] between F1 and G1 , and between F2 and G2 . In our example they become 7 ≺1 5 ≺1 8 ≺1 6 ≺1 3 ≺1 1, and 5 ≺2 3 ≺2 6 ≺2 4 ≺2 7 ≺2 8, respectively. The intersection between the two resulting subsequences yields the final set of estimated matches. In . our example this is given by M = {f7 ↔ g7 , f5 ↔ g5 , f8 ↔ g8 , f6 ↔ g6 , f3 ↔ g3 }. Finally, we say that object o is part of the layout LN if the number of matches |M| ≥ γ,
Fig. 2. Shape ordering. Illustration of the notion of shape ordering with a synthetic example. The left image depicts the exemplar of an object o. v_1 and v_2 are the principal axes for o. The right image shows only the portion of the test image inside the bounding box B_o. u_1 and u_2 are the principal axes of the test image features located inside B_o, and the indices indicate matches (e.g., g_1 would be matched with f_1). For the pair (g_4, g_8) the shape ordering is σ_48^o = [−1, +1]^T. Similarly, the corresponding features in the test image (f_4, f_8) have the same ordering [−1, +1]^T. Therefore, the mutual shape ordering between them is a perfect match of [0, 0]^T. This is despite the fact that the motorbikes are taken from different viewpoints, illustrating how the shape ordering can tolerate a certain amount of viewpoint and intraclass variation.
where γ is a pre-specified required number of matches. While the optimal value of γ can certainly be obtained by training, we have found that setting γ between 8 and 10 yields very good performance. This can be understood by the fact that a sufficient number of matches implies a fairly low possibility of shape disagreement in rigid objects. For a good illustration of such a procedure, refer to Fig. 2, which purposely contains two “slight” mismatches, i.e., f_2 ↔ g_2 and f_3 ↔ g_3, of which one can be filtered by the procedure. Lastly, we fit a 2D affine transformation to the final set of matches M, mapping the features from the exemplar domain to the test image domain. This allows refining the estimation of the position p_o and size s_o of the object o.
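The sketch below illustrates the longest-increasing-subsequence filtering of Sec. 4 under the assumption that matches are given as aligned arrays of test and exemplar locations; the helper names and the choice γ = 9 (within the 8–10 range quoted above) are ours.

```python
# LIS-based filtering of feature matches along the principal axes.
import numpy as np
from bisect import bisect_left

def lis_indices(seq):
    """Indices of one longest strictly increasing subsequence of seq."""
    tails, tails_idx, prev = [], [], [-1] * len(seq)
    for i, x in enumerate(seq):
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x); tails_idx.append(i)
        else:
            tails[k] = x; tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    out, i = [], tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(i); i = prev[i]
    return set(out)

def filter_matches(test_pts, exemplar_pts, U, V, gamma=9):
    """test_pts[k] is matched to exemplar_pts[k]; U and V hold the principal
    axes (as columns) of the test-image and exemplar clouds.  Returns the set
    of surviving match indices, or an empty set if the object is rejected."""
    keep = None
    for axis in (0, 1):
        order_g = np.argsort(exemplar_pts @ V[:, axis])   # exemplar ordering
        rank_in_g = np.empty(len(order_g), dtype=int)
        rank_in_g[order_g] = np.arange(len(order_g))
        order_f = np.argsort(test_pts @ U[:, axis])       # test-image ordering
        surviving = {order_f[i] for i in lis_indices(rank_in_g[order_f].tolist())}
        keep = surviving if keep is None else keep & surviving
    return keep if keep is not None and len(keep) >= gamma else set()
```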
5 Experiments In order to be comparative, we evaluate our approach on the PASCAL Visual Object Classes (VOC) Challenge 2007 dataset [31], on which many state-of-the-art methods have been evaluated. These include in particular the work in [2], which tries to solve the same multi-class object layout problem by modeling the object interactions. The PASCAL 2007 dataset is widely agreed to be a very challenging dataset, consisting of about 10000 images of 20 different object classes. Occlusions are common in these images, making it especially challenging. Roughly half of the images are used for training while the other half are used for testing. For the training set, there are approximately 16000 objects in about 5000 images. We generate exemplars out of them by cropping the 16000 objects out of the images, based on the bounding box annotations that come along with the PASCAL dataset, and adding them to our exemplar
Fig. 3. Multi-class results. PR curves for four different runs with different numbers, M, of retrieved exemplar objects. Our AP scores of 0.277, 0.319, 0.425 and 0.359 for M = 3, 5, 8, 10 outperform that of [2], which reports an AP of 0.272.

Table 1. Per-class results. The best per-class results among our approach, the approach in [2], and the approach in [3] are in bold. The mAP of our approach shows significant improvement. In general, it can be observed that object classes with rich shape context, as described by the shape ordering descriptor, do very well.

Class          Approach [3]   Approach [2]   Our approach
Aeroplane      0.338          0.288          0.508
Bicycle        0.430          0.562          0.410
Bird           0.097          0.032          0.560
Boat           0.096          0.142          0.333
Bottle         0.187          0.294          0.395
Bus            0.419          0.387          0.310
Car            0.504          0.487          0.625
Cat            0.150          0.124          0.250
Chair          0.146          0.160          0.180
Cow            0.239          0.177          0.367
Diningtable    0.151          0.240          0.227
Dog            0.154          0.117          0.273
Horse          0.482          0.450          0.667
Motorbike      0.417          0.394          0.381
Person         0.202          0.355          0.520
Pottedplant    0.161          0.152          0.537
Sheep          0.212          0.161          0.581
Sofa           0.203          0.201          0.300
Train          0.291          0.342          0.506
Tvmonitor      0.382          0.354          0.455
mAP            0.263          0.253          0.419
database. Following the PASCAL evaluation protocol, a detection is considered correct if its bounding box overlaps 50% or more with the groundtruth bounding box. Multi-Class Results. In four different runs, we retrieved M = 3, 5, 8, 10 objects from the database, based on the vocabulary tree. Together with the bounding boxes localized on the test image, they form the candidate object layout on which inference is performed.
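The 50% overlap rule mentioned above follows the standard PASCAL criterion, i.e., intersection-over-union of at least 0.5 between the detected and ground-truth boxes; written out explicitly for reference:

```python
# PASCAL-style correctness check for a detected bounding box.
def is_correct_detection(det, gt, threshold=0.5):
    """det and gt are (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(det[2], gt[2]) - max(det[0], gt[0]))
    iy = max(0.0, min(det[3], gt[3]) - max(det[1], gt[1]))
    inter = ix * iy
    union = ((det[2] - det[0]) * (det[3] - det[1]) +
             (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return union > 0 and inter / union >= threshold
```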
Fig. 4. Qualitative evaluation. In each subfigure, the upper image is the test image while the lower images are the retrieved exemplars. (a) The first horse exemplar retrieved shows the right side of a horse but was used to successfully detect the left side of a horse in the test image. The SIFT descriptors used during classification are invariant to such “geometrical flip”, which means that for a somewhat symmetrical object such as a horse, where the left and right side are similar to each other, both sides could have been retrieved. Even so, the shape ordering descriptor, being isotropic to geometrical flip, was able to do the right thing. In the second exemplar retrieved, the cropping, meant to generate a horse exemplar, has to include the whole horse, and consequently the person in a frontal pose. The person is actually successfully matched with the rider in the test image. (b) Another example of the shape ordering descriptor’s invariance to geometrical flip in the detection of the horses. The people in the test image were not detected, constituting false negatives. (c) To what degree can the shape ordering descriptor deal with viewpoint variations? Here, the retrieved exemplar is the backside of a horse, which proves to have too much viewpoint variation with respect to the right view of the backside of the horse in the test image, triggering the CRF to declare a no-match decision. (d) The viewpoint variation was overcome by the CRF, and with the bounding box intersecting more than 50% of the groundtruth bounding box, this was declared a true positive. (e) Although the localized bounding box for the first plant exemplar was too big (shown as the yellow bounding box in the test image before refinement), the CRF was able to discard feature matchings that do not agree with the shape of the exemplar, and correctly produce a true positive. The second exemplar was used to correctly detect the plant in the lower right corner of the test image. This is in spite of the plant in the test image being partially occluded because the shape ordering descriptors agree enough to trigger a true positive. (f) This is probably one of the best demonstrations of how a good exemplar retrieved, together with good localization, make it fairly straightforward for the CRF to declare correctly a true positive for the rider and bike on the right side of the test image. Unfortunately, the shape of the rider on the left side of the test image was confused with a plant exemplar.
For each run, we follow the multi-class scoring methodology in [2] by pooling detections across classes, and computing the Average Precision (AP) according to the VOC protocol. We show the precision-recall (PR) curves in Fig. 3, with AP scores of 0.277, 0.319, 0.425 and 0.359 for each of the four runs respectively. For the PASCAL 2007 dataset, our algorithm peaks at M = 8 with an AP of 0.425, which is a great improvement over an AP of 0.272 for the multi-class experiments conducted in [2]. Per-Class Results. We also provide the per-class AP results in Table 1. Our results show very good performance: our AP is higher than that of [2] on all but four classes. The mean Average Precision (mAP) over these classes is 0.419, which is a considerable improvement over an mAP of 0.253, achieved by [2]. Moreover, these results help to answer the question of how relevant the CRF is on top of the image classification and localization stage. We first compare our results with those reported in [3]. Our mAP of 0.419 is a significant improvement over the best mAP result of approximately 0.263 obtained over a different number of sliding windows by the work in [3]. Contrasting our results with those reported in [3] is important. The reason is that [3], being a state-of-the-art sliding window approach based on detectors trained on HOG and appearance features, bears resemblance to our HOG based localization stage. Similarly, we borrow the per-class results reported by [35] on the PASCAL 2007 dataset for the same unsupervised image classification technique [22] we use. The best reported mAP is approximately 0.3, which represents the performance of a detector that was trained based on descriptors extracted from a hierarchical vocabulary tree. This is in contrast to the mAP of 0.419 our approach is able to achieve. In both comparisons, even though the results in those works pertain to trained detectors, they give a good indication that the CRF is able to boost performance significantly on top of the image classification and localization stage. Qualitative Evaluation. Finally, we show and explain some of the interesting detection results qualitatively. Fig. 4 shows different examples that are meant to illustrate the following: (i) the robustness of the shape ordering descriptor to viewpoint variations, (ii) what might be too much viewpoint variation for the shape ordering descriptor, (iii) how tolerant the CRF model is to inaccuracy in the localization, and (iv) cases of false positives and false negatives, due also to shape confusion. Please refer to the caption of Fig. 4 for details.
6 Conclusions This work proposes an alternative solution to the object classification and localization problem. The solution avoids the effort involved in training a suite of object classifiers by adopting unsupervised clustering for building a dictionary from which, given a test image, proposed object classes can be retrieved efficiently. The proposed object classes, together with their locations, are evaluated robustly with a CRF model that is based on a newly proposed shape ordering. As a consequence of utilizing such a shape ordering, it becomes fairly straightforward to decide whether to accept or reject the list of proposed object locations.
The resulting approach has minimum training requirements, and is able to produce a remarkable performance improvement on the challenging PASCAL 2007 dataset, showing that it is able to cope effectively with very large intraclass and viewpoint changes. This confirms the utility of such approach for practical purposes. Acknowledgements. This report was prepared by GE GRC as an account of work sponsored by Lockheed Martin Corporation. Information contained in this report constitutes technical information which is the property of Lockheed Martin Corporation. Neither GE nor Lockheed Martin Corporation, nor any person acting on behalf of either; a. Makes any warranty or representation, expressed or implied, with respect to the use of any information contained in this report, or that the use of any information, apparatus, method, or process disclosed in this report may not infringe privately owned rights; or b. Assume any liabilities with respect to the use of, or for damages resulting from the use of, any information, apparatus, method or process disclosed in this report.
References 1. Choi, M., Lim, J., Torralba, A., Willsky, A.: Exploiting hierarchical context on a large database of object categories. In: CVPR (2010) 2. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for multi-class object layout. In: ICCV, pp. 229–236 (2009) 3. Harzallah, H., Jurie, F., Schmid, C.: Combining efficient object localization and image classification. In: ICCV, pp. 237–244 (2009) 4. Heitz, G., Koller, D.: Learning spatial context: Using stuff to find things. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 30–43. Springer, Heidelberg (2008) 5. Ladicky, L., Sturgess, P., Alahari, K., Russel, C., Torr, P.H.S.: What, where and how many? Combining object detectors and cRFs. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 424–437. Springer, Heidelberg (2010) 6. Li, L.J., Fei-Fei, L.: What, where and who? classifying events by scene and object recognition. In: ICCV (2007) 7. Lim, S.N., Doretto, G., Rittscher, J.: Object constellations: Scalable, simultaneous detection and recognition of multiple specific objects. In: Workshop on Cognitive Vision in conjunction with ECCV (2010) 8. Mutch, J., Lowe, D.: Object class recognition and localization using sparse features with limited receptive fields. IJCV 80 (2008) 9. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: CVPR (2008) 10. Yao, B., Fei-Fei, L.: Modeling mutual context of object and human pose in human-object interaction activities. In: CVPR (2010) 11. Yeh, T., Lee, J., Darrell, T.: Fast concurrent object localization and recognition. In: CVPR, pp. 280–287 (2009) 12. Fergus, R., Perona, P., Zisserman, A.: Object class recognition by unsupervised scaleinvariant learning. In: CVPR (2003) 13. Perronin, F.: Universal and adapted vocabularies for generic visual categorization. IEEE TPAMI 30, 1243–1256 (2008) 14. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: a comprehensive study. IJCV 73, 213–238 (2007)
15. Bourdev, L., Maji, S., Brox, T., Malik, J.: Detecting people using mutually consistent poselet activations. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6316, pp. 168–181. Springer, Heidelberg (2010) 16. Dalal, N., Triggs, B.: Histogram of oriented gradients for human detection. In: ICCV, vol. 1, pp. 886–893 (2005) 17. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part based models. IEEE TPAMI 32, 1627–1645 (2010) 18. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on riemannian manifolds. In: CVPR, pp. 1–8 (2007) 19. Viola, P., Jones, M.J.: Robust real-time face detection. IJCV 57, 137–154 (2004) 20. Berg, T., Forsyth, D.: Animals on the web. In: CVPR (2006) 21. Chum, O., Philbin, J., Sivic, J., Isard, M., Zisserman, A.: Total recall: Automatic query expansion with a generative feature model for object retrieval. In: ICCV (2007) 22. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: CVPR, vol. 2, pp. 2161–2168 (2006) 23. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007) 24. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: CVPR (2008) 25. Gammeter, S., Bossard, L., Quack, T., van Gool, L.: I know what you did last summer: Object-level auto-annotation of holiday snaps. In: ICCV, pp. 614–621 (2009) 26. Kumar, S., Hebert, M.: Discriminative random fields. IJCV 68, 179–202 (2006) 27. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: ICML, pp. 282–289 (2001) 28. Divvala, S.K., Hoiem, D., Hays, J.H., Efros, A.A., Hebert, M.: An empirical study of context in object detection. In: CVPR (2009) 29. Galleguillos, C., McFee, B., Belongie, S., Lanckriet, G.: Multi-class object localization by combining local contextual interactions. In: CVPR, pp. 113–120 (2010) 30. Lampert, C.H., Blaschko, M.B., Hofmann, T.: Beyond sliding windows: Object localization by efficient subwindow search. In: CVPR (2008) 31. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results (2007), http://www. pascal-network.org/challenges/VOC/voc2007/workshop/index.html 32. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60, 91–110 (2004) 33. Felzenszwalb, P., Huttenlocher, D.: Efficient belief propagation for early vision. IJCV 70, 41–54 (2006) 34. Fredman, M.L.: On computing the length of the longest increasing subsequences. Discrete Mathematics 11, 29–35 (1975) 35. Bengio, S., Pereira, F., Singer, Y., Strelow, D.: Group sparse coding. In: NIPS (2009)
Efficient Detection of Consecutive Facial Expression Apices Using Biologically Based Log-Normal Filters Zakia Hammal The Robotics institute of the Carnegie Mellon University CMU, 4323 Sennott Square, 210 S. Bouquet Street, Pittsburgh, PA 15260, USA [email protected]
Abstract. The automatic extraction of the most relevant information in a video sequence made of continuous affective states is an important challenge for efficient human-machine interaction systems. In this paper a method is proposed to solve this problem based on two steps: first, the automatic segmentation of consecutive emotional segments based on the response of a set of Log-Normal filters; secondly, the automatic detection of the facial expression apices based on the estimation of the global face energy inside each emotional segment independently of the undergoing facial expression. The proposed method is fully automatic and independent from any reference image such as the neutral at the beginning of the sequence. The proposed method is the first contribution for the summary of the most important affective information present in a video sequence independently of the undergoing facial expressions. The robustness and efficiency of the proposed method to different data acquisition and facial differences has been evaluated on a large set of data (157 video sequences) taken from two benchmark databases (Hammal-Caplier and MMI databases) [1, 2] and from 20 recorded video sequences of multiple facial expressions (between three to seven facial expressions per sequence) in order to include more challenging image data in which expressions are not neatly packaged in neutral-expression-neutral. Keywords: Facial expressions, Apices, Video affect summary, Log-Normal filters.
1 Introduction One main challenge of human affect based interaction is the ability to efficiently analyze and respond to the user's needs based on their affective states. This efficiency can be reached by the automatic extraction of the most relevant information in video sequences, such as the consecutive facial expression apices, rather than the analysis of all the redundant and non-informative images. For example, in the context of pain monitoring, the automatic extraction of the consecutive apices could be of considerable value in certain circumstances where videotaped patients are not able to communicate pain verbally (e.g., a monitored waiting room, injured people). An automatic and subject-based summary of the most informative images would definitely help medical personnel to efficiently analyze the extracted information (rather than analyzing all the videos) and then take a
decision accordingly. However, despite great efforts of the computer vision community (see [3, 4] for a review), the efficient recognition of facial expressions in everyday-life environments remains a critical task. One main problem in everyday-life environments is the difficulty of separating consecutive displays of more than one facial expression in video sequences. Few efforts have been made for the analysis of uninterrupted sequences of several expressions [5-8]. However, with the exception of [8], none of these works explicitly evaluated the automatic separation of multiple facial expressions in multi-expression sequences. Based on these considerations we propose to investigate the efficiency of the automatic detection of the most informative facial behavior, called Key Affective States (i.e., the set of consecutive beginning, apex and end images), in video sequences independently of the undergoing facial expressions. The summary of the affective sequences will then consist of the set of Key Affective States that occurred in the video. The proposed method is based on two main steps: first, an automatic SVM based segmentation of consecutive emotional segments (i.e., sets of consecutive frames corresponding to facial muscle activation compared to a relaxation state) is applied. This segmentation is based on multiscale spatial filtering using biologically based Log-Normal filters [8, 9]. Second, the automatic detection of the apices of each emotional segment is made based on the analysis of the corresponding global face energy behavior. The automatic recognition of each consecutive facial expression can be made based on the analysis of the detected key affective frames. The proposed method is based on the global face energy analysis and is therefore independent of a specific facial expression or Action Units (AUs). This characteristic allows the proposed method to be generalized to non-prototypic and naturalistic facial behavior.
2 Key Affective States Detection Based on Log-Normal Filtering The automatic and consecutive detection of the Key Affective States corresponds to the extraction of the beginning-apex-end states of an affective video sequence, independently of the undergoing facial expressions. Fig.1 shows the steps of the proposed processing. First, the automatic segmentation of the input video into consecutive emotional segments (i.e. all the frames between the beginning and the end of each separate facial expression) is done based on a set of features, measured by Log-Normal filters [9]. Second, the analysis of the evolution of the global face energy inside the detected emotional segments allows the detection of the corresponding apices. 2.1 Automatic Segmentation of Consecutive Emotional Segments In each video sequence, each group of expressive images is automatically gathered in separate emotional segments by the detection of the corresponding pairs of beginning and end (see Fig. 1). To do so, state of the art systems (i.e. for the analysis of uninterrupted sequences of several expressions) usually make the assumption that the first images in the sequence correspond to neutral expression (e.g. [5-8]). However, this assumption is usually violated in the case of natural interaction and requires a manual intervention. To overcome this problem, here, each image is first individually classified as expressive vs. non-expressive using a SVM classifier trained on a set of extracted facial features using Log-Normal filters.
Features Extraction. Facial expressions induce the appearance of a set of frequencies and orientations due to facial muscle activation [13, 8]. This activation can be measured by the Log-Normal filters [9], which measure the energy displayed by the face at different frequency bands and orientations (see Fig. 1.a, filtered images). Compared to the more standard Gabor filters, the Log-Normal filters better sample the power spectrum and are easily tuned and separable in frequencies and orientations [9], which makes them well suited for detecting features at different scales and orientations [8]. To do that, the studied face is first automatically detected in video streams and tracked in the remainder of the sequence [10]. To cope with the problem of illumination variation, a preprocessing stage based on a model of the human retina [14] is applied to each detected face. The power spectrum of each filtered face area is then multiplied by a bank of Log-Normal filters (15 orientations and 7 central frequencies, leading to the definition of 105 features) defined as follows:

G_{i,j}(f, \theta) = G_i(f)\,G_j(\theta) = A\,\frac{1}{f}\,\exp\!\left(-\frac{1}{2}\left(\frac{\ln(f/f_i)}{\sigma_f}\right)^{2}\right)\exp\!\left(-\frac{1}{2}\left(\frac{\theta-\theta_j}{\sigma_\theta}\right)^{2}\right),   (1)

where G_{i,j} is the transfer function of the filter at the i-th frequency and the j-th orientation, G_i(f) and G_j(θ) represent the frequency and the orientation components of the filter, respectively; f_i is the central frequency, θ_j the central orientation (θ_j = (180/15)·(j − 1)), σ_f the frequency bandwidth (σ_f = ln(f_{i+1}/f_i)), σ_θ the orientation bandwidth, and A a normalization factor. The factor 1/f ensures that the sampling of the spectral information of the face takes into account the specific distribution of energy of the studied face at different scales. Fig. 1.a shows some examples of the filtering process. In order to cope with different acquisition conditions, a normalization is applied to the set of extracted features to make the analysis independent of the intensity of the face spatial frequency. The normalization consists in dividing the response of all the filters sharing the same central frequency by the sum of their responses over all the orientations (see equation 2). This operation leads to a uniform distribution of the energy in each frequency band:

G_{i,j}(f, \theta) = \frac{G_{i,j}(f, \theta)}{\sum_{j=1..15} G_{i,j}(f, \theta) + \varepsilon},   (2)
where ε is a non-linear term (defined and validated over the used databases) that allows discarding very low-energy filter responses [9]. Fig. 1.b shows some examples of the obtained features. Each example represents the 105 Log-Normal filter responses organized as a row vector (the 15 filter responses corresponding to the same frequency at all the orientations are grouped together, followed by the next 15 responses for the next frequency). Based on the obtained results, a machine learning method, based on the analysis of the 105 extracted features, is used for the automatic classification of each individual image. Support Vector Machine Based Classification. A radial basis function (RBF) SVM is employed to recognize each separate image as expressive vs. non-expressive. The
SVM based approach allows taking into account the facial energy, which is higher in the case of expressive compared to non expressive faces [8] as well as the distribution of this energy over all the frequency and orientation bands.
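As a rough numpy illustration of the filter bank of Eqs. (1)-(2), the sketch below builds the 7 x 15 bank in the Fourier domain and computes per-frequency-normalized energies. The frequency range, the value of σ_θ, and the omission of the normalization factor A are our own simplifications, not the settings used in the paper.

```python
# Minimal Log-Normal filter bank and 105-dimensional feature extraction.
import numpy as np

def log_normal_bank(size=128, n_freq=7, n_orient=15, f_min=0.02, f_max=0.25):
    fy, fx = np.meshgrid(np.fft.fftfreq(size), np.fft.fftfreq(size), indexing="ij")
    f = np.sqrt(fx ** 2 + fy ** 2) + 1e-9
    theta = np.degrees(np.arctan2(fy, fx)) % 180.0
    f_centers = np.geomspace(f_min, f_max, n_freq)
    sigma_f = np.log(f_centers[1] / f_centers[0])        # sigma_f = ln(f_{i+1}/f_i)
    sigma_t = 180.0 / n_orient                           # illustrative bandwidth
    bank = np.empty((n_freq, n_orient, size, size))
    for i, fi in enumerate(f_centers):
        radial = (1.0 / f) * np.exp(-0.5 * (np.log(f / fi) / sigma_f) ** 2)
        for j in range(n_orient):
            tj = (180.0 / n_orient) * j                  # theta_j, 0-based index here
            d = (theta - tj + 90.0) % 180.0 - 90.0       # wrapped angular difference
            bank[i, j] = radial * np.exp(-0.5 * (d / sigma_t) ** 2)
    return bank

def filter_energies(face_gray, bank):
    """Energy response of every filter, normalized per frequency band (Eq. 2)."""
    spec = np.abs(np.fft.fft2(face_gray)) ** 2
    e = np.array([[np.sum(spec * g ** 2) for g in row] for row in bank])
    return e / (e.sum(axis=1, keepdims=True) + 1e-9)     # 7 x 15 = 105 features
```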
Fig. 1. Automatic video affect summary: a) The input consists of a video sequence of 360 images corresponding to three consecutive facial expressions; b) Example of the 105 LogNormal filters’ responses on the 3 examples of expressive faces presented in a. The vectors are used as input to a SVM classifier. c) Example of 3 extracted emotional segments. d) The global face energy signal is extracted from each segment. e) The output consists of the 3 consecutive beginning-apex-end set of images corresponding to the three key affective states present in the input video.
The generalization of the SVM based model for the recognition of expressive images to new data was tested using 3-fold cross-validation, in which all images of the test set were excluded from training (see Section 3 for the obtained performances on 3 different databases). Fig. 2 shows 3 examples of the SVM based classification of video sequences from the Hammal-Caplier database [1], the MMI database [2], and a multi-expression database, respectively. In each single-expression video sequence the subject starts with the neutral expression, evolves to the apex of the expression, and comes back to the neutral state. In the multi-expression sequences the subjects do not necessarily come back to the neutral state between two facial expressions, but to an intermediate relaxation state. The proposed method associates the value 1 with expressive images and 0 with non-expressive images. One can see the appearance of consecutive segments of expressive images (ones) separated by segments of non-expressive images (zeros). Based on the obtained results the expressive images are gathered in separate emotional segments by the detection of the corresponding pairs of beginning and end. Beginnings and Ends Detection. The beginning of each expressive segment corresponds to the transition from non-expressive to expressive image (transition
Fig. 2. Examples of the automatic classification result of expressive (ones) vs. non-expressive (zeros) video images (first line) and the evolution of the corresponding global face energy (second line) from three different databases: a) an example sequence from the Hammal-Caplier database, b) from the MMI database, and c) from multi-expression data, respectively. Dashed lines correspond to the automatically detected beginnings and ends of each emotional segment (one detected segment for the Hammal-Caplier and MMI databases and five detected segments for the multi-expression sequence). Black and red lines (dark and light in black-and-white prints) in the second line represent the corresponding manual detection. The reported dots correspond to the detected apices.
from zero to one in the first line of Fig. 2) and each end corresponds to the transition from expressive to non-expressive image (transition from one to zero in the first line of Fig. 2). Thanks to this transition-based detection, the method does not require a reference image, such as a neutral image, at the start of the sequence. This process is repeated from the beginning to the end of the sequence. Fig. 2 shows examples of consecutively detected beginnings (dashed bottom-up lines) and ends (dashed top-down lines) in the case of single- (Fig. 2.a and 2.b) and multi-expression sequences (Fig. 2.c). Tested on a large number of video sequences, the estimated beginning and end of each separate emotional segment, on single as well as multiple facial expression sequences, have been compared to ground truth and led to very good performances (see Section 3). Importantly, the proposed method is independent of the undergoing facial expressions. As a consequence, any of the already proposed static or dynamic models for affect recognition can be successfully applied to recognize the undergoing facial expression or Action Unit (AU) on each individual emotional segment. 2.2 Apices Detection
In order to extract the key frames relative to the affective content of the video, we propose a technique to automatically extract the apex of the facial expression between each previously detected pair of beginning and end. Facial expressions lead to a change of the global face energy compared to a face relaxation state. One can see in Fig. 1.a that facial feature behaviors effectively induce a change of the measured face energy (high-energy responses around the facial features in the filtered images). The global face energy, corresponding to the sum of the energy responses of the Log-Normal filters across all the frequencies and all the orientations, is then computed as:
E_{\mathrm{global}} = \sum_{i}\sum_{j=1..15} \| S_{\mathrm{frame}}(f, \theta) * G_{i,j}(f, \theta) \|^{2},   (3)
where E_global is the global face energy and S_frame(f, θ) the power spectrum of the current frame. The second line of Fig. 2 shows examples of the temporal evolution of the measured global face energy (using equation 3) during video sequences displaying between one and five consecutive facial expressions from three different databases. One can see the evolution of the global face energy and its return to a reference value between consecutive relaxation or transition states. The analysis of the correlation between the obtained curves and the corresponding images (over the used databases) shows that the maximum change of the global face energy inside each emotional segment, compared to the relaxation state, is reached at the apices (i.e., during one single image or a plateau; see Fig. 2, second line). The apex of each facial expression is then detected as the maximum change of the global face energy inside each emotional segment. In the second line of Fig. 2, the red (bold) points show examples of the obtained apices in these videos. Fig. 2.a and Fig. 3 show the images corresponding to the detected apices as well as the corresponding beginnings and ends for a Hammal-Caplier and a multi-expression sequence, respectively.
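A small sketch of this apex-selection step is given below: segment boundaries come from the expressive/non-expressive transitions, and inside each segment the frame that maximizes the deviation of the global energy from the relaxation level just before onset is taken as the apex. The baseline choice and all names are illustrative assumptions of ours.

```python
# Segment extraction from 0/1 labels and apex detection from global energy.
import numpy as np

def detect_segments(labels):
    """(begin, end) frame indices from a 0/1 expressive label sequence."""
    segments, start = [], None
    for t, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = t
        elif lab == 0 and start is not None:
            segments.append((start, t - 1)); start = None
    if start is not None:
        segments.append((start, len(labels) - 1))
    return segments

def detect_apices(global_energy, segments):
    """One apex frame per segment: maximum deviation from the energy level
    observed just before the segment's onset (taken as the relaxation level)."""
    apices = []
    for begin, end in segments:
        baseline = global_energy[max(0, begin - 1)]
        seg = np.asarray(global_energy[begin:end + 1])
        apices.append(begin + int(np.argmax(np.abs(seg - baseline))))
    return apices
```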
Fig. 3. Video affect summary: result of the automatically detected beginning, apex, and end images of a sequence of five consecutive facial expressions from the multi-expression database.
The obtained results show the efficiency of the proposed method at summarizing the key affect images independently of the undergoing facial expressions. For example, from the 1000 images of the multi-expression video sequence of Fig. 2 only fifteen key images are picked to summarize the five consecutively expressed facial expressions, independently of their nature. In the case of video monitoring, these results are sufficient to inform an independent observer about the temporal evolution of the affective states of the monitored subject without having to look back at the whole video.
3 Simulation Results The robustness of the proposed method for video affect summary has been evaluated based on intensive tests on three dynamic facial expression databases: first on single
facial expression sequences from the Hammal-Caplier database (63 sequences of 120 images per sequence, [1]) and the MMI database (74 sequences of more than 80 images per sequence, [2]). A total of 137 sequences of the six basic facial expressions: happiness, surprise, disgust, fear, sadness and anger. Secondly, on 20 multi-expression sequences, of more than 1000 images per sequence, recorded in our laboratory (with a mean number of 3 consecutive facial expressions per sequence). The 157 video sequences (137 single and 20 multi) have been first manually labeled (beginnings, apices, and ends) as ground-truth with which the obtained performances are compared. To evaluate the precision of the automatic detection of the key affect images in video sequences, three main simulations are reported: first, the automatic detection of expressive vs. non-expressive images; second, the automatic detection of the beginning and end of each emotional segment; and finally, the automatic detection of the consecutive facial expression apices in video sequences. Table 1. Performances of the SVM based model for expressive face detection. CR = Classification Rate (%), PR = Precision (%), RC = Recall (%), F = F-measure (%)
          CR     PR     RC     F
RBF SVM   94.91  90.07  97.67  93.70
Classification rate (CR), Recall (RC), Precision (PR) and F-measure (F) are used for the evaluation of the SVM based method for expressive vs. non-expressive image detection. A 3-fold cross-validation loop, where all images of the test set were excluded from training, has been used to evaluate the generalization of the proposed method compared to ground truth (see Table 1). The obtained performance of F = 93.70% proves the robustness of the proposed method for the automatic detection of expressive images in single as well as multiple facial expression sequences. Table 2. Mean frame error of the detection of the beginning and the end of emotional segments, and the number of tested sequences. Berr = mean beginning frame error, Eerr = mean end frame error, #Seq = number of tested sequences.
                Berr       Eerr       #Seq.
Hammal-Caplier  2.5 (1.6)  5.1 (2.5)  63
MMI             2.7 (2.8)  3.5 (2.8)  74
Multi           1.1 (0.7)  3.7 (5.5)  20
Mean            2.1        4.1        157
To better evaluate the automatic separation of consecutive facial expressions, in the second simulation the performances of the detection of the beginning and the end of each emotional segment are measured separately and compared to the manual detection. The mean frame error (compared to the ground truth) of the beginning and end detections, and their standard deviations over all the sequences of the three used databases, are reported in Table 2. The obtained performances, especially on
the multi-expression data, show that the proposed method can be successfully used for a large number of consecutive facial expressions independently of their nature. Moreover, the obtained results show that even if the subject does not necessarily come back to the same neutral expression between emotional segments, the system is still able to successfully separate the different emotional segments. Over all the used data the obtained mean frame error is equal to 2.1 frames for the beginning and 4.1 frames for the end. Based on a minimum video frame rate of 24 frames/s, the obtained mean frame error (i.e., 3.1 frames) is lower than the shortest possible facial muscle activity duration (i.e., 0.25 s, or 6 frames [11]). This suggests that the obtained results are precise enough to detect the shortest facial muscle activity. To the best of our knowledge the only authors who explicitly report performances on the automatic separation of emotional segments are [8]. Compared to these authors, the proposed method, first, overcomes one main limitation of their method by being completely independent of the assumption that the sequence begins with the neutral state as a reference. Secondly, tested on larger multi-expression data (20 sequences rather than 6), the current model leads to better performances (i.e., a mean error of 3.1 frames compared to 8 frames). In the third simulation we investigate the performance of the automatic apex detection. The apex detection has also been measured against human manual detection on the 157 video sequences (137 single-expression and 20 multi-expression). However, compared to the detection of the beginning and the end of each emotional segment, it was difficult for the human observer to select one single image as the apex per expression. Indeed, the apex corresponds to a short duration where the facial features reach a maximum of deformation and thus often corresponds to several images and not only to one single image. Given these considerations we asked the human observer to select the range of images within which the apex of the undergoing expression is considered reached. Considering each detected apex inside the human-labeled interval as a good detection, the performances of the apex detection are reported in Table 3. Table 3. Apices detection performances. CR = Classification Rate (%).
     Hammal-Caplier   MMI   Multi-expression   Mean
CR   84.38            76    96.49              85.51
To summarize, the proposed key affect image detection method is able, for the first time in the literature, to sum up the most important information related to the undergoing affective state independently of the recognition of facial expression. Tested on a large number of benchmark sequences, the proposed method shows its robustness to data acquired under different conditions. These results are of particular interest for three main reasons: first, compared to the state of the art, the proposed method is completely independent of any manually defined reference image (i.e., there is no assumption that the beginning corresponds to the neutral state). Second, being based on a precise global face energy analysis, the proposed method is completely independent of any facial feature segmentation or facial expression classification errors for the automatic selection of the apices. Finally, the proposed method is independent of the
Fig. 4. Video affect summary of two spontaneous video sequences (induced emotions) from the MMI database [12]: result of the automatically detected beginning, apex, and end images and the corresponding manual detection (between brackets). Frames corresponding to the apices are not reported as they appeared inside the manually defined apex interval. First video sequence: detection of two consecutive happiness segments of different intensities (the first segment less intense than the second). Second video: detection of consecutive happiness and disgust expressions.
undergoing facial expression and therefore opens promising perspectives to be generalized to non-prototypic and spontaneous facial behaviour as long as this behaviour leads to facial muscle activation (see Fig. 4). This last claim is based on preliminary results on video sequences of audio-visual recordings of induced emotions from the MMI database [12]. In these videos a comedy and disgusting clips were shown to a number of participants, who displayed consecutive expressions of disgust, happiness, and surprise in response. Fig. 4 shows the results of the application of the proposed method to two videos of these naturalistic and spontaneous data. In the first video sequence of 1197 images, the proposed method was able to summarize the key affective states into 6 images corresponding to the two segments of expressed happiness at different intensities. In the second video sequence of 1943 images, the proposed method was able to separate consecutive happiness (segments 1-2 and 4-6) and disgust (segment 3) segments separated by intermediate states leading to 18 images. Intensive tests on larger data should be made to better evaluate the proposed method on spontaneous data.
4 Conclusion In this paper we successfully implemented a new method based on Log-Normal filters for automatic detection of consecutive affect key images (beginning-apex-end) in
video sequences. The experimental results on benchmark databases clearly show that our proposed descriptors are efficient for the dynamic separation between consecutive facial expressions in video sequences, independently of their nature. The proposed method thus extends the state of the art in automatic analysis and interpretation of the dynamics of facial expressions with an efficient and fully automatic method for video affect summary. Acknowledgments. (Portions of) the research in this paper uses the MMI-Facial Expression Database collected by Valstar and Pantic (www.mmifacedb.com).
References
1. Hammal, Z., Couvreur, L., Caplier, A., Rombaut, M.: Facial expressions classification: A new approach based on transferable belief model. International Journal of Approximate Reasoning 46(3), 542–567 (2007)
2. Pantic, M., Valstar, M.F., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. In: Proc. IEEE Int. Conf. ICME 2005, Amsterdam, The Netherlands (July 2005)
3. Pantic, M., Rothkrantz, L.J.M.: Automatic Analysis of Facial Expressions: The State of the Art. IEEE Trans. PAMI 22(12), 1424–1445 (2000)
4. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions. IEEE Trans. PAMI 31(1), 39–58 (2009)
5. Essa, I.A., Pentland, A.P.: Coding, Analysis, Interpretation, and Recognition of Facial Expressions. IEEE Trans. PAMI 19(7), 757–763 (1997)
6. Otsuka, T., Ohya, J.: Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences. In: Proc. IEEE Int. Conf. Image Processing, vol. 2, pp. 546–549 (1997)
7. Cohen, I., Cozman, F.G., Sebe, N., Cirelo, M.C., Huang, T.S.: Learning Bayesian network classifiers for facial expression recognition using both labeled and unlabeled data. In: Proc. IEEE CVPR (2003)
8. Hammal, Z., Massot, C.: Holistic and Feature-Based Information Towards Dynamic Multi-Expression Recognition. In: Proc. Int. Conf. VISIGRAPP, Angers, France (May 17-21, 2010)
9. Massot, C., Herault, J.: Model of Frequency Analysis in the Visual Cortex and the Shape from Texture Problem. International Journal of Computer Vision 76(2) (2008)
10. Hammal, Z., Eveno, N., Caplier, A., Coulon, P.-Y.: Parametric models for facial features segmentation. Signal Processing 86, 399–413 (2006)
11. Ekman, P., Friesen, W.V.: The Facial Action Coding System (FACS): A technique for the measurement of facial action. Consulting Psychologists Press, Palo Alto (1978)
12. Valstar, M.F., Pantic, M.: Induced Disgust, Happiness and Surprise: an Addition to the MMI Facial Expression Database. In: Proc. Int. Conf. LREC, Malta (May 2010)
13. Tian, Y.L., Kanade, T., Cohn, J.F.: Facial expression analysis. In: Li, S.Z., Jain, A.K. (eds.) Handbook of Face Recognition, pp. 247–276. Springer, New York (2005)
14. Beaudot, W.: Le traitement neuronal de l'information dans la rétine des vertébrés: Un creuset d'idées pour la vision artificielle. Thèse de Doctorat, INPG, TIRF, Grenoble, France (1994)
DTTM: A Discriminative Temporal Topic Model for Facial Expression Recognition Lifeng Shang, Kwok-Ping Chan, and Guodong Pan The University of Hong Kong, Pokfulam, Hong Kong {lfshang,kpchan,gdpan}@cs.hku.hk
Abstract. This paper presents a discriminative temporal topic model (DTTM) for facial expression recognition. Our DTTM is developed by introducing temporal and categorical information into the Latent Dirichlet Allocation (LDA) topic model. Temporal information is integrated by placing an asymmetric Dirichlet prior over the document-topic distributions. The discriminative ability is improved by a supervised term weighting scheme. We describe the resulting DTTM in detail and show how it can be applied to facial expression recognition. Experiments on the CMU expression database illustrate that the proposed DTTM is very effective for facial expression recognition.
1 Introduction Facial expressions are commonly modeled by the Facial Action Coding System (FACS) [7], in which each expression is a combination of atomic facial action units (AUs) such as Jaw Drop, Eye Closure and Cheek Raiser. These AUs are in turn represented by low-level features (e.g. movements of facial landmarks). The generative process of a facial expression can thus be seen as a combination of AUs, where each AU is a co-occurrence of low-level features. On the other hand, LDA is a three-level hierarchical generative graphical model that is widely used for modeling document corpora [1]. In this model, each document is an admixture of latent topics, where each topic is a co-occurrence of words. There is therefore a good correspondence between the LDA model and FACS, which motivates us to use a latent topic model, as described by LDA, to mimic the generative process of facial expressions. To apply LDA to facial expression recognition, low-level features correspond to words, atomic facial AUs (co-occurring low-level features) to topics, and a facial image (a combination of basic AUs) to a document. LDA has achieved significant success in the statistical text community. Besides modeling text generation, LDA has also been widely used to solve computer vision problems, such as scene categorization [8] and object discovery [14]. The standard LDA implies that data records (documents) are fully exchangeable. However, time-sequential face images have a strong temporal order, and exchanging them can lead to very different results. Thus we need to incorporate temporal information into LDA. Studies on extending LDA to model the temporal structure of topics and documents have gained more and more attention. Blei et al. [2] developed a dynamic topic model (DTM). In this model topics evolve over time, whereas AUs (topics) should not vary with time, so DTM is not suitable for facial expression recognition. Wang et al. [15]
proposed a continuous extension of DTM based on Brownian motion. Wang et al. [16] presented a topics over time (TOT) model that treats time as an observed continuous variable. Since all time stamps are drawn from one distribution, TOT is not appropriate for bursty data streams [17]. Wei et al. [17] presented dynamic mixture models (DMMs), which include temporal information by setting the expectation of the document-topic distribution of the t-th snapshot to be that of the (t−1)-th snapshot. Our DTTM is inspired by DMMs and can be viewed as a generalization of this model. The objective of our DTTM is to ensure that temporally adjacent facial images are likely to have similar document-topic distributions. This is implemented by placing an asymmetric Dirichlet prior over the document-topic distributions. The original LDA is an unsupervised generative model; to get better performance we also need to increase its discriminative ability. When LDA is used for classification, it often needs to be combined with a classifier such as an SVM. Recently, various supervised LDAs have been proposed, such as sLDA [3], DiscLDA [11] and MedLDA [20]. These supervised variants often rely on the optimization of objective functions (e.g. the joint likelihood of data and class labels, or expected margin constraints) to retain predictive power for supervised tasks. Our discriminative topic model is motivated by recent work on dealing with the problem of high-frequency words in topic models [18], which first introduced a term weighting (TW) scheme into LDA. However, that scheme was developed for an information retrieval (IR) problem and is not suitable for classification. In this work we therefore employ a supervised term weighting (sTW) scheme, originally used in the automatic text categorization domain [6], to embed categorical information into the term weights and thus increase the discriminative ability. Finally, both the asymmetric Dirichlet priors and the sTW scheme are incorporated into LDA through the Gibbs sampling model, and a single updating rule is obtained. Compared to most existing discriminative and temporal variants of LDA, the proposed DTTM has the merit of simplicity, since it does not introduce new latent variables or increase inference difficulty. The learning algorithm of the proposed DTTM is a weighted version of the original LDA learning algorithm, which makes it as efficient as LDA. In the experimental part, we first use an example to illustrate the correspondence between our DTTM and the FACS. Then we use cross-validation to verify the benefits of considering temporal and categorical information.
2 Face Tracking and Feature Extraction There are mainly two types of facial features: permanent and transient features. The permanent facial features are the shapes and locations of facial components (e.g. eyebrows, eyelids, nose, lips and chin). The transient features are the wrinkles and bulges that appear with expressions. Since transient features are not as reliable and repeatable as permanent features, we use only the permanent features. As in [12], a recently developed robust Active Appearance Model (AAM) [5] [19] is used to track the movement of facial points. Figure 1(a) shows the shape model used in our case, which is composed of 58 facial points. The blue digit next to each circle denotes the index of the facial point. Figure 2 displays the tracking results of one subject's six basic expressions. After localizing the facial points, we normalize and register faces
Fig. 1. (a) Shape model including 58 facial points and (b) selected feature points
based on four facial points, the inner corners of the two eyes (facial points 18 and 26) and the corners of the mouth (facial points 40 and 44), using an affine transformation. Facial images of 170 × 210 pixels were cropped from the normalized frames. 2.1 Facial Feature Extraction and Quantization In [4], the coordinates of the 58 localized facial points form a 116-dimensional vector representing an image. However, based on the analysis of AUs in FACS, we found that not all movements of facial points (e.g. facial points 1 and 13) are pertinent to the six basic expressions. Consequently, a subset is selected from the 116 movement features and is visually depicted in Fig. 1(b). In this figure, the solid triangles and rectangles denote that either the X- or Y-coordinate is used as a feature, and the solid circles denote that both the X- and Y-coordinates are used. After feature selection, each image is represented by a 54-dimensional feature vector (i.e. 26 rectangles + 4 triangles + 2 × 12 circles). The LDA model was proposed for the Bag-of-Words (BoW) representation, so we need to quantize the movements of the selected features into words that reveal both the movement directions and amplitudes. The movement in the X-axis direction is quantized into a word of the vocabulary WX = {LEFTi, RIGHTi, MotionlessXi | i = 1, 2, ..., 58}, where the word LEFTi (RIGHTi) represents that the i-th facial point moves at least one pixel to the left (right) from its neutral position; otherwise it is quantized to the word MotionlessXi. Similarly, the vocabulary describing the movements along the Y-axis is defined as WY = {UPi, DOWNi, MotionlessYi | i = 1, 2, ..., 58}.
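The registration step above can be sketched with a standard least-squares affine fit between the four anchor points and their positions in a reference shape. Only the anchor indices (18, 26, 40, 44) come from the paper; the function names, the use of NumPy, and the reference-shape convention are assumptions made here for illustration, not the authors' implementation.

import numpy as np

def estimate_affine(src_pts, dst_pts):
    # Least-squares 2D affine transform mapping src_pts -> dst_pts.
    # src_pts, dst_pts: (N, 2) arrays of corresponding points, N >= 3.
    n = src_pts.shape[0]
    A = np.zeros((2 * n, 6))
    b = np.zeros(2 * n)
    for i, (x, y) in enumerate(src_pts):
        A[2 * i] = [x, y, 1, 0, 0, 0]
        A[2 * i + 1] = [0, 0, 0, x, y, 1]
        b[2 * i], b[2 * i + 1] = dst_pts[i]
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.array([[p[0], p[1], p[2]],
                     [p[3], p[4], p[5]]])   # 2x3 affine matrix

def apply_affine(M, pts):
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return pts_h @ M.T

# Anchor points: inner eye corners (18, 26) and mouth corners (40, 44),
# registered to their positions in an assumed reference (mean) shape.
anchor_idx = [18, 26, 40, 44]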
Fig. 2. The tracking results of one subject’s six basic expressions
The number of times each of these words appears encodes the movement amplitude, so we need to define a mapping from movement amplitude to number of occurrences. The movement amplitude x is first mapped to the range (0, 1) by a logistic function

F(x; δ) = 1 / (1 + exp(−(x − δ))),   (1)

to reduce the inter-personal variations with regard to the amplitudes of facial actions. The number of occurrences is then calculated by

H(x; a, τ, δ) = fix(aF(x; δ) + τ),   (2)

where the function fix(·) returns the integer part of a specified number. In our experiments the parameters δ, τ, and a are set to 2, 0.3 and 6, respectively. Given the d-th image, its 54-dimensional feature vector is quantized into a BoW representation.
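A minimal sketch of this quantization, assuming the parameter values δ = 2, τ = 0.3 and a = 6 given above. The sign conventions for the movement directions, the handling of Motionless words, and all function names are assumptions made for illustration only.

import numpy as np

def occurrence_count(amplitude, a=6.0, tau=0.3, delta=2.0):
    # Map a movement amplitude to a word count via Eqs. (1)-(2).
    f = 1.0 / (1.0 + np.exp(-(amplitude - delta)))   # logistic squashing, Eq. (1)
    return int(a * f + tau)                          # fix(a*F + tau), Eq. (2)

def quantize_point_motion(dx, dy, idx, words):
    # Accumulate direction words for facial point `idx` displaced by (dx, dy)
    # from its neutral position; `words` maps word -> count.
    # Assumption: negative dx is "left", negative dy is "up"; motionless words counted once.
    if dx <= -1:
        words[f"LEFT{idx}"] = words.get(f"LEFT{idx}", 0) + occurrence_count(abs(dx))
    elif dx >= 1:
        words[f"RIGHT{idx}"] = words.get(f"RIGHT{idx}", 0) + occurrence_count(dx)
    else:
        words[f"MotionlessX{idx}"] = words.get(f"MotionlessX{idx}", 0) + 1
    if dy <= -1:
        words[f"UP{idx}"] = words.get(f"UP{idx}", 0) + occurrence_count(abs(dy))
    elif dy >= 1:
        words[f"DOWN{idx}"] = words.get(f"DOWN{idx}", 0) + occurrence_count(dy)
    else:
        words[f"MotionlessY{idx}"] = words.get(f"MotionlessY{idx}", 0) + 1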
3 The Discriminative Temporal Topic Model This section describes the methods for modeling temporal information and for increasing the discriminative ability of LDA. To establish notation, we first give a brief review of LDA. 3.1 LDA The standard LDA is an unsupervised generative topic model for text documents. Each document is represented as a bag of unordered words. To facilitate presentation, a word corpus w = {w1, w2, ..., wN} is constructed by concatenating these bags. In LDA, each individual word token wn is assumed to have been generated by a latent topic zn ∈ {1, 2, ..., T}, which is characterized by a discrete distribution over words with probability vector φ_{zn}. The topic zn is drawn from a document-specific distribution over topics with probability vector θ_{dn}, where dn ∈ {1, 2, ..., D} is the index of the document to which the n-th word wn belongs. If the probabilities Φ = {φ1, ..., φT} and Θ = {θ1, ..., θD} are given, the joint probability of the corpus w and the set of corresponding latent topics z = {z1, ..., zN} is

P(w, z | Φ, Θ) = ∏_{n=1}^{N} φ_{wn|zn} θ_{zn|dn},   (3)

where φ_{wn|zn} is the probability of generating the word wn from the topic zn, and θ_{zn|dn} is the probability of generating the topic zn from the document dn. To make the model fully Bayesian, symmetric Dirichlet priors are often placed over Θ and Φ, i.e. P(Θ | αu) = ∏_d Dir(θ_d | αu) and P(Φ | βv) = ∏_t Dir(φ_t | βv), where Dir(· | αu) is a Dirichlet distribution with concentration parameter α and base measure u. The base measures u and v are both uniform distributions. Combining the two priors with the joint distribution (3) and integrating over Θ and Φ gives the joint probability of the corpus and latent topics given the hyperparameters, P(w, z | αu, βv), from which the posterior probability P(z | w, αu, βv) of the latent topics z is calculated.
3.2 Modeling Temporal Information In our DTTM, the temporal information is included by replacing the uniform base measure u of Dir(θ_d | αu) with an asymmetric base measure m_d (note that m_d is image dependent), which is calculated from the prior knowledge embedded in temporally neighboring images. Take the d-th image (document) I_d as an example: if θ_d is given an asymmetric Dirichlet prior with concentration parameter α and non-uniform base measure m_d, the predictive probability of generating topic t from image I_d given z is

P(t | z, αm_d) = ∫ P(t | θ_d) P(θ_d | z, αm_d) dθ_d = (N_{t|d} + α m_{t|d}) / (N_d + α),   (4)
where N_d is the number of words in image I_d, N_{t|d} is the number of times topic t occurs in image I_d, and m_{t|d} is the t-th element of the vector m_d. The probability P(θ_d | z, αm_d), the posterior probability of θ_d after the observation z is taken into account, is still a Dirichlet distribution, with parameters {N_{t|d} + α m_{t|d}}, t = 1, ..., T. P(t | z, αm_d) is the expectation of P(θ_d | z, αm_d) and can be written as a weighted average of the prior mean m_{t|d} and the observed conditional frequency f_{t|d} = N_{t|d} / N_d:

P(t | z, αm_d) = λ_d f_{t|d} + (1 − λ_d) m_{t|d},   (5)
where λ_d = N_d / (N_d + α), so that f_{t|d} is smoothed by the prior expectation m_{t|d}. The value of α determines to what extent the current image I_d is affected by its neighboring images within the same sequence: the larger the value of α, the more temporal dependence between adjacent images is considered. We model the dependence of neighboring images by approximating m_d as

m_{t|d} = Σ_{I_s ∈ N(I_d)} k(I_d, I_s) θ̂_{t|s} / Σ_{I_s ∈ N(I_d)} k(I_d, I_s),   (6)

where N(I_d) is the neighborhood of the image I_d, θ̂_s is the estimate of θ_s obtained from a set of samples as in [9], θ̂_{t|s} is the t-th element of the vector θ̂_s, and k(I_d, I_s) is a kernel function measuring the similarity between two images. If we want to preserve the continuity of document-topic distributions along the temporal direction, the neighborhood set can be defined as the set of temporally adjacent neighbors given by a sliding window, N(I_d) = {I_{d−Wleft}, I_{d−Wleft+1}, ..., I_{d+Wright}}. The kernel function can be defined as

k(I_d, I_s) = exp(−(d − s)^2 / (2 l_x^2)),   (7)

where l_x is the length scale, and d and s are the indices of the current image and its neighboring image, respectively. In our experiments, the neighborhood set N(I_d) = {I_{d−1}, I_{d+1}} is used, so k(I_d, I_{d−1}) = k(I_d, I_{d+1}) = exp(−1/(2 l_x^2)) and m_{t|d} is calculated as m_{t|d} = (θ̂_{t|d−1} + θ̂_{t|d+1}) / 2. The DMMs can be viewed as a special case of the proposed temporal model, obtained by setting m_d to θ̂_{d−1}.
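The asymmetric base measure of Eqs. (6)-(7) can be computed as in the following sketch. The boundary handling (falling back to a uniform base measure at sequence ends), the array layout, and the function name are assumptions not specified in the paper.

import numpy as np

def temporal_base_measure(theta_hat, d, lx=1.0):
    # Asymmetric Dirichlet base measure m_d for image d, Eqs. (6)-(7).
    # theta_hat: (D, T) array of current document-topic estimates for one sequence.
    # Uses the neighborhood N(I_d) = {I_{d-1}, I_{d+1}} as in the paper.
    D, T = theta_hat.shape
    neighbors = [s for s in (d - 1, d + 1) if 0 <= s < D]
    if not neighbors:
        return np.full(T, 1.0 / T)   # assumed fallback: uniform base measure
    w = np.array([np.exp(-((d - s) ** 2) / (2.0 * lx ** 2)) for s in neighbors])
    return (w[:, None] * theta_hat[neighbors]).sum(axis=0) / w.sum()

# With both neighbors present, the equal kernel weights reduce this to
# m_{t|d} = (theta_hat[d-1] + theta_hat[d+1]) / 2, as stated above.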
3.3 Increasing Discriminative Ability Unlike existing supervised variants of LDA, which are trained with discriminative criteria (e.g. maximum margin), we incorporate the sTW scheme into LDA. Term weighting on LDA has been proven effective for IR problems; in this work we combine LDA with sTW to investigate whether classification problems can also benefit from TW schemes. The sTW is a supervised variant of TW for categorization problems and has been widely used in the text categorization domain. Let s(w, d) denote the supervised weight of word w in image I_d. The sTW used in our experiments is a supervised variant of tf-idf weighting obtained by replacing the idf function with the information gain (IG) [6] as follows:

STW(w, d) = log(1 + tf(w, d)) Σ_{c=1}^{6} IG(w, c),   (8)
where tf(w, d) denotes the term frequency of word w in image I_d, and IG(w, c) denotes the information gain of word w for category c. For the detailed calculation of IG please refer to [6]. The weight s(w, d) is calculated from STW(w, d) by normalization:

s(w, d) = STW(w, d) / sqrt( Σ_w STW(w, d)^2 ).   (9)
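A small sketch of the supervised term weighting of Eqs. (8)-(9), assuming the information-gain values IG(w, c) have already been computed and reading the normalization in (9) as cosine normalization over the words of each image; function names and array shapes are illustrative assumptions.

import numpy as np

def supervised_term_weights(tf, ig):
    # tf: (D, W) term-frequency counts per image; ig: (W, C) information gain per word and class.
    stw = np.log1p(tf) * ig.sum(axis=1)                     # STW(w, d) = log(1 + tf) * sum_c IG(w, c)
    norm = np.sqrt((stw ** 2).sum(axis=1, keepdims=True))   # per-image normalization, Eq. (9)
    return stw / np.maximum(norm, 1e-12)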
Both the asymmetric priors and the sTW are combined in the Gibbs sampling update for the topic zn as follows:

P(zn = t | z_{−n}, w, α, β) ∝
  [ Σ_{i≠n} s(w_i, d_i) I(z_i = t) I(w_i = w_n) + β/W ] / [ Σ_{i≠n} s(w_i, d_i) I(z_i = t) + β ]
  × [ Σ_{i≠n} s(w_i, d_i) I(z_i = t) I(d_i = d_n) + α m_{t|d_n} ] / [ Σ_{i≠n} s(w_i, d_i) I(d_i = d_n) + α ],   (10)
where I(·) is the indicator function that returns 1 if its argument is true. Our learning algorithm may seem very different from the learning algorithm of LDA as given in [9]. However, if we set s(w_i, d_i) ≡ 1 and m_{t|d_n} ≡ 1/T, our learning algorithm (10) reduces to the learning algorithm of the LDA model. The main difference between the two learning algorithms is therefore that we assign different weights to the visual words and smooth the prior m_d. A similar weighted learning algorithm has been studied in [18], where its convergence was proved. With a set of samples from the posterior distribution P(z | w), Θ and Φ can be estimated from the right-hand side of equation (10) by considering all words and their topic assignments. Through the learning algorithm (10), both the temporal and the categorical information are thus incorporated into DTTM via the Gibbs sampling model.
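The weighted collapsed Gibbs update of Eq. (10) can be sketched as below. This is a didactic, unoptimized implementation under the reading of Eq. (10) given above; setting s ≡ 1 and m ≡ 1/T recovers the standard LDA sampler, as noted. All names and data layouts are assumptions.

import numpy as np

def gibbs_sweep(w, d, z, s, m, T, W, alpha, beta, rng):
    # One collapsed Gibbs sweep for DTTM-style weighted LDA (cf. Eq. (10)).
    # w, d, z: word id, document id, current topic of each token (length-N arrays).
    # s: per-token supervised weight s(w_i, d_i); m: (D, T) base measures m_{t|d}.
    N, D = len(w), m.shape[0]
    nwt = np.zeros((W, T)); nt = np.zeros(T)
    ndt = np.zeros((D, T)); nd = np.zeros(D)
    for i in range(N):                                  # weighted sufficient statistics
        nwt[w[i], z[i]] += s[i]; nt[z[i]] += s[i]
        ndt[d[i], z[i]] += s[i]; nd[d[i]] += s[i]
    for i in range(N):
        nwt[w[i], z[i]] -= s[i]; nt[z[i]] -= s[i]       # remove token i from the counts
        ndt[d[i], z[i]] -= s[i]; nd[d[i]] -= s[i]
        p = ((nwt[w[i]] + beta / W) / (nt + beta)) * \
            ((ndt[d[i]] + alpha * m[d[i]]) / (nd[d[i]] + alpha))
        p /= p.sum()
        z[i] = rng.choice(T, p=p)                       # sample a new topic, Eq. (10)
        nwt[w[i], z[i]] += s[i]; nt[z[i]] += s[i]       # add the token back
        ndt[d[i], z[i]] += s[i]; nd[d[i]] += s[i]
    return z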
3.4 Applying DTTM to Facial Expression Recognition After obtaining the BoW representation of the facial expression database, the DTTM is first learned on the training database. Let Θ^[Tr] = {θ_i^[Tr]} denote the learned document-topic distributions of the training database. For testing facial images not in the training set, we need to quickly assess the topic assignments with an efficient inference method. We adopt the efficient Monte Carlo algorithm described in [13]. The basic idea of this method is to run only on the word tokens in the new image. Once the j-th testing image is obtained, we sample new assignments of words to topics by applying equation (10) only to the words in the j-th testing image. After several sampling iterations (30 iterations in our simulation), we obtain the topic assignments and estimate the document-topic distribution θ_j^[Te]. The probability of classifying the j-th testing image to class c is calculated by an exemplar-based classifier. For facial expression classification, the exemplar model consists of the labeled training exemplars Θ^[Tr] and a similarity function sim(θ_j^[Te], θ_i^[Tr]) measuring how closely a new observation θ_j^[Te] is related to θ_i^[Tr]. The posterior probability of the new observation being classified to class c is

P(c | θ_j^[Te]) = Σ_i sim(θ_j^[Te], θ_i^[Tr]) I(c_i^[Tr] = c) / Σ_i sim(θ_j^[Te], θ_i^[Tr]),   (11)
c=1,...,6
M j=1
[Te]
P (c|θj
).
(12)
Discriminant or sequential data classifiers are not used in this work, since we want to investigate how much can we benefit from the proposed DTTM.
4 Experiments 4.1 Dataset We use the CMU Cohn-Kanade Database [10] to evaluate the performance of DTTM. This database consists of 100 university students ranging in age from 18 to 30. Sixtyfive percent were female, fifteen percent African-American and three percent Asian or Latino. In this work we attempt to recognize six prototypic expressions (namely joy, surprise, anger, disgust, sadness and fear). 4.2 Parameter Settings Before using the proposed topic model and LDA, we need to first set the hyperparameters α, β and the number of latent topics T . For all runs of our algorithm, we set α and
DTTM: A Discriminative Temporal Topic Model for Facial Expression Recognition
603
β to constant values α = 200 and β = 0.01W = 4.35, where W = 435 is the vocabulary size. T is a very influential parameter for any latent topic model, it is generally acceptable to use empirical methods to determine optimal value of T. We run our model for different T values and found latent topics T = 19 provides the best recognition rate. 4.3 An Intuitive Example and Analysis of Topics To facilitate the understanding of the correspondence between the DTTM model and the FACS based facial expression recognition in an intuitive way, we created a short image sequence as shown in Figure 3, in which the subject performed surprise expression. In FACS the surprise expression is described as the combination of the following five AUs, i.e. AU1 (Inner Brow Raiser) + AU2 (Outer Brow Raiser) + AU5 (Upper Lid Raiser) + AU26 (Jaw Drop) + AU27 (Mouth Stretch). Figure 4 draws the probability distributions of the six basic expressions for this sequence by the proposed method. For the first two frames, the probabilities of the six expressions are close to each other, which implies that the two frames have neutral expression. The probability of surprise becomes significantly larger than the other expressions from the third snapshot, since just from this frame the inner and outer brows begin to raise (AU1+AU2), the upper lids are also slightly raised (AU5) and jaw is dropped (AU26). As the expression progresses with time the probability of surprise increases gradually and approaches to 0.8 at the 8-th snapshot. This example illustrates that our model can well model the evolution of facial expression. We then investigate the relationship between document-topic distributions and the ¨ i denote the document-topic distribution for the i-th snapevolving expressions. Let θ shot of the exemplar sequence. The importance of the 19 topics is ranked according 8 to the value i=1 θ¨t|i , which is the sum of each topic probability along the image sequence. Figure 5 shows the probabilities of the top-6 topics for the sequence. We can
Fig. 3. An image sequence shows a subject performing surprise
1 Joy
Surprise
Anger
Disgust
Sad
Fear
Probability
0.8 0.6 0.4 0.2 0 1
2
3
4 5 Frame Number
6
7
8
Fig. 4. The probability distributions of facial expressions for the sequence shown in Fig. 3
604
L. Shang, K.-P. Chan, and G. Pan Document−Specific Distribution
0.14 Topic1
Topic2
Topic3
Topic4
Topic5
Topic6
0.12 0.1 0.08 0.06 0.04 0.02
1
2
3
4 5 Frame Number
6
7
8
Fig. 5. The top-6 document-topic distributions of the sequence shown in Fig. 3
Topics
Words
Topic1
UP30, UP31, UP32, UP33, UP35, UP36,UP37, UP38
Topic2
UP15, UP16,UP17, UP23, UP24, UP25, UP41, U43
Topic3
DOWN45, DOWN46, DOWN47
Topic4
DOWN19, DOWN20, DOWN21, DOWN27, DOWN28,DOWN29, LEFT35, RIGHT30, RIGHT49
Topic5
DOWN6, DOWN7,DOWN8, UP41, UP42, UP43
Topic6
MotionlessY23, MotionlessY24, MotionlessY25, MotionlessY19, MotionlessY20, MotionlessY21, MotionlessY14, MotionlessY22, MotionlessX41, MotionlessX45, MotionlessX47
Fig. 6. The dominant words of the top-6 topics
see that the probabilities of some topics gradually increase (e.g. topics 2, 4 and 5) or always preserve relatively larger values (e.g. topics 1, and 3). Thus these topics are positively relevant to the surprise expression. The variation trend of the topic6 is different from the first five topics, it has the largest value for the neutral expression snapshot (i.e. the first frame), but quickly decreases when the expression approaches to apex surprise. To explain this phenomena, we need to know the content of each topic. Figure 6 shows the dominant words of each topic. For a given topic, the “dominant” words are the words which occur at least 100 times in this topic. Topic1 has eight dominant words, which describe the upward movement of facial points located at two brows, i.e. the topic1 corresponds to the combination of AU1 and AU2 and the two AUs are the key components of surprise expression. Topic2 reveals that the upper lids moving upward often occurs with the upward movement of upper lips. In Topic3, there are three words, DOWN45, DOWN46, and DOWN47, which describes the lower lip moves downward. The Topic4 includes 9 dominant words, in which the first 6 words are about the downward movements of the lower lids and the words LEFT35 and RIGHT30 are the movements of inner brows in the horizontal direction. The word RIGHT49 is about the movement of facial point 49, which is a nose facial point and seems to be irrelevant to surprise expression. In Topic5, the words are mainly about the downward movements of the jaw. The words of Topic6 are all about the motionless movement of some facial points, so for the neutral expression this topic will have a higher probability
DTTM: A Discriminative Temporal Topic Model for Facial Expression Recognition
605
Table 1. Comparison with different LDAs based method for sequence level recognition Methods LDA TTM DTM DTTM
JOY SUR (%) ANG (%) DIS (%) SAD (%) FEA (%) Overall(%) 94.44 100.00 91.67 88.89 88.89 88.89 92.13 91.67 100.00 88.89 83.33 97.22 97.22 93.06 94.44 100.00 91.67 91.67 97.22 97.22 95.37 94.44 100.00 97.22 91.67 97.22 97.22 96.29
and gradually decrease with the evolution of expressions. The first five topics can be attributed to the upward movement of brows, upper lids, upper lips and the downward movement of lower lids, lower lips and jaw, which are consistent with the AUs for describing the surprise expression. 4.4 Comparison of Recognition Results We use cross-validation to verify the benefits of incorporating temporal and categorical information. Table 1 (the upper one) presents the recognition results, where TTM is the temporal topic model and DTM is the discriminative topic model. From Table 1, we can see that our method DTTM achieves 96.29% overall recognition rate and outperforms all the other three methods. The TTM (93.06%) only slightly outperforms the original LDA (92.13%) and the DTM (95.37%) has better performance than TTM (93.06%), which implies modeling temporal information is not so important for sequence level recognition compared to increasing discriminative ability. This is because in sequence level recognition the recognition results for frames are summed together for each sequence (see the formula (12)), so even if several frames are misclassified the whole sequence can still be correctly classified.
5 Conclusions This paper proposed a new latent topic model DTTM for facial expression analysis by integrating temporal and categorical information with LDA. In our DTTM, an asymmetric Dirichlet prior is placed over document-topic distributions, through which the topic generative probability of each image is smoothed by that of its neighboring images. We used a sTW scheme to embed the categorical information into word weights, which are combined with LDA through Gibbs sampling model. Finally a single learning rule is obtained. Experiments on CMU expression database confirmed the effectiveness of the proposed DTTM in facial expression recognition.
References 1. 2. 3. 4.
Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. JMLR 3(2-3), 993–1022 (2003) Blei, D., Lafferty, J.D.: Dynamic topic models. In: ICML (2006) Blei, D., McAuliffe, J.D.: Supervised topic models. In: NIPS, pp. 121–128 (2007) Chang, Y., Hu, C., Turk, M.: Probabilistic expression analysis on manifolds. In: CVPR, pp. 520–527 (2004)
606
L. Shang, K.-P. Chan, and G. Pan
5. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans. on PAMI 23(6), 681–685 (2001) 6. Debole, F. and Sebastiani, F.: Supervised term weighting for automated text categorization. In: ACM SAC, pp. 784–788 (2003) 7. Ekman, P., Friesen, W.V.: Facial Action Coding System (FACS): Manual. Consulting Psychologists Press, Palo Alto (1978) 8. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories. In: CVPR, pp. 524–531 (2005) 9. Griffiths, T.L., Steyvers, M.: Finding scientific topics. PNAS 101, 5228–5235 (2004) 10. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive database for facial expression analysis. In: FG, pp. 46–53 (2000) 11. Lacoste-Julien, S., Sha, F., Jordan, M.I.: DiscLDA: Discriminative learning for dimensionality reduction and classification. In: NIPS, pp. 897–904 (2008) 12. Shang, L., Chan, K.P.: A temporal latent topic model for facial expression recognition. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part IV. LNCS, vol. 6495, pp. 51–63. Springer, Heidelberg (2011) 13. Steyvers, M., Smyth, P., Rosen-Zvi, M., Griffiths, T.: Probabilistic Author-Topic Models for Information Discovery. In: KDD, pp. 306–315 (2004) 14. Wang, X., Grimson, E.: Spatial latent dirichlet allocation. In: NIPS (2007) 15. Wang, C., Blei, D., Heckerman, D.: Continuous time dynamic topic models. In: UAI (2008) 16. Wang, X., McCallum, A.: Topics over time: A non-Markov continuous-time model of topical trends. In: SIGKDD (2006) 17. Wei, X., Sun, J., Wang, X.: Dynamic mixture models for multiple time-series. In: IJCAI, pp. 2909–2914 (2007) 18. Wilson, A.T., Chew, P.A.: Term weighting schmes for latent Dirichlet allocation. In: NAACL, pp. 465–473 (2010) 19. Zhou, M., Liang, L., Sun, J., Wang, Y.: AAM based face tracking with temporal matching and face segmentation. In: CVPR, pp. 701-708 (2010) 20. Zhu, J., Ahmed, A., Xing, E.P.: MedLDA: Maximum margin supervised topic models for regression and classification. In: ICML (2009)
Direct Spherical Parameterization of 3D Triangular Meshes Using Local Flattening Operations Bogdan Mocanu and Titus Zaharia Institut Télécom / Télécom SudParis, ARTEMIS Department, UMR CNRS 8145 MAP5, Evry, France {bogdan.mocanu,titus.zaharia}@it-sudparis.eu
Abstract. In this paper, we propose a novel spherical parameterization approach for closed, genus-0, two-manifold, 3D triangular meshes. The method exploits a modified version of the Gaussian curvature, associated to the model vertices. Valid spherical embeddings are obtained by locally flattening the mesh in an iterative manner, which makes it possible to convert the initial mesh into a rounded, sphere-like surface that can be directly mapped onto the unit sphere. Our approach shows superior performances with respect to state of the art techniques, with a reduction in terms of angular and area distortions of more than 35% and 19% respectively.
1 Introduction Firstly used in the computer graphics field in order to map textures onto surfaces [1], 3D mesh parameterization has become lately an essential phase in numerous mesh processing applications such as surface-fitting [2], mesh-editing [3], re-meshing [4], compression [5] and morphing [6]. The interest of 3D mesh parameterization techniques comes from the fact that complex operations that are intractable on the original 3D surface representation can be performed easily on a simple parametric domain such as the unit disk or the unit sphere. The parameterization of a 3D surface S ⊂ IR 3 is defined as a homeomorphism Φ :S→D which maps the surface S over an appropriate 2D domain D ⊂ IR 2 . In the case of a 3D mesh M = (V , E, F ) , where V, E and F respectively denote the sets of vertices, edges and triangles, a parameterization is defined as a piece-wise linear embedding completely specified by a function ϕ : V → D , which associates to each nota tion
vertex pi(xi,yi,zi) of V a point φ i = φ ( p i ) in the parametric domain D. The domain D is selected in most of the cases based on the original model topology. Most often, for open triangular meshes the intuitive way to obtain a parameterization is to map its vertices in a planar domain while for closed models, a spherical domain (i.e., the unit sphere) is more appropriate. As a homeomorphism, the mapping must satisfy the bijectivity property which ensures that all mesh triangles have an appropriate image in the parametric domain and do not overlap. In practice, there are some circumstances when the obtained function is not continuous or bijective and this leads to invalid parameterizations G. Bebis et al. (Eds.): ISVC 2011, Part I, LNCS 6938, pp. 607–618, 2011. © Springer-Verlag Berlin Heidelberg 2011
608
B. Mocanu and T. Zaharia
(foldovers - triangles overlapping). Over the time numerous approaches [7], [8], [9], [10], [11] have been proposed in order to prevent the triangles from flipping. However, the price to pay is an increase in the model shape distortion. In this paper, we propose and validate a novel spherical parameterization algorithm for closed, genus-0, two-manifold, 3D triangular meshes which exploits the Gaussian curvature associated to the mesh vertices in order to obtain valid transformations. The rest of the paper is organized as follows. Section 2 gives an overview of the main families of parameterization methods. Next, in Section III, we describe in details the proposed mesh parameterization algorithm, while in Section IV we present and discuss the experimental results obtained on a test set of 3D objects selected from the Princeton and MPEG 7 database. Finally, Section V concludes the paper and opens perspectives of future work.
2 Related Work A first family of parameterization methods is dedicated to planar mapping of 3D models with disk-like topology [7], [8]. Such techniques are based on physical mesh models, where edges are represented as springs with various elasticity parameters, linking the model vertices. Several approaches attempt to obtain the best shape preserving parameterization. The discrete harmonic mapping introduced in [12] uses differential geometry to minimize the Dirichlet energy of a piecewise linear embedding, but with fixed boundary. Desbrun et al. [7] improve this method by refining the position of the boundary vertices with the help of a minimization procedure (discrete conformal mapping). The mean value coordinates technique first introduced by Floater [8] aims to preserve the model angles by developing a generalized barycentric coordinate system which expresses a vertex as a linear combination of its neighbors. With the new set of weights the resulting system guarantees a bijective mapping for any type of open 3D models. However, in practice this technique yields less satisfactory results than the classical harmonic mapping [12]. Another class of planar parameterization methods consists of the so-called discrete authalic maps which aim to preserve the original model areas. However, in [13] Floater proved that authalic mappings, unlike conformal ones, are not unique. For this reason, additional conditions/constraints should be imposed in order to obtain stable and computationally tractable solutions. For 3D objects with arbitrary topologies (i.e., different from the unit disk) and in particular, for closed, genus 0 meshes with spherical topology, the issue of finding appropriate parameterizations proves to be more challenging and more complex than in the planar case. A first family of approaches [9], [14] attempt to extend and adapt existing planar parameterization methods. However, such simple extensions prove to be inefficient when dealing with complex geometries. Another approach [10], [11] consists of creating an artificial boundary by determining a path along the mesh edges. The mesh is then cut along the path. The resulting patches are open meshes and can be treated with the help of planar mapping methods. Despite its simplicity, the results depend heavily on the choice of the considered path and there is no guarantee that the cuts will have the same lengths on
Direct Spherical Parameterization of 3D Triangular Meshes
609
both sides in the parametric domain. In order to overcome this problem Kharevych et al. [15] introduce a different approach based on cone singularities that is further improved by Ben-Chen et al. in [16]. The main concept behind their techniques is to concentrate the entire Gaussian curvature of the mesh into a small number of vertices situated on the boundary, while the sum of the curvatures on both sides of the cut is minimized. Kent et al. [17] propose a technique that returns valid spherical embeddings only if the model is convex. Starting from the Kent technique, Alexa [18] develop a new vertex relaxation procedure that iteratively modifies each vertex position by placing it in the barycenter of its neighbors. The relaxation process resolves the foldovers with a low computational time. However, in some cases, the Alexa’s algorithm can breakdown into a single point and it is necessary to fix several vertices (called anchors) in the parametric domain. Without a sufficient number of adequately selected anchors the embedding may collapse. The drawback of such an approach comes from the fact that anchor vertices cannot be moved and cause triangle overlapping problems. In order to overcome such a limitation, the author proposes to replace the anchors after a certain number of iterations. However, the process is difficult to be controlled and does not guarantee a valid embedding in all cases. Praun and Hoppe propose in [19] a different spherical parameterization approach, based on a mesh simplification procedure that reduces the original mesh to a tetrahedron. The tetrahedron is then simply mapped onto the sphere and the vertices are successively inserted with the help of a progressive mesh sequence. After each vertex insertion, an optimization procedure is applied in order to minimize the stretch metric of the parameterization. The analysis of the state of the art shows a lack of spherical parameterization methods that can fully respond to different requirements such as low area/angular/length distortions, generality and computational efficiency. The approach proposed in this paper is presented in the next section.
3 Curvature-Driven Spherical Parameterization The spherical parameterization method proposed extends the previous state of the art methods introduced in [18] and [19]. Fig. 1 presents a synoptic overview of the proposed approach. Analogously to the method proposed in [19], we incorporate in our implementation a mesh simplification step. In contrast with the Praun and Hoppe approach, the decimation process is controlled such that the simplified mesh obtained still preserve the main geometrical features of the initial model. The role of the mesh simplification procedure is to decrease the computational complexity of the parameterization process. A sequence of edge collapse operations [20], driven by a quadratic error metric [21], is here applied. The parameterization process is further carried out on the simplified, decimated mesh. The algorithm exploits the Gaussian curvature associated to the mesh vertices in order to define a sequence of locally flattening operations. At the end of this procedure, the simplified model is transformed into a rounded, sphere-like surface that can be directly mapped onto the unit sphere. Finally, the removed vertices of the
610
B. Mocanu and T. Zaharia
Fig. 1. Overview of the proposed method
initial mesh are successively inserted on the sphere by constructing a progressive mesh sequence based on a suite of vertex split operations. The algorithm avoids solving complex, non-linear systems of equations and thus is computationally efficient. The various steps involved are detailed in the following sub-sections. 3.1 Preprocessing Steps Our algorithm starts with two preprocessing phases, which are pose normalization and mesh simplification. 3.1.1 3D Model Normalization Since 3D mesh objects can be defined by arbitrary positions and orientations in a 3D Cartesian coordinates system, we first apply a PCA-based normalization [22] in order to align the object with respect to its principal axes and scale it to the unit sphere. 3.1.2 Mesh Simplification In order to reduce as much as possible the computational complexity of the parameterization process and thus be able to treat complex meshes with a large number of vertices, we have adopted a mesh simplification scheme inspired from [19]. For simplification purposes, we have adopted the Hoppe’s edge collapse operator [20] because of its simplicity and for its capacity to preserve the initial mesh topology. Concerning the metric involved, instead of using the native measure introduced in [20], which suffers from a high computational complexity, we have adopted the quadric error metric proposed by Garland and Heckbert [21]. In contrast with the original technique in [21], which allows the contraction of arbitrary pairs of vertices, in our case we join solely vertices that define an edge in the mesh. This constraint is useful for preserving the original mesh topology.
Direct Spherical Parameterization of 3D Triangular Meshes
611
The mesh simplification process is stopped when the mean square error between two simplified versions M’ and M” of the original model exceeds a pre-established threshold Terr. Let denote by MN the input model with N number of vertices and by {MN-1, MN-2, ... , M0} the progressive sequence of simplified meshes. Then, at each step i, we have considered M’ = M i and M” = M i-100. This stop criterion makes it possible to dynamically determine the point, in the simplified mesh sequence, where the mesh geometry variation becomes important. In our experiments, the threshold was empirically set to 0.0025. This value provides a good compromise between quality of the resulting simplified meshes and computational time required. The proposed decimation algorithm yields high quality results that preserve the initial model’s shape and topology (Fig. 2) even for drastic simplification rates, in a relatively short time. The obtained simplified mesh is further processed as described in the next section. 16995 vertices
700 vertices
16266 vertices
a.
800 vertices
b.
Fig. 2. Initial and simplified meshes for the: (a) Dino and (b) Alien models
3.2 Gaussian Curvature-Driven Spherical Mapping Curvature measures are by definition expressed as functions of the second order surface derivatives and thus associated with smooth, C2–continuous surfaces. However, because a 3D mesh model is not a smooth surface, it is necessary to perform a piecewise linear approximation in order to obtain an estimation of the curvatures. In this paper, we have adopted the approach proposed in [23]. For a vertex p of a mesh M, let { pi ∈ Neighbours( p) | i = 1, l} be the set of the one-ring neighbor vertices and {( pi ppi +1 ) ∈ F | i = 1, l} the set of adjacent triangles.
If we denote with α i the angle determined by pi, p, and pi+1 , then we can compute the angular defect at the point p as:
δ = 2π − ∑ α i .
(1)
i
The Gaussian curvature is then defined as described by the following equation:
K=
3(2π − ∑ α i ) i
∑ Ai ( p) i
.
(2)
612
B. Mocanu and T. Zaharia
where Ai(p) is the area of each triangle adjacent to p. The proposed parameterization algorithm proceeds with the following three core steps: Step1 – Curvature-driven iterative flattening First, we compute the Gaussian curvature Kp for each vertex p of the mesh. Then, we determine the vertex pmax with the maximum absolute value of the Gaussian curvature. The barycenter of its neighboring nodes is computed, as described by the following equation:
∑
p' max =
pi pi ∈Neigh ( p max ) val ( p max )
(3)
.
where Neigh(pmax) denotes the set of vertices adjacent to pmax and val(pGMax) represents its valence. If the Euclidian distance between new and initial positions ||p’max – pmax|| is superior to a threshold dist, its position is changed to p’max. Otherwise, the considered vertex is not affected and the algorithm selects as a candidate the following highest curvature vertex, reiterating the process. When modifying the position of a vertex, the various measures (triangle areas and angles) involved in the computation of the Gaussian curvatures, need to be recomputed. This is done locally, exclusively for the displaced vertex and for its neighbors, since the other mesh vertices are not affected. The process is repeated recursively: determine the vertex with maximum Gaussian curvature, compute the barycentric coordinates, displace the vertex and re-compute K only for the affected vertices. Thus, salient mesh vertices are firstly detected and processed, leading, after each iteration, to a locally flattened version of the 3D mesh model. At the end of the process, a sphere-like surface is obtained. In contrast with [15] and [16] that identify the high-curvature vertices and concentrate the entire mesh curvature there, we aim at determining such vertices in order to distribute the curvature to its neighbors and thus construct models with constant curvature values, like the unit sphere. The classical Gaussian curvature in equation (2) privileges the selection of vertices located in densely sampled mesh regions, where the triangle areas tend to zero. This behavior can penalize our algorithm, which can perform long sequences of iterations inside such regions. In order to avoid such a problem, we have considered a modified expression of the Gaussian curvature, defined as: KS =
2π − ∑ α i i
χ s + ∑ Ai ( p) / 3
,
(4)
i
where χS denotes the average triangle area, computed over the entire mesh. The correction factor χS makes it possible to reinforce, in the selection process, the influence of the angular defect term and thus to avoid long loops in densely sampled regions characterized by low values of triangle areas.
Direct Spherical Parameterization of 3D Triangular Meshes
613
Step 1 is successively repeated for a number It of iterations. At the end of step 1, PCA and size normalization are re-applied in order to avoid shrinkage problems. Step 2 – Visibility check and projection onto the sphere At this stage, we first check if the mesh obtained at the end of step 1 can be projected onto the sphere. This consists of applying to each mesh vertex a visibility test performed with the help of a ray casting operation. Finally, if all mesh vertices are visible from the object’s gravity center, the mapping onto the unit sphere is simply obtained by a vertex projection defined as described by the following equation:
∀pi ∈ V ,
φi =
pi . pi
(5)
where φ i is the image on the unit surface sphere of the vertex pi. The visibility property ensures that the obtained parameterization is bijective. If the visibility condition is not satisfied, then step 1 is re-iterated. Step 3 – Vertex split sequence In this phase, all vertices removed in the mesh simplification process are iteratively re-inserted on the sphere by constructing a progressive mesh sequence analogously to the method described in [20] by Hoppe. The algorithm exploits the fact that a contraction operation is invertible. For each edge collapse, a corresponding inverse operator, called vertex split, is defined. In contrast with Hoppe objectives that try to reconstruct the original shape of the model from a coarser version of it, we aim to return to the original mesh topology with its surface directly mapped on the sphere. Thus, the objective is to re-insert a removed vertex in the mesh structure without generating triangle flipping or degenerate triangles. This requires a position optimization of the vertex to be inserted. In contrast to the approach in [20], we have adopted a simple, yet efficient optimization procedure. The first ring neighborhood of the considered vertex (Fig. 3.a) is first subdivided in order to obtain a set of potential positions (Fig. 3.b). A sub-set of valid possibilities (i.e., position which do not lead to overlaps or degenerate triangles) is then determined (Fig. 3.c). Among them, the vertex which provides the optimal angular distribution of the corresponding triangles is determined. In order to reach this objective, we select the position which yields the maximal value of the minimal angle of the adjacent triangles. Note, that if the mapping is an embedding prior the vertex split operation, then it should remain also valid after the insertion.
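The flattening loop (Step 1) and the radial projection (Step 2) can be summarized by the following schematic sketch; it is not the authors' implementation, the visibility test of Step 2 is omitted, and the centering before projection, the stopping rule and all names are assumptions.

import numpy as np

def flatten_iteration(V, neighbors, curvature_fn, dist=1e-3, iters=1000):
    # Sketch of Step 1: repeatedly move the most curved vertex to the barycenter
    # of its one-ring (Eq. (3)) and update curvatures (Eq. (4)) locally.
    # V: (n, 3) positions; neighbors[i]: one-ring indices; curvature_fn(V, i): K_S at vertex i.
    K = np.array([curvature_fn(V, i) for i in range(len(V))])
    for _ in range(iters):
        order = np.argsort(-np.abs(K))            # most salient vertices first
        for i in order:
            bary = V[neighbors[i]].mean(axis=0)   # barycenter of the one-ring
            if np.linalg.norm(bary - V[i]) > dist:
                V[i] = bary
                for j in [i] + list(neighbors[i]):  # local curvature update only
                    K[j] = curvature_fn(V, j)
                break                             # restart from the new most-curved vertex
        else:
            break                                 # no vertex moved: surface is flat enough
    return V

def project_to_sphere(V):
    # Sketch of Step 2: radial projection onto the unit sphere (Eq. (5)),
    # assuming all vertices are visible from the centroid.
    V = V - V.mean(axis=0)
    return V / np.linalg.norm(V, axis=1, keepdims=True)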
Fig. 3. Vertex insertion operation: (a) initial configuration; (b) polygon subdivision; (c) set of valid positions; (d) final retained position and the new configuration
4 Experimental Results In order to validate the proposed algorithm we have considered a set of 8 closed, manifold, triangular mesh models (Fig. 4) from the Princeton Shape Benchmark (http://shape.cs.princeton.edu/benchmark/) and MPEG-7 databases. The retained models are characterized by various types of geometries, complexities and shapes.
Fig. 4. Retained test 3D models: Body (14603 vertices), Hand (25001), Lyon (956), Alien (16266), Face (17358), Dino (16995), Rabbit (453), Horse (19851)
Figs. 5 to 7 present some of the results obtained and illustrate the various intermediate steps involved. As can be observed, the achieved spherical parameterizations yield, in all cases, valid embeddings which preserve the models' shape well.
Fig. 5. Meshes obtained for the “Horse” model: (a) original model (19851 vertices); (b) simplified model (600 vertices); (c), (d), (e), (f) successively flattened models; (g) spherical parameterization (simplified model); (h) final spherical parameterization
Fig. 6. Meshes obtained for the “Dino” model: (a) original model (16995 vertices); (b) simplified model (700 vertices); (c), (d), (e), (f) successively flattened models; (g) spherical parameterization (simplified model); (h) final spherical parameterization
Fig. 7. Meshes obtained for the “Body” model: (a) original model (14603 vertices); (b) simplified model (700 vertices); (c), (d), (e), (f) successively flattened models; (g) spherical parameterization (simplified model); (h) final spherical parameterization
In order to objectively establish the performances of the proposed method, we have retained as evaluation criteria the angular and area distortions [24], respectively denoted by AD and SD, and defined as follows:

AD = (1 / 3F) Σ_{i=1}^{F} Σ_{j=1}^{3} | α_{ij} − α'_{ij} |,   (6)

SD = Σ_{i=1}^{F} | A(T_i) / Σ_{j=1}^{F} A(T_j) − A(T'_i) / Σ_{j=1}^{F} A(T'_j) |.   (7)
Here, α_{ij} and A(T_i) denote respectively the j-th angle and the area of a triangle T_i of the original mesh, while α'_{ij} and A(T'_i) denote the j-th angle and the area of the same triangle in the parametric domain. Table 1 synthesizes the results of our Gaussian-curvature-based spherical parameterization method, compared with the techniques proposed by Alexa [18] and Praun et al. [19].

Table 1. Area and angular distortions obtained

Name     No. V    Proposed AD / SD    Alexa [18] AD / SD    Praun et al. [19] AD / SD
Body     14603    0.454 / 1.417       0.793 / 1.431         0.651 / 1.452
Lyon     956      0.371 / 1.174       0.512 / 1.388         0.445 / 1.413
Hand     25001    0.353 / 1.126       Overlapping           0.573 / 1.538
Face     17358    0.347 / 0.576       0.456 / 0.933         0.521 / 0.775
Horse    19851    0.391 / 1.194       0.803 / 1.636         0.637 / 1.726
Rabbit   453      0.311 / 0.682       0.362 / 0.891         0.364 / 0.782
Alien    16266    0.368 / 1.288       Overlapping           0.872 / 1.673
Dino     16995    0.384 / 1.417       0.962 / 1.552         0.896 / 1.728
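The distortion measures of Eqs. (6)-(7) can be computed as in the following sketch, assuming the original and parameterized meshes share the same triangle list F; the function names are illustrative.

import numpy as np

def angle(a, b, c):
    # Angle at vertex b of triangle (a, b, c).
    u, v = a - b, c - b
    return np.arccos(np.clip(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)), -1.0, 1.0))

def tri_angles(V, F):
    return np.array([[angle(V[i], V[j], V[k]),
                      angle(V[j], V[k], V[i]),
                      angle(V[k], V[i], V[j])] for i, j, k in F])

def tri_areas(V, F):
    return np.array([0.5 * np.linalg.norm(np.cross(V[j] - V[i], V[k] - V[i])) for i, j, k in F])

def angular_distortion(V, Vp, F):
    # AD, Eq. (6): mean absolute angle difference between original and parameterized mesh.
    return np.abs(tri_angles(V, F) - tri_angles(Vp, F)).sum() / (3 * len(F))

def area_distortion(V, Vp, F):
    # SD, Eq. (7): L1 difference of normalized triangle areas.
    a, ap = tri_areas(V, F), tri_areas(Vp, F)
    return np.abs(a / a.sum() - ap / ap.sum()).sum()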
The results show that the proposed method provides superior performance in terms of both angular and area distortions, with gains of 36.72% and 19.04% respectively when compared to Alexa's method, and gains of 35.85% for angular distortion and 19.55% for area distortion compared with Praun's method. Regarding the processing requirements, the proposed algorithm is slightly slower than the other two approaches. This is due to the iterative computation of the Gaussian curvature. In contrast, the approach proposed by Alexa projects the vertices directly onto the sphere, employing simple vertex normalization operations and different relaxation processes. However, Alexa's technique does not guarantee a valid embedding for all models. Concerning Praun's method, the mesh simplification process is performed down to a simple tetrahedron, which can be directly projected onto the unit sphere. Despite the optimization procedure employed when re-inserting the initial mesh vertices, the resulting distortions are more important in that case. This shows the interest of stopping the simplification process with the help of a geometric distortion criterion. The role of the Gaussian-curvature-driven mesh flattening phase, which makes it possible to directly project the simplified model onto the unit sphere, is here fundamental. Finally, as a key feature of the proposed method, let us mention its completely automatic nature: the algorithm does not require any human intervention.
5 Conclusions and Perspectives In this paper, we have proposed a novel spherical parameterization method which exploits a Gaussian curvature-driven, iterative flattening scheme. The algorithm
automatically detects salient mesh vertices and locally flattens them, until a sphere-like surface adapted for direct spherical mapping is obtained. The experimental evaluation, carried out on a set of eight 3D models with various shapes and complexities, shows that the proposed method makes it possible to reduce both angle and area distortions by more than 35% and 19%, respectively. In all cases, valid embeddings have been obtained. Our future work firstly concerns the improvement of the proposed method with the integration of an additional optimization procedure within the vertex split phase, aiming at minimizing the geometric distortions. In a second stage, we plan to apply the obtained parameterizations to mesh morphing. In this case, the Gaussian curvature flattening procedure can be exploited to minimize the amount of human interaction required for establishing correspondences between source and target 3D models.
References
1. Haker, S., Angenent, S., Tannenbaum, A., Kikinis, R., Sapiro, G., Halle, M.: Conformal Surface Parameterization for Texture Mapping. IEEE Transactions on Visualization and Computer Graphics 6(2), 181–189 (2000)
2. Piegl, L.A., Tiller, W.: Parametrization for Surface Fitting in Reverse Engineering. Computer-Aided Design 33(8), 593–603 (2001)
3. Biermann, H., Martin, I., Bernardini, F., Zorin, D.: Cut-and-Paste Editing of Multiresolution Subdivision Surfaces. ACM Trans. Graphics 21(3), 312–321 (2002)
4. Smith, J.P., Boier-Martin, I.M.: Parameterization for Remeshing Over Dynamically Changing Domains. In: Proc. SMI (2006)
5. Alliez, P., Gotsman, C.: Recent Advances in Compression of 3D Meshes. Advances in Multiresolution for Geometric Modelling, 3–26 (2005)
6. Zhu, Z.J., Pang, M.Y.: Morphing 3D Mesh Models Based on Spherical Parameterization. In: International Conference on Multimedia Information Networking and Security, pp. 313–313 (2009)
7. Desbrun, M., Meyer, M., Alliez, P.: Intrinsic Parameterizations of Surface Meshes. Computer Graphics Forum 21, 210–218 (2002)
8. Floater, M.: Mean Value Coordinates. Computer Aided Geometric Design 20, 19–27 (2003)
9. Friedel, I., Schröder, P., Desbrun, M.: Unconstrained Spherical Parameterization. In: ACM SIGGRAPH Technical Sketches. ACM, New York (2005)
10. Sorkine, O., Cohen-Or, D., Goldenthal, R., Lischinski, D.: Bounded Distortion Piecewise Mesh Parametrization. IEEE Visualization, 355–362 (2002)
11. Sheffer, A., Hart, J.: Inconspicuous Low-Distortion Texture Seam Layout. IEEE Visualization, 291–298 (2002)
12. Pinkall, U., Polthier, K.: Computing Discrete Minimal Surfaces and Their Conjugates. Experiment Mathematics 2, 15–36 (1993)
13. Floater, M.S., Hormann, K.: Surface Parameterization: a Tutorial and Survey. In: Advances in Multiresolution for Geometric Modelling, pp. 157–186. Springer, Heidelberg (2005)
14. Sheffer, A., Gotsman, C., Dyn, N.: Robust Spherical Parameterization of Triangular Meshes. In: Proceedings of 4th Israel-Korea Binational Workshop on Computer Graphics and Geometric Modeling, Tel Aviv, pp. 94–99 (2003)
15. Kharevych, L., Springborn, B., Schroder, P.: Discrete conformal mappings via circle patterns. ACM Transactions on Graphics, vol 25(2), 412–438 (2006) 16. Ben-Chen, M., Gotsman, C., Bunin, G.: Conformal Flattening by Curvature Prescription and Metric Scaling. Computer Graphics Forum 27(2), 449–458 (2008) 17. Kent, J.R., Carlson, W.E., Parent, R.E.: Shape Transformation for Polyhedral Objects. Computer Graphics, SIGGRAPH Proceedings 26(2), 47–54 (1992) 18. Alexa, M.: Recent Advances in Mesh Morphing. Computer Graphics Forum 21, 173–196 (2002) 19. Praun, E., Hoppe, H.: Spherical Parametrization and Remeshing. ACM Transactions on Graphics 22(3) (2003) 20. Hoppe, H.: Mesh Optimization. In: Proceedings of ACM SIGGRAPH, pp. 19–26 (1993) 21. Garland, M., Heckbert, P.S.: Surface Simplification Using Quadric Error Metrics. In: 24th Annual Conference on Computer Graphics and Interactive, pp. 209–216 (1997) 22. Jolliffe, I.T.: Principal Component Analysis, 2nd edn. Springer, New York (2002) 23. Xu, Z., Xu, G.: Discrete Schemes for Gaussian Curvature and Their Convergence. Computers & Mathematics with Applications 57, 1187–1195 (2009) 24. Yoshizawa, S., Belyaev, A., Seidel, H.P.: A fast and simple stretch minimizing mesh parameterization. In: International Conference on Shape Modeling and Applications, Genova, Italy, pp. 200–208 (2004)
Segmentation and Visualization of Multivariate Features Using Feature-Local Distributions
Kenny Gruchalla 1,2, Mark Rast 2,3, Elizabeth Bradley 2, and Pablo Mininni 4,3
1 National Renewable Energy Laboratory, Golden, Colorado
2 University of Colorado, Boulder, Colorado
3 National Center for Atmospheric Research, Boulder, Colorado
4 Universidad de Buenos Aires, Argentina
Abstract. We introduce an iterative feature-based transfer function design that extracts and systematically incorporates multivariate feature-local statistics into a texture-based volume rendering process. We argue that an interactive multivariate feature-local approach is advantageous when investigating ill-defined features, because it provides a physically meaningful, quantitatively rich environment within which to examine the sensitivity of the structure properties to the identification parameters. We demonstrate the efficacy of this approach by applying it to vortical structures in Taylor-Green turbulence. Our approach identified the existence of two distinct structure populations in these data, which cannot be isolated or distinguished via traditional transfer functions based on global distributions.
1 Introduction
We describe an iterative analysis and visualization technique that allows domain experts to interactively segment, group, and investigate individual multivariate volumetric structures by incorporating feature-local statistics into a multivariate transfer function. The opacity component of a volume-rendering transfer function is used as an initial threshold to create a binary segmentation of the data volume. Individual structures are then identified through a connected-component analysis, and variable distributions are calculated in the spatial extents of the individual structures. Through an interactive table, structures can be filtered and selected based on their local statistical properties (e.g., central moments of the feature-local distributions), which are then added to the volume-rendering transfer function as an additional dimension. Users can selectively iterate various stages of this process, allowing them to progressively refine the visualization, improve their understanding of the multivariate properties unique to each structure, and investigate the correlations between variables across localities at multiple scales. As a proof of concept, this tool was applied to data from a simulation of forced Taylor-Green turbulence [1]. Analysis of the high-vorticity structures in the Taylor-Green turbulence data volume by the means described in this paper revealed distinct vortical structure populations with quite different flow properties. These populations cannot be isolated or distinguished using a traditional multidimensional transfer function based on the global distributions of flow quantities.
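A minimal sketch of the segmentation stage just described (opacity threshold followed by connected-component labelling) is shown below in Python with NumPy/SciPy. The language, the library choices, and the 64.8 default threshold taken from the case study in Section 4 are illustrative assumptions; the actual implementation lives inside VAPOR (Section 3).

import numpy as np
from scipy import ndimage

def segment_features(field, threshold=64.8):
    """Binary segmentation by an opacity-style threshold (e.g. on vorticity
    magnitude), followed by connected-component labelling of the opaque
    voxels.  Returns a label volume (0 = background) and the feature count."""
    mask = field >= threshold
    labels, n_features = ndimage.label(mask)
    return labels, n_features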
2 Related Work
In general, the goal of feature-based visualization is to extract physically meaningful structures from the data, showing only those features that are of interest to the researcher. Feature-based visualization of turbulent flow typically targets physical characteristics of features using either image processing [2,3,4,5] or topological analysis [6,7,8]. Post et al. [9] have provided an extensive survey of feature-based visualization techniques as applied to flow data. Our extraction approach is based on an image-processing technique, where the physical characteristics of interest are defined by a multidimensional opacity function and the resulting connected opaque structures are extracted using a connected-component labelling algorithm [10]. There is considerable literature in the field of multivariate visualization. A comprehensive review is beyond the scope of this paper. Wong and Bergeron [11] provide a survey of techniques appropriate for the visualization of abstract multivariate data. Bürger and Hauser [12] provide a survey of visualization techniques specifically for multivariate scientific data, with an emphasis on volumetric data. One common approach is to visualize the relationships or correlations between multiple variables, as our technique does. Local statistical complexity has been used to identify regions of multivariate data that communicate large amounts of information [13]. Multi-field graphs, which give a useful overview of correlation fields between variables, can help guide the selection of promising correlations [14]. Multidimensional transfer functions were introduced to investigate and exploit these kinds of correlations [15]. Park et al. [16] applied this in the context of turbulence data, incorporating flow field properties such as velocity, curl, helicity, and divergence into multidimensional transfer functions to visualize features in flow fields. Our incorporation of multiple fields is different from these approaches in several ways. We provide the user the ability to interactively investigate local multivariate relationships. The locality can be constrained to the currently visualized space or applied systematically to each individual connected feature. These localities are defined by the opacity component of an initial transfer function, which can support multiple dimensions. The result is a system that can investigate and visualize the multivariate relationships between a spectrum of scales, which is particularly important when investigating phenomena believed to operate across multiple scales, such as turbulence.
global-data histograms and/or scatter plots to provide statistical context. The use of localized statistics is increasingly common in interactive volume visualization. The Contour Spectrum technique, which is used to determine isovalues in unstructured meshes, incorporates a variety of isosurface metrics (e.g., surface area, volume, and mean gradient magnitude) into the selection interface [17]. Tenginakai et al. [18] use localized higher-order central moments based on a local kernel to find isosurface boundaries. Correa and Ma [19] use a multi-scale analysis to incorporate a derived scale field that represents the relative size of the local features at each voxel into a multidimensional transfer function. Lundström et al. [20] introduced a partial range histogram – computed locally, over a block-partition of space – to help distinguish tissue boundaries in medical data. Our use of local statistics is novel in two ways. First, the locality is based on the geometry of extracted structures, which is user-defined through a multidimensional transfer function. Second, we provide systematic access to the local statistical measures and to their distributions for each extracted structure. The user can focus the visualization and analysis by employing the central moments of these local distributions to sort, filter, and select the individual structures. In this way, we generalize the concept of spatial locality to include both structural and physical properties of the flow, thereby facilitating effective interactive feature definition, identification, and property extraction.
3 Methods and Implementation
To evaluate the utility of the techniques described in this paper in the analysis of turbulent structures, we have implemented them in the open-source VAPOR [21,22] visualization and analysis environment, which is a widely deployed toolset in the turbulence community that specifically targets time-varying, multivariate, large-scale data sets. Our technique can be divided into four steps: selection, clustering, attribute calculation, and quantization. The selection step is used to create a binary segmentation, denoting which voxels are isolated from the original data. We approach the selection step by thresholding, using the opacity contribution from a multidimensional transfer function. In addition to providing a facility to visualize correlations between multiple variables, multidimensional transfer functions are far more expressive than traditional one-dimensional transfer functions, allowing the extraction of features that have overlapping data values in one dimension [23]. To avoid the exponential memory requirements of the general n-dimensional approach, we decompose the transfer function into n × m one-dimensional transfer functions, similar to the separable transfer function concept discussed by Kniss et al. [24]. We also handle the user interface issues of interacting with an n-dimensional transfer function via a decomposition approach, allowing the user to interact with either n one-dimensional interfaces or (n² − n)/2 two-dimensional interfaces. Transfer function widgets are used to specify the function. Multiple widgets can be created by the user to isolate multiple regions in n-dimensional space. When selected, widgets are visually linked across the decomposition, relaying each one's contribution in n-dimensional space.
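The separable decomposition can be read as combining independent one-dimensional opacity widgets into a single n-dimensional opacity. The sketch below assumes a multiplicative combination and NumPy; the paper does not state the exact combination rule or implementation language, so treat both as assumptions.

import numpy as np

def separable_opacity(samples, widgets):
    """Combine n one-dimensional opacity widgets into an n-dimensional opacity.
    samples : (N, n) array of per-voxel variable values
    widgets : list of n callables, each mapping a 1-D array to opacities in [0, 1]
    A product combination is assumed here."""
    alpha = np.ones(samples.shape[0])
    for dim, widget in enumerate(widgets):
        alpha *= widget(samples[:, dim])
    return alpha

# Illustrative widgets: a box on vorticity magnitude (threshold from Section 4)
# and a box on normalized helicity (the [-0.2, 0.2] range is arbitrary).
widgets = [lambda v: (v >= 64.8).astype(float),
           lambda h: ((h >= -0.2) & (h <= 0.2)).astype(float)]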
The clustering step classifies the selected voxels into coherent regions. We have chosen a connected-component labelling scheme as our clustering step. For the purposes of labelling, we treat our data as a binary volume, using the opacity function from our decomposed n-dimensional transfer function as a threshold to define voxels as transparent or opaque. The connected-component algorithm is then used to identify and uniquely label regions of connected opaque voxels. In the next step, the attributes of the clustered regions are calculated. The attributes we are concerned with are feature-local statistics, i.e., data variable distributions bounded to the spatial extents of the individual clusters and their central moments. The user can interactively select field variables from the data set or import derived fields directly from external analysis packages, which can be linked to the data volume by metadata descriptors [21]. The distribution local to each structure is calculated on each of these fields, and the corresponding feature-local histogram is presented in an interactive table of structure attributes (see Figure 1). Subsets of interest can be interactively selected through sorting, filtering, and selecting the structures based on their attributes (e.g., their volume and the central moments of their local distributions). The quantization step incorporates this filtered data into the transfer function as an additional dimension, providing an interactive mechanism to explore the structures.
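The attribute-calculation step can be sketched as follows: for each labelled feature, restrict a chosen field to the feature's voxels and record its histogram and central moments. This is a minimal NumPy/SciPy illustration, not VAPOR's implementation; the attribute names, bin count, and value range are assumptions.

import numpy as np
from scipy import stats

def feature_attributes(labels, n_features, field, bins=64, value_range=(-1.0, 1.0)):
    """Per-feature attribute table: voxel count, feature-local histogram, and
    central moments (mean, variance, skewness, kurtosis) of 'field' restricted
    to each labelled feature."""
    rows = []
    for fid in range(1, n_features + 1):
        values = field[labels == fid]
        hist, _ = np.histogram(values, bins=bins, range=value_range)
        rows.append({
            "feature": fid,
            "voxels": int(values.size),
            "histogram": hist,
            "mean": float(values.mean()),
            "variance": float(values.var()),
            "skewness": float(stats.skew(values)),
            "kurtosis": float(stats.kurtosis(values)),
        })
    return rows

Sorting and filtering this table by, for example, mean helicity or voxel count mirrors the interactive selection described above.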
4 Results and Discussion
We demonstrate the power of this design by using it to analyze three-dimensional data from an incompressible forced Taylor-Green turbulence simulation [1]. We examined two flow properties: flow vorticity and helicity. Vorticity is the pointwise curl of the velocity field, ω = ∇ × v. Helicity is the cosine of the angle between the velocity and vorticity vectors, a scalar quantity, Hn = (v · ω) / (|v||ω|). Relative to the size of the volume, these Taylor-Green data are characterized by many small vortical features, and when applied to the full volume, a threshold of high vorticity magnitude, |ω| ≥ 64.8, isolates over 120,000 distinct regions of strong vorticity (see Figure 2a). The global histogram of helicity (see Figure 3a) shows a nearly uniform distribution, indicating that all values of the angle between the velocity and vorticity vectors occur in the volume with similar frequency. A visualization-dependent histogram, restricted to areas of high vorticity magnitude, shown in Figure 3b, yields similar insights with a nearly uniform helicity distribution. However, an examination of feature-local helicity distributions shows that in any one vortical structure (defined by this same vorticity threshold) the distributions do not exhibit this uniformity! In fact, we find that individual regions of high vorticity magnitude can be distinguished by unique helicity distribution signatures, as shown in Figure 1. Three distinct populations are apparent from the set of feature-local helicity histograms: wide noisy distributions (see row #7 in Figure 1), distributions trending toward high absolute values of helicity (see rows #11 & #12 in Figure 1), and distributions that peak near low values of helicity (see row #4 in Figure 1). By isolating structures using our transfer function, we were able to determine that the wide noisy distributions belong to tangled structures (see Figure 2b) whose
Fig. 1. Structure analysis showing histograms representing the local distributions of normalized helicity and velocity magnitude on each structure. The user can incorporate any data field or derived field into this analysis. In addition to the local distribution, the central moments of the distributions are calculated and can be used to sort, filter, select, and classify the individual structures. Multiple fields can be analyzed, displaying the relationships between the local distributions. Notice the helicity distributions of individual connected regions of high vorticity magnitude have very distinctive signatures. For example, compare rows #4 and #13. Both have relatively narrow distributions, but row #4 has a mean near zero and row #13 a mean near one, suggesting unique flow dynamics in the two regions.
components were not well separated by the initial opacity threshold used in the selection step. Distributions that peak near minimum and maximum absolute helicity values correspond to crisp isolated tube-like structures. Those with low absolute helicity are dominated by flows with nearly orthogonal velocity and vorticity vectors. Those with high absolute helicity are dominated by flows with nearly parallel velocity and vorticity vectors, oriented either in the same (positive values) or opposite (negative values) directions. Volume renderings of the vorticity of the two populations are qualitatively indistinguishable – the features are of similar sizes and shapes. However, streamlines seeded within these regions highlight their differences (see Figure 4). In the regions dominated by low helicity values, streamlines follow the vortex structure in a very tight, compact winding. These vortical structures are dominated by twist. By contrast, velocity
Fig. 2. Progressive refinement through selective iteration: a) a transfer function is used to isolate regions of strong vorticity (shown with a visualization dependent histogram of helicity); b) one of the connected regions is selected, isolated, and visualized (shown with the feature-local histogram of helicity); c) the opacity threshold on the vorticity axis of the transfer function is increased and then sub-structures of the previously isolated “super-structure” are isolated (shown with feature-local histograms of helicity).
Fig. 3. Histograms of helicity show a nearly uniform distribution of angles between the velocity and vorticity vectors. a) Helicity over the entire data set. b) Helicity values in regions of high vorticity.
streamlines seeded in the regions of predominantly high helicity follow the vortex structure in loose open windings. These vortical structures are dominated by writhe. While both of these regions have high helicity values along the core, the twisting feature is dominated by low helicity values (see Gruchalla et al. [22] for further analysis). By iterating our visualization and analysis pipeline, we can further deconstruct the complex features with the broad noisy helicity distributions into substructures and investigate their individual local distributions (see Figure 2). After we have identified a “super-structure” by its wide noisy helicity distribution, we select and visualize that structure in isolation. Then, by increasing the vorticity threshold, we begin to separate the individual sub-structures within the isolated super-structure. A second iteration of the connected-component analysis, focused on this region, identifies these individual structures. Finally, by computing and visualizing the individual feature-local histograms, we again see that once the vorticity threshold is set to isolate tube-like regions, these regions have unique helicity distributions (see Figure 2c).
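The vorticity magnitude and normalized helicity analyzed above are derived fields of the simulation's velocity. A minimal sketch of their computation on a uniform grid with central differences is given below; the axis ordering, the grid spacing, and the use of NumPy are assumptions (the paper imports such derived fields into VAPOR rather than defining their computation).

import numpy as np

def vorticity_and_helicity(u, v, w, dx=1.0):
    """Pointwise vorticity omega = curl(velocity) and normalized helicity
    Hn = (velocity . omega) / (|velocity| |omega|), via central differences.
    Axis ordering is assumed to be (x, y, z) with uniform spacing dx."""
    du_dx, du_dy, du_dz = np.gradient(u, dx)
    dv_dx, dv_dy, dv_dz = np.gradient(v, dx)
    dw_dx, dw_dy, dw_dz = np.gradient(w, dx)
    wx = dw_dy - dv_dz                      # omega_x = dw/dy - dv/dz
    wy = du_dz - dw_dx                      # omega_y = du/dz - dw/dx
    wz = dv_dx - du_dy                      # omega_z = dv/dx - du/dy
    vort_mag = np.sqrt(wx**2 + wy**2 + wz**2)
    vel_mag = np.sqrt(u**2 + v**2 + w**2)
    hn = (u * wx + v * wy + w * wz) / np.maximum(vel_mag * vort_mag, 1e-12)
    return vort_mag, hn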
Fig. 4. Two volume and stream-line renderings accompanied by their feature-local histograms of helicity, vorticity magnitude, and velocity magnitude. Left: Streamlines in a feature isolated by low absolute mean helicity follow the structure in a tight winding. This type of structure is dominated by twist. Right: Streamlines seeded in a feature isolated by high absolute mean helicity follow the structure in a loose open helix. This type of structure is dominated by writhe.
4.1 Result Comparison
A comparison of our results to a standard two-dimensional transfer function based on global distributions demonstrates the utility of incorporating feature-local information into the transfer function. Visualizing regions of high vorticity with high or low helicity using a standard two-dimensional transfer function isolates a different set of points than our transfer function and generates a significant amount of noise. Some individual structures, as is clear from Figure 1, have helicity distributions that are dominated by high or low values but have tails representing the full range of helicity values. A two-dimensional transfer function that renders regions of high vorticity and low helicity will render all the low-helicity voxels, including those contained in structures dominated by high helicity values. For example, the twisting feature shown in Figure 4 was isolated with |ω| ≥ 98.27 and has a local helicity distribution with a mean of -0.137 and a standard deviation of 0.281. We can attempt to isolate this feature using a two-dimensional transfer function with the same opacity threshold on |ω| and a Gaussian opacity function on helicity with μ = −0.137, σ = 0.281, as shown in Figure 5. The result does visualize portions of the feature in question; however, it also visualizes portions of 7,395 other features that overlap those opacity settings! Those 7,395 visualized regions include regions of twisting and writhing, as well as regions that do not appear to constitute a vortex. This is because the two-dimensional transfer function is non-spatial: all areas of strong vorticity and low helicity are being visualized, without regard to local topology.
Fig. 5. Comparison of a feature-local two-dimensional transfer function versus a standard two-dimensional transfer function. Left: a volume rendering of a twisting feature, isolated using a global vorticity magnitude transfer function coupled with a featurelocal helicity transfer function. The feature-local helicity distribution has a mean of -0.137 and a standard deviation of 0.281. Right: A volume rendering using a standard two-dimensional (vorticity magnitude versus helicity) transfer function with a Gaussian opacity function in the helicity dimension with μ = −0.137, σ = 0.281. The portions of the twisting feature shown on the right are visualized; however, so are the absolute low values of helicity from 7,395 regions, including regions near writhing vortices and regions without apparent vortices.
Similarly, high helicity values contained in structures dominated by low helicity would not be rendered, fragmenting what our analysis finds to be complete and physically sensible structures. Our new transfer function framework provides a fundamentally different approach: both local and global information are incorporated into the visualization selection process, enriching the transfer function design process with spatial information.
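The contrast between the two selection strategies compared above can be summarized in a short sketch: the standard global 2D transfer function opens every voxel matching the value ranges, whereas the feature-local variant only opens voxels belonging to features whose local statistics were selected in the attribute table. The parameter values are those quoted in the comparison; the combination rule and function names are illustrative assumptions.

import numpy as np

def global_2d_opacity(vort_mag, helicity, vort_thresh=98.27, mu=-0.137, sigma=0.281):
    """Global 2D transfer function: a hard threshold on vorticity magnitude
    combined with a Gaussian opacity profile in the helicity dimension.
    Every voxel in these ranges becomes opaque, regardless of which feature
    it belongs to."""
    return (vort_mag >= vort_thresh) * np.exp(-0.5 * ((helicity - mu) / sigma) ** 2)

def feature_local_opacity(vort_mag, labels, selected_ids, vort_thresh=98.27):
    """Feature-local alternative: the same vorticity threshold, but opacity is
    granted only to voxels inside features (connected-component labels) that
    were selected on the basis of their local helicity statistics."""
    selected = np.isin(labels, list(selected_ids))
    return (vort_mag >= vort_thresh) * selected.astype(float)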
5 Conclusion
In this paper, we have described a powerful selective visualization system that provides the user with facilities to interactively investigate multivariate relationships local to individual features. Our system incorporates imaging and visualization techniques. We introduced a systematic integration of feature-local histograms and statistics into the analysis and visualization pipeline. We provide the ability to progressively refine the features and, in turn, provide a method to investigate multivariate relationships at multiple scales. This capability is particularly useful when attempting to segment ill-defined features, such as those in turbulent flow simulations where no formal definitions for features or structures of interest exist. We have demonstrated that a versatile, multivariate, feature-local approach is advantageous when investigating ill-defined features. The reasoning behind this is two-fold. First, feature-local multivariate distributions may be distinctive where global distributions are not, providing insight into the feature properties. Second, this type of approach allows for the comparative investigation of feature-identification techniques by examining the sensitivity of the
structure properties to those techniques. We have demonstrated the power and the efficacy of our approach with an analysis of Taylor-Green turbulence, which resulted in the discovery of two populations of strong vorticity structures with distinct flow dynamics. These populations were discovered by a systematic examination of helicity histograms calculated local to structures defined as regions of high vorticity magnitude. These two feature populations were unlikely to be discovered by traditional visualization approaches that are based on histograms of global distributions, since helicity values are uniformly distributed across all values of vorticity magnitude in these data. In addition, our case study makes a strong case for visualization-driven analysis. These data are characterized by tens of thousands of vortical features, hindering traditional approaches to statistical overviews. Our system allowed interactive investigation of feature-local histograms on multiple data fields. When feature-local helicity histograms were visualized, unique populations of features were immediately apparent. Hypotheses on the nature of dynamics of these populations were then immediately and interactively tested by visualizing individual features and seeding streamlines within their boundaries.
References 1. Mininni, P.D., Alexakis, A., Pouquet, A.: Nonlocal interactions in hydrodynamic turbulence at high reynolds numbers: the slow emergence of scaling laws. Physical review. E, Statistical, nonlinear, and soft matter physics 77 (2008) 2. Silver, D., Zabusky, N.J.: Quantifying visualizations for reduced modeling in nonlinear science: Extracting structures from data sets. Journal of Visual Communication and Image Representation 4, 46–61 (1993) 3. Silver, D., Wang, X.: Tracking and visualizing turbulent 3d features. IEEE Transactions on Visualization and Computer Graphics 3, 129–141 (1997) 4. Ebling, J., Scheuermann, G.: Clifford convolution and pattern matching on vector fields. In: Proceedings of IEEE Visualization, pp. 193–200 (2003) 5. Heiberg, E., Ebbers, T., Wigstrom, L., Karlsson, M.: Three-dimensional flow characterization using vector pattern matching. IEEE Transactions on Visualization and Computer Graphics 9, 313–319 (2003) 6. Helman, J.L., Hesselink, L.: Representation and display of vector field topology in fluid flow data sets. Computer 22, 27–36 (1989) 7. Theisel, H., Weinkauf, T., Hege, H.C., Seidel, H.P.: Saddle connectors - an approach to visualizing the topological skeleton of complex 3d vector fields. In: Proceedings of IEEE Visualization, pp. 225–232. IEEE Computer Society, Los Alamitos (2003) 8. Scheuermann, G., Tricoche, X.: Topological methods for flow visualization. In: Hansen, C., Johnson, C. (eds.) Visualization Handbook. Academic Press, London (2005) 9. Post, F.H., Vrolijk, B., Hauser, H., Laramee, R.S., Doleisch, H.: The state of the art in flow visualisation: Feature extraction and tracking. Computer Graphics Forum 22, 775–792 (2003) 10. Suzuki, K., Horibia, I., Sugie, N.: Linear-time connected-component labeling based on sequential local operations. Computer Vision and Image Understanding 89, 1–23 (2003)
11. Wong, P.C., Bergeron, R.D.: 30 years of multidimensional multivariate visualization, pp. 3–33. IEEE Computer Society Press, Los Alamitos (1997) 12. Bürger, R., Hauser, H.: Visualization of multi-variate scientific data. In: Proceedings of EuroGraphics 2007 (State of the Art Reports), pp. 117–134 (2007) 13. Jänicke, H., Wiebel, A., Scheuermann, G., Kollmann, W.: Multifield visualization using local statistical complexity. IEEE Transactions on Visualization and Computer Graphics 13, 1384–1391 (2007) 14. Sauber, N., Theisel, H., Seidel, H.P.: Multifield-graphs: An approach to visualizing correlations in multifield scalar data. IEEE Transactions on Visualization and Computer Graphics 12, 917–924 (2006) 15. Kniss, J., Kindlmann, G., Hansen, C.: Multidimensional transfer functions for interactive volume rendering. IEEE Transactions on Visualization and Computer Graphics 8, 270–285 (2002) 16. Park, S.W., Budge, B., Linsen, L., Hamann, B., Joy, K.I.: Multi-dimensional transfer functions for interactive 3d flow visualization. In: Proceedings of the 12th Pacific Conference on Computer Graphics and Applications (PG 2004), pp. 177–185. IEEE Computer Society, Los Alamitos (2004) 17. Bajaj, C.L., Pascucci, V., Schikore, D.R.: The contour spectrum. In: Proceedings of the 8th Conference on Visualization 1997, pp. 167–175. IEEE Computer Society Press, Los Alamitos (1997) 18. Tenginakai, S., Machiraju, R.: Statistical computation of salient iso-values. In: Proceedings of the Symposium on Data Visualisation 2002, pp. 19–24. Eurographics Association, Barcelona (2002) 19. Correa, C., Ma, K.L.: Size-based transfer functions: A new volume exploration technique. IEEE Transactions on Visualization and Computer Graphics 14, 1380–1387 (2008) 20. Lundström, C., Ljung, P., Ynnerman, A.: Local histograms for design of transfer functions in direct volume rendering. IEEE Transactions on Visualization and Computer Graphics 12, 1570–1579 (2006) 21. Clyne, J., Mininni, P.D., Norton, A., Rast, M.: Interactive desktop analysis of high resolution simulations: application to turbulent plume dynamics and current sheet formation. New Journal of Physics 9 (2007) 22. Gruchalla, K., Rast, M., Bradley, E., Clyne, J., Mininni, P.: Visualization-driven structural and statistical analysis of turbulent flows. In: Adams, N.M., Robardet, C., Siebes, A., Boulicaut, J.-F. (eds.) IDA 2009. LNCS, vol. 5772, pp. 321–332. Springer, Heidelberg (2009) 23. Kindlmann, G., Durkin, J.W.: Semi-automatic generation of transfer functions for direct volume rendering. In: Proceedings of IEEE Visualization, pp. 79–86 (1998) 24. Kniss, J., Premoze, S., Ikits, M., Lefohn, A., Hansen, C., Praun, E.: Gaussian transfer functions for multi-field volume visualization. In: Proceedings of IEEE Visualization, pp. 497–504. IEEE Computer Society, Los Alamitos (2003)
Magic Marker: A Color Analytics Interface for Image Annotation Supriya Garg, Kshitij Padalkar, and Klaus Mueller Stony Brook University
Abstract. This paper presents a system that helps users by suggesting appropriate colors for inserting text and symbols into an image. The color distribution in the image regions surrounding the annotation area determines the colors that make a good choice – i.e. capture a viewer’s attention, while remaining legible. Each point in the color-space is assigned a distance-map value, where colors with higher values are better choices. This tool works like a “Magic Marker” giving users the power to automatically choose a good annotation color, which can be varied based on their personal preferences. Keywords: Human-computer interaction, Color-space.
1 Introduction Annotation of an image with text is a commonly encountered task. It is a process people use to create art, advertisements, presentations, or educational tools. When using popular image-processing tools like Photoshop, people usually rely on trial and error before arriving at a suitable choice. None of these tools guide users by pointing out an appropriate set of colors; the most common method is to choose colors via the HSV color wheel. This paper presents a solution to this annotation problem by designing a user interface that guides the user in this color-choosing task. We build an intuitive user interface which can be easily used by a non-expert, and base our calculations on well-known color-perception paradigms. Our framework captures known color design rules to form a grading scheme. This grading scheme, along with user preferences, can help derive appropriate colorizations. The annotation problem includes problems like layering, highlighting, and blending with the background. In our current work, we focus on selecting a color for text. Text has the interesting property that it has a high level of detail, yet is familiar to people at the same time. We focus on legibility, i.e., each letter should be clearly visible and recognizable on its own. This is different from enabling people to speed-read, since in that process users tend to use their mental model to complete illegible letters. Studies on the readability or legibility of text, given a foreground-background color combination, show that subjective opinion is frequently based on aesthetic and stereotypic presumptions and may thus differ from objectively measured performance [5]. This tells us that it is better to have an interactive tool which guides the user towards a good sub-space; the user has to make the final choice.
Fig. 1. The flowchart shows how the Magic Marker system calculates and updates the preference map (p-map) once the user loads an image into it
The large body of work on legibility by web-page designers provides a wealth of information regarding the importance of luminance contrast and chromatic contrast. A good review of past studies on readability and legibility on posters and CRT displays is presented in the paper by Humar et al. [6]. Most results suggest that luminance contrast accounts for most of the variance in typical legibility experiments. Experiments by Travis et al. [15] show that when the luminance contrast between text and background color was 0, near-perfect reading was still possible. This important finding means that purely chromatic differences may be sufficient for the visual system to maintain word identification. They explained the previous results by noting that typical displays produce much larger multiples of threshold luminance contrast than of threshold chromatic contrast. Experiments on reading speed on a color monitor [8] found that when both color and luminance contrast are present, there is no sign of additive interaction, and performance is determined by the form of contrast yielding the highest reading rate. Studies of legibility on multi-color CRT displays [13] confirm this to some extent by saying that chromaticity contrast and luminance contrast are additive only under specific conditions. However, their results give more importance to luminance contrast by saying that chromaticity contrast can neither improve legibility if an acceptable level of luminance contrast is already present, nor substitute for luminance contrast.
Even today, color difference (ΔE) equations assume that luminance and chromatic differences are additive — they usually use a weighted Euclidean distance approach. This clearly shows the lack of consensus on the combined influence of luminance contrast and chromatic contrast on legibility. Our tool is therefore meant to be used interactively to quickly obtain a choice of optimal label colors from which users can choose one according to their preferences. We encode rules from color perception and legibility studies to arrive at the optimal colors. The rest of the paper is organized as follows. Section 2 looks at some related work. We discuss the contributions of our work in Section 3. In Section 4, we discuss the theory behind our distance calculation and interface design. In Section 5, we discuss the working details of the Magic Marker. In Section 6 we look at some results. Finally, we conclude in Section 7 and discuss ways to apply this tool in different applications.
2 Related Work In this section we look at some of the related work. First, we look at how color can be represented to closely imitate the way humans perceive it. Next, we look at some methods that study the interaction of colors. These papers concentrate on two aspects – generating color maps for mapping data to color, and designing interfaces where users can directly manipulate the colors in their visualization. The Munsell color chart is used to evaluate the perceptual qualities of color spaces. Munsell describes color in terms of hue, value, and chroma; hue corresponds to dominant wavelength, value to brightness, and chroma to colorfulness. Unlike saturation, which is a statement of colorfulness as a fraction of the maximum possible for a given hue and value, chroma is an absolute measure of colorfulness. The maximum possible chroma differs among hues – for example, the maximum chroma for red is much greater than for green. The landmark texts by Itten [7] and Wong [19] provide great insight into the human perception of color and its aesthetic aspects. Much information is also available in books by Stone [14] and Ware [17]. Color mapping is the well-studied topic of mapping data points to color based on human perception, cognition, and color theory. PRAVDAColor [2] helps users select color maps for mapping data points to color in scientific visualization. The Color Brewer [3] contains expert-designed color palettes for mapping cartographic scalar data. These tools either take a lot of tweaking to come up with a suitable palette, or are pre-designed by experts. The work presented in [18] takes the expert out of the loop and generalizes this process. Relevant also is the interactive color palette tool proposed by Meier et al. [10], designed to support creative interactions for graphics designers. While designing color, we often want to ensure that the final image looks natural and aesthetically pleasing to the users. In [12] it was demonstrated that color maps should preserve a monotonic mapping in luminance since they are perceived as more natural by human observers. One popular design aspect is color harmony, which defines sets of colors that are aesthetically pleasing to human perception. In search of an intuitive 2D representation for visual designers, Itten arranged the harmonic colors into a color wheel, which flattened this color space to mostly variations in hue. This system was used by Cohen
et al. [4] to quantify the color harmony of an image and shift the hues towards a harmonic setting. Wang et al. [16] also use color-harmony-based rules in their framework to help users select colors for scene objects. The users can specify their hues of choice, while the system assists by making suggestions based on aesthetics and by optimizing the luminance and contrast channels. Neophytou and Mueller [11] develop a framework that allows users to manipulate colors directly in a 3D perceptual color space, instead of performing multiple iterations with 2D color manipulation tools. This system assists users in object highlighting and annotation in color-rich scenes.
Fig. 2. Illustrates the different components of the user-interface
3 Our Approach In this section, we discuss some of the shortcomings of the current state of the art, and how we address these issues in our work. Global vs. local effects: The work by Bauer tells us that as long as we pick a color outside the convex hull formed by the colors of a set of displayed data points, we can easily spot this target color. However, when we consider the case of a photographic image, an annotation occupies only a small portion of it. Therefore, the annotation color is highly influenced by the colors in the background surrounding it. For example, if the person standing next to you wears a combination of red and green, it will strike you as mismatched. However, if there are two people standing next to each other – one in red, and the other in green – the color clash is reduced, and as they move further apart, the clash slowly disappears. Also, when you look at Figure 2 (b), the outermost convex hull (CH3) occupies almost the entire color space, leaving little choice for the users. However, CH1, the convex hull formed by the colors surrounding the text, is
much smaller, and is the set which truly represents the colors we must avoid. We take this local vs. global effect into consideration by giving different weights to different regions of the image depending on their distance from the annotation. Conflicts in color mixing: The studies on the legibility and readability of web pages have shown us that there is no clear consensus on the interaction of luminance and chromatic contrast. Our system therefore offers an intuitive interface that assists users by providing informed choices to help them make the final decision. Color Space: Human vision is designed such that it is natural to describe colors as locations in a 3D space. The tristimulus values of a color are the amounts of three primary colors needed to match that test color. In the CIE 1931 color space, tristimulus values provide a complete color description; however, they do not correspond to the way humans perceive colors. Distances between tristimulus triples do not reflect the degree to which we perceive the colors to differ. Therefore, the CIE introduced two perceptual color spaces in 1976 – the CIELUV and CIELAB spaces. The CIELUV space was used in [18] to design color palettes. Though these spaces are designed so that distance in the color space is proportional to the difference perceived by humans, they do not arrange colors uniformly along the hue and chroma channels. The Munsell data set, as discussed in Section 2, is an example of a color space which separates luminance, hue, and chroma into three orthogonal axes. Therefore, for our calculations and user interface, we use a modified CIELAB space, such that it satisfies the properties of the Munsell color set.
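As a practical starting point, the loaded image can first be converted from RGB into CIELAB before the UP-Lab remapping is applied. The sketch below uses scikit-image for the conversion; the paper applies Lindbloom's ICC profile for the UP-Lab step, which is not reproduced here, and the choice of library is an assumption.

from skimage import color, io

def load_lab(path):
    """Load an RGB image and convert it to CIELAB (L* in [0, 100]).  The
    subsequent non-linear UP-Lab remapping is done through Lindbloom's ICC
    profile in the paper and is omitted from this sketch."""
    rgb = io.imread(path)
    if rgb.ndim == 3 and rgb.shape[2] == 4:   # drop an alpha channel if present
        rgb = rgb[:, :, :3]
    return color.rgb2lab(rgb)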
4 Overview The main task for our system is to find rules which assign legibility scores to each point in our 3D color space. In this section, we look at how these rules are derived. Minimum contrast for legibility: Web-page designers have long studied ways to choose a good foreground-background combination for maximum legibility. Though placing text on an image is far harder, since the background color is rarely uniform, we can still learn from their work. Since results from their studies show that a certain luminance contrast assures legibility, we must make sure that we achieve the minimum contrast with respect to most colors in the background. According to the studies by Maureen Stone [14], a contrast (ΔL) of 20 makes text legible; a contrast of 30 makes it easily readable; and a contrast of 60 makes it robustly readable. This result helps us assign scores to intensity levels based on luminance contrast. Perceptually uniform CIE-LAB space: We require a perceptually uniform color space for our system since this lends itself to an intuitive interface for the user. In CIE-Lab, colors at similar distances have similar perceptual differences; however, the space itself is not perceptually uniform. This is because colors sharing the same Lab hue angle do not always share the same apparent hue. For example, if you move inwards along a line joining a bright blue to the origin, the apparent hue changes to purple: this is the well-known “blue turns purple” problem in gamut
mapping. Furthermore, the angles between hues are not equal, and colors of constant chroma are not equidistant from the neutral axis. To rectify this, we use the non-linear mapping ICC profile provided by Bruce Lindbloom [9].
Fig. 3. The transform of the CIE-Lab space to UP-Lab space for the Munsell value 5
The color space ICC profile performs a nonlinear mapping of Lab so that all properties of the Munsell color set are met. This means that when CIE-Lab colors are transformed through this profile, the resulting Uniform Perceptual Lab (or "UP Lab") colors have these properties:
• All colors of constant Munsell hue have the same UP Lab hue angle.
• All Munsell hues are evenly distributed around the hue wheel.
• All colors of constant Munsell chroma (saturation) lie the same distance from the neutral axis, i.e., chroma rings are perfect circles centered on neutral.
• All chroma rings are equally spaced.
During this transformation, the L* channel and the neutral colors remain unaffected. An illustration of the transformation is shown in Figure 3. Convex hulls: Bauer et al. [1] have shown that, given a set of colored points laid out as a scatterplot, it is easier to find a target color that lies outside the convex hull (in the CIE-Lab space) of the source colors than one inside the convex hull. This holds both when the target is linearly separable in chromaticity only and when it is separable in a combination of luminance and chromaticity. This result gives us a metric which calculates the score for each point in the CIE-Lab space based on its distance from the convex hull. Theory: Based on the background work done in color perception and the legibility of text on multi-color screens, we form some ground rules which help us in designing the Magic Marker user interface. We use three orthogonal preference-maps (p-maps) to find the colors which lead to good legibility. Finally, we combine these values by taking a weighted sum to find the final preference-map value – the joint p-map. Luminance contrast: We need to make sure that there is a minimum amount of luminance contrast between the text and the background. Since the background is not uniform, we calculate the contrast based on the intensity histogram of the background. Our rule of thumb is that ΔL > 30 is sufficient for an ideal annotation color. Hue-based contrast: Luminance contrast alone is not sufficient to make a good labeling color. We need to make sure that the hue is also sufficiently distinct. For this purpose, we use the UP-Lab space to calculate our hue value. The hue is determined by the angle formed with the origin at a given intensity: hue = tan⁻¹(b/a). This is
similar to the LCHLab space as defined by Bruce Lindbloom. The further away a hue is from the hues in the background, the better it performs.
Fig. 4. Sliding window. We show how uniform intensity is determined by passing a window less than the size of a letter
Convex hulls: As we learn from Bauer et al. [1], it is easier to find a target color when it lies outside the convex hull formed by the distractor colors. This makes sure that we avoid the hue and chroma combinations present in the background. Also, since this space covers chroma, we do not create a separate preference-map for it. To give higher importance to local features, we divide the image into three nested boxes (see Figure 2 (e)): the annotation box (b1), which is the bounding box for the annotation; the container box (b2), which extends a small amount beyond b1; and finally the whole image (b3). Interface design: While designing the interface, we have to make sure that a user can explore the color space in an intuitive manner. Further, the user should be able to make small changes to the annotation color in a controlled fashion, i.e., change the luminance, chroma, or hue in small steps. So, for each intensity level, we display the corresponding colors, and to guide the users, we overlay circles to represent chroma and sectors to represent hues. The users can move along the sectors to get more saturated/de-saturated colors, and move along the circles to gradually change the hue.
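The hue definition and luminance-contrast thumb rule above can be made concrete with a short sketch. It assumes an image already in a Lab-like space (L, a, b channels) and NumPy; atan2 is used rather than tan⁻¹(b/a) so that the hue angle covers the full circle, and the per-pixel contrast check is a simplification of the histogram-based scoring described in Section 5.

import numpy as np

def hue_and_chroma(lab):
    """Per-pixel hue angle (degrees) and chroma from the (a, b) plane:
    hue = atan2(b, a), chroma = sqrt(a^2 + b^2)."""
    a, b = lab[..., 1], lab[..., 2]
    hue = np.degrees(np.arctan2(b, a)) % 360.0
    chroma = np.hypot(a, b)
    return hue, chroma

def luminance_contrast_ok(lab_region, candidate_L, delta_L=30.0):
    """Thumb rule: a label lightness is considered easily readable when it
    differs from every background pixel behind the label by at least 30 L*
    units (a conservative, per-pixel reading of the rule)."""
    return bool(np.all(np.abs(lab_region[..., 0] - candidate_L) >= delta_L))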
5 Implementation Details A user can upload a picture of choice into the interface, and the system gives a score to each point in the 3D color space based on its legibility potential at a given location in the image. We work in the UP Lab color space for both the score calculation and the visual interface. To calculate the scores, we create a preference-map (p-map), which takes into account three values: (a) the convex hull p-map, (b) the intensity-based p-map, and (c) the hue-based p-map. For all of them, we work on the histogram of the image, in different dimensions. Also, to avoid sudden changes and to take neighboring values into account, we apply a Gaussian filter at different steps. An overview of our system is presented in the flowchart in Figure 1. Initially, bounding boxes are created for the region surrounding the text, and the intensity p-map is calculated based on the colors present in these regions. The user can then select the appropriate intensity levels, and for the selected intensity, the hue-based and convex-hull-based p-maps are created to display the joint p-map, which is a weighted sum of all three p-maps. The rest of the section discusses these steps in detail. Calculating the preference-map (p-map): Here we discuss how a color's hue, intensity, and location with respect to the convex hulls define its p-map value. Intensity-based p-map: The intensity-based p-map assigns each intensity level in the range [0, 100] a value based on how well colors at that level will work from the point of view of luminance contrast with the background. As we mentioned in Section 4.1, a contrast of 30 will make text easily legible. Here contrast is the absolute difference
between the foreground and background intensities. Intensities that are farther away from the ones present in the background are given a higher value. A simple way is to calculate the intensity histogram, invert it, and normalize it to the range [0, 1]. Also, applying a Gaussian filter helps to give a lower value to intensities in close proximity to those in the background. However, we found that this did not work well enough in cases where there is a small patch with an almost uniform intensity (see Figure 4). This is because such patches do not have a sufficiently large presence in the histogram to be penalized heavily; however, since they form a continuous block, using a similar intensity works poorly. So we add a module which passes a sliding window over the text area to find regions with small variations in intensity. If any such region is found, then the mean intensity and the intensities within a certain width (the sigma width) are given a low value in the final histogram. These low values start from zero at the mean and increase linearly towards the ends.
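A minimal sketch of this intensity p-map is given below, assuming an L* channel in [0, 100] and NumPy/SciPy. The window size, the uniformity threshold on the local standard deviation, and the exact shape of the linear ramp are assumptions; the paper only specifies the invert-and-normalize histogram step, the Gaussian smoothing, and the sigma-cutoff penalty around near-uniform patches (the default cutoff of 15 is taken from the case study in Section 6).

import numpy as np
from scipy.ndimage import gaussian_filter1d, uniform_filter

def intensity_pmap(L_region, sigma=3.0, window=7, sigma_cutoff=15.0, uniform_std=2.0):
    """Score each of the 101 intensity levels: rare background intensities
    score high; intensities near a locally uniform patch are pushed toward
    zero with a linear ramp of half-width sigma_cutoff."""
    L = np.asarray(L_region, dtype=float)
    hist, _ = np.histogram(L, bins=101, range=(0, 100))
    hist = gaussian_filter1d(hist.astype(float), sigma)
    pmap = 1.0 - hist / max(hist.max(), 1.0)
    # Sliding-window detection of near-uniform patches in the text area
    local_mean = uniform_filter(L, size=window)
    local_sq = uniform_filter(L * L, size=window)
    local_std = np.sqrt(np.maximum(local_sq - local_mean ** 2, 0.0))
    uniform_means = np.unique(np.round(local_mean[local_std < uniform_std]).astype(int))
    for m in uniform_means:
        lo, hi = max(0, m - int(sigma_cutoff)), min(100, m + int(sigma_cutoff))
        ramp = np.abs(np.arange(lo, hi + 1) - m) / max(sigma_cutoff, 1.0)
        pmap[lo:hi + 1] = np.minimum(pmap[lo:hi + 1], ramp)
    return pmap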
Fig. 5. (a) The interface, including the joint p-map at intensity level 50. (b) and (d) show the CH p-map and the hue p-map separately. (c) and (e) show the image annotated with the best p-map value restricted to CH and hue, respectively.
Hue-based p-map: In the previous section we assigned each intensity level a value. Within an intensity level, we would like to avoid the hues which are present in the background, especially those at similar intensity levels. We first find the hue values for each point at the current luminance level. Then we calculate the hue histogram for the annotation box and the container box. This is followed by smoothing with a circular Gaussian filter, since hues wrap around in the UP-Lab space. Then we normalize the data to the range [0, 1]. Finally, the invert function is applied to get the hue p-map pH. Convex hull based p-map: The global convex hull method mentioned in section 3.3 appears overly restrictive for annotations, which are local in nature, i.e., the color
content of a distant image region may not be a distractor here. Further, they work in 2D space, not taking the intensity into account. We therefore extend this idea by: (a) Calculating convex hulls separately for each intensity level. In order to prevent sudden changes in the shape of the convex hulls, and also to take the occurrence of colors at neighboring intensities into account, we apply a 1D Gaussian filter to the histogram of each chroma value. (b) Creating separate convex hulls for three nested boxes in the image. Since the local effect is expected to be greater than the effect of the image as a whole, we give different weights to the boxes. The parameter window-size tells us the distance by which b2 extends beyond b1 (see Figure 2 (e)). Also, we give these three boxes weights w1, w2 and w3, with w1 > w2 > w3. Since the weights are really ratios, we have two more parameters, w1 and w2, with w3 = 1. We then calculate the distance field DF(a, b) of all possible a*b* pairs from the convex hulls as DF(a, b) = Σi wi × DFi(a, b), where DFi(a, b) is the distance of point (a, b) from the ith convex hull, CHi (see Figure 5 (b)). These distance field values are normalized to the range [0, 1] to give us the convex hull p-map values pCH. The joint p-map is calculated as a weighted sum of the scores calculated in the previous three subsections, i.e., (wI*pI + wH*pH + wCH*pCH) / (wI + wH + wCH). Visual Interface: The visual interface, as shown in Figure 2, helps the user explore the UP Lab space and find colors which have good p-map values for a given image and location. The intensity map is shown as a histogram at the bottom to guide the user towards intensities providing good luminance contrast. Next, we see a legend at the right, which shows the range of the values present in the p-map at the current intensity level. Finally, we see the color space, along with the convex hulls (in red). We also overlay sectors and circles on the color space, so that a user can easily select colors with the same hue or saturation. The user can toggle the display to see the p-map instead. In this view, we discretize the p-map values into 20 levels, and assign each point an intensity value based on these levels. This helps the user follow the patterns in the p-map much better. The user can hover the mouse over a point to see the exact p-map value. We retain the circles and sectors from the color space display, and fill them with the colors they represent to guide the user as selections are made on the basis of the p-map. When an expert user is using the system, we can expose the parameters of the system so that they can be tweaked to get better results. We provide an optimal button, which finds all the colors with the highest p-map values. This requires us to find the joint p-map values at all intensity levels. Next, we find the maximum p-map value max_p-map, and return all the colors with the property p-map(l, a, b) > max_p-map × threshold_opt. The threshold value threshold_opt is a real number between 0 and 1; we usually use a value of 0.95 in our interactive system. Once these optimal colors have been found, the user can easily browse through them using forward/backward buttons. A minor tweak in one of the optimal colors often gives acceptable results.
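The weighted combination and the behavior of the optimal button can be sketched directly from the formulas above. The arrays are assumed to hold per-(l, a, b) preference values already normalized to [0, 1]; the function names and array layout are assumptions.

import numpy as np

def joint_pmap(p_I, p_H, p_CH, w_I=1.0, w_H=1.0, w_CH=1.0):
    """Joint preference map as the weighted sum
    (w_I*p_I + w_H*p_H + w_CH*p_CH) / (w_I + w_H + w_CH)."""
    return (w_I * p_I + w_H * p_H + w_CH * p_CH) / (w_I + w_H + w_CH)

def optimal_colors(pmap, threshold_opt=0.95):
    """Return the (l, a, b) indices whose joint p-map value exceeds
    threshold_opt times the global maximum (the 'optimal' button)."""
    max_p = pmap.max()
    idx = np.argwhere(pmap > max_p * threshold_opt)
    return idx, pmap[tuple(idx.T)]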
6 Results The Magic Marker tool is simple to use even for novice users, and responds interactively for most operations. Let us look at a running example to see how the system works.
Fig. 6. This shows some sub-optimal results obtained when the sigma-cutoff parameter is changed to 0. The first two images come from the higher ends of the intensity range, and the third image is from a lower intensity.
Case Study: The tool starts up with some initialization steps. The mapping of the CIE-Lab space to the UP-Lab space using the look-up table takes a relatively large amount of time (~1-2 minutes). After the system starts running, a user can load any image of choice. At this point, a default setting is already initialized, along with the default intensity level setting of 50. Once the image has been read in, we can see the intensity p-map values as a histogram, as well as the joint p-map values of the currently selected intensity in a 2D representation, as shown in Figure 5(a). The intensity p-map guides the user towards the intensity levels with higher p-map values. In the present case, we can clearly see that the intensity p-map is highest in the middle region – around intensity level 50. Next, we look at the hue p-map and convex-hull p-map separately to understand how the p-map calculation works in those spaces. Convex hull p-maps, as shown in Figure 5(b), have the lowest p-map values inside CH1 (where the value is always zero), and they progressively increase as we move outwards from CH1. Finally, a p-map value of 1 is reached at at least one of the edge locations. In this example, the maximum is reached at purple, in the lower left corner. The histogram p-map calculation takes a hue sweep value of 180. This means that if there is exactly one color present in the background, the hue p-map values will increase starting from 0 as we move away from the background hue, reaching a maximum value of 1 at the diametrically opposite hue value. However, since the hue p-map is already normalized, we will always see a hue value with a p-value of 1. In Figure 5(d), we can see the hue value with p-map 1 highlighted – let us call it hue_pref-max. The text in Figure 5(e) is annotated with a color which is a combination of the current intensity level (50), the hue value hue_pref-max, and the maximum possible chroma for this hue and intensity combination. This analysis of the individual p-maps shows us that, due to the way the CH p-map is calculated, the best colors will always lie on the edge of the color space, i.e., those colors that have the maximum possible chroma for a hue and intensity combination. Intensity maps and hue maps, on the other hand, restrict the best colors to a certain intensity level or hue value – thus any intensity or hue can be given the highest p-map
value. When we combine these p-maps to form the joint p-map, the restriction from the CH p-maps remains. The best colors suggested by the hue p-map and the CH p-map are clearly distinct – red and purple, respectively. However, they are both perceived to perform equally well. We turn our attention back to the joint p-map, with all three p-maps weighted equally. Next, we press the optimal button, which indeed returns variations of the red and purple annotation colors seen in Figure 5. Our default setting had the sigma-cutoff value set at 15. If we change it to zero, we end up with the intensity p-map at the bottom of Figure 5(a). This intensity p-map gives a good preference value to a much larger number of intensities. This makes it difficult for a user to explore the space and find an appropriate color. Even the optimal button returns many colors which do not lend themselves to a legible annotation. Some of them are shown in Figure 6. The first two have intensities in the higher intensity range (about 60-65), whereas the blue comes from intensity level 23. All these intensities have a low to mid-range score in the first intensity p-map, and are never considered as optimal color ranges. This shows us that identifying regions of uniform intensity in the background is an essential step.
7 Conclusion
In this paper we presented an interactive framework that lets users choose appropriate colors for annotating their images. This is preferable to a static program that simply suggests the best colors, since legibility is still an active field of study with no settled conclusions on the interaction between luminance contrast and chroma contrast. Our system is fairly general and can easily be modified to incorporate newly acquired knowledge about color perception and luminance/chromatic contrast. Future work includes extending the system to support highlighting elements in volume datasets. Further, we can incorporate aesthetics into the system by scoring colors based on color harmony.
BiCluster Viewer: A Visualization Tool for Analyzing Gene Expression Data
Julian Heinrich, Robert Seifert, Michael Burch, and Daniel Weiskopf
VISUS, University of Stuttgart
Abstract. Exploring data sets by applying biclustering algorithms was first introduced in gene expression analysis. While the amount of biclustered data grows at increasing rates due to technological progress in measuring gene expression, the visualization of the computed biclusters remains an open issue. For efficiently analyzing the vast amount of gene expression data, we propose an algorithm that generates and lays out biclusters with a minimal number of row and column duplications, together with a visualization tool for interactively exploring the uncovered biclusters. In this paper, we illustrate how the BiCluster Viewer may be applied to highlight detected biclusters generated from the original data set using heatmaps and parallel-coordinate plots. Many interactive features are provided, such as ordering functions, color codings, zooming, and details-on-demand. We illustrate the usefulness of our tool in a case study where yeast data is analyzed. Furthermore, we conducted a small user study with 4 participants to demonstrate that researchers are able to learn and use our tool to gain insights into gene expression data very rapidly.
1 Introduction
Modern research has developed promising approaches for analyzing gene expression data. The major bottleneck of such analyses is the vast amount of data. The behavior of single genes under different conditions, such as different time points or tissue types, is contained within these massive data sets. Such data poses a difficult analysis task, as it might be incomplete and often has a low signal-to-noise ratio. Biclustering is frequently used to analyze such data sets and to discover dependencies among genes and conditions. A single bicluster represents a part of a table in which the corresponding genes and conditions behave similarly in a certain way. The immense and growing number of data sets and biclusters makes it very difficult or even impossible to uncover all dependencies in a single static view; hence, we exploit human perceptual abilities for a fast exploration of the data by mapping it to a visual form. Another problem is that it is not clear a priori what an analyst hopes to detect in the data. In this paper, we present an interactive visualization tool that can be used to visualize a large number of biclusters. In general, the tool follows the Information
Visualization Seeking Mantra: Overview first, zoom and filter, then details-on-demand, proposed by Shneiderman [1]. The visualization techniques used in the tool are based on heatmap representations as used by Eisen et al. [2] and parallel-coordinate plots first introduced by Inselberg and Dimsdale [3]. In a case study we illustrate how a yeast data set can be explored for biclusters. Furthermore, we demonstrate the usefulness of our BiCluster Viewer by conducting a small user study with 4 participants. The visualized biclustering data is based on both an artificial data set and a real-world data set from gene expression analysis. Our visualization tool is not restricted to bicluster representation in gene expression data but can easily be applied to any kind of matrix-like data containing real-valued numbers. Heatmap representations are ideal for visualizing large tabular data sets such as gene expression data and for obtaining an overview representation by mapping data values to color values. Parallel-coordinate techniques are used to map the rows and columns of biclusters to vertical axes and connect them by direct lines. Visual clutter caused by line crossings in dense data sets can be reduced using interactive features such as filtering, transparency, and color coding. Linking and brushing techniques are used to allow different views on selected subsets of the data simultaneously.
2 Related Work
Most related applications use heatmaps [2], parallel coordinates [3] or node-link diagrams for the visualization of biclusters. While heatmaps use a matrix layout and color to indicate expression levels, parallel coordinates use polylines across many axes (representing columns of the matrix) to represent multivariate data. Both visualizations suffer from the ordering problem: it is generally not possible to represent more than two biclusters contiguously without row (line) or column (axis) duplication. Grothaus et al. [4] represent overlapping biclusters in a single heatmap and allow row and column duplications if biclusters cannot be represented contiguously. While being optimal with respect to the number of duplications, such an automatic layout algorithm does not allow for interactivity. In our work, we allow overlapping biclusters, and the user may decide which biclusters to show contiguously in order to minimize row and column duplications. BicOverlapper by Santamaria et al. [5] is able to represent several overlapping biclusters simultaneously, but the techniques used there are based on graphs represented in a node-link visual metaphor, which differs from our work, where heatmaps and parallel coordinates are used. Cheng et al. [6,7,8] use parallel coordinates to visualize additive and multiplicative biclusters. Visual clutter [9] caused by many intersecting lines is the main problem when drawing parallel-coordinate plots for dense data sets. We circumvent this drawback by using transparency and color coding of single lines and line groups. Furthermore, lines that are currently not in focus may be suppressed to further reduce the overlap and visual clutter. These clutter reduction
principles are also used in [10]. We further extend parallel-coordinate plots using bundling to show biclusters. The BiClust package developed by Kaiser et al. [11] is an extension for the R environment [12]. The package provides a variety of biclustering algorithms and also many visualization techniques to represent the biclustered data. In addition to the traditional heatmap representation and parallel-coordinate plots, bubble charts are provided. Our tool allows the use of BiClust or any other algorithm provided in R for the computation of biclusters. BicAT [13] is another application that can be used to analyze biological data such as gene expression data. In contrast to R, it is not based on command-line arguments but provides a graphical user interface to manipulate and navigate in the data. All computed biclusters are shown in a list where they can be selected. A selection in the heatmap or parallel-coordinate plots is not supported by this tool. Also, a comparison of certain biclusters is not possible because only one bicluster at a time is represented. ExpressionView [14] is another R package that allows heatmap-based browsing of biclusters obtained from gene expression experiments. The tool uses an ordering that maximizes the areas of the largest contiguous parts of biclusters. Again, the ordering is fixed and the user is not allowed to change it.
3 Bicluster Viewer
In this paper we present the Bicluster Viewer for visualizing biclustering results based on gene expression data. We use and extend traditional heatmap representations and parallel-coordinate plots such that more than one bicluster can be visualized simultaneously. In addition, we support contiguous representation of selected biclusters by allowing row and column duplications.
3.1 Heatmap Visualization
Heatmaps are a good choice for representing a large portion of the data as an overview in a single static view. Many heatmap implementations use multi-hue colormaps to indicate up- and downregulated genes separately. In this work, we map data values to grayscale values using linear interpolation between the smallest and largest value of the data matrix, as changes in single-hue colormaps are perceived more accurately than red-to-green color scales for continuous data values. In general, it is not possible to represent more than two biclusters such that all of them are located in contiguous regions of the matrix using row and column permutations only (see Figures 1 (a) and (b)).
Bicluster Representation. In order to achieve a contiguous representation of more than one (possibly overlapping) bicluster, we allow row and column reordering as well as duplication. Biclusters are highlighted in the heatmap by surrounding rectangles. If biclusters cannot be represented contiguously, they are represented by several rectangles. To distinguish different biclusters perceptually, we use a nominal color scale that maps a unique color to each bicluster. For non-contiguous biclusters, only the largest area is displayed by default. The user can
Fig. 1. Schematic example for the insertion of a bicluster without duplication: (a) The green and red colored biclusters are inserted in the matrix and their overlap is represented by a yellow color. (b) The dark blue colored bicluster is inserted in columns {3, 5, 6, 7, 8}. Without duplication, it is not possible to represent all biclusters in a connected way. The duplicated columns {3, 8} are displayed in a transparent red color.
decide interactively whether biclusters should be represented by their major rectangle only or whether all rectangles of a bicluster should be displayed. For the latter, the user gets a representation in which the overlap of disconnected biclusters is also visualized. In this mode, only the major rectangle of a bicluster is drawn with continuous lines; all other rectangles are displayed with dashed lines. Selected biclusters are color-coded with a transparent yellow color which is blended additively in overlapping regions. However, the user has the opportunity to choose colors for every bicluster individually. Figure 2 shows the different representations of the heatmap for an example dataset containing six biclusters. The heatmap in the left part of Figure 3 shows that the light blue colored bicluster is displayed as disconnected rectangular regions. In the right part of the figure, two columns are duplicated to achieve connectivity. These are displayed in a transparent red color. To better distinguish duplicated columns from original columns, we insert direct links to the header, as shown at the bottom of Figure 3; these can be displayed on user demand. By clicking on the lines, the user can interactively highlight all corresponding rows and columns. While other automatic layout algorithms [4] produce one optimized solution, we developed an algorithm that allows the user to choose one bicluster that should be represented contiguously. Subsequent biclusters are then inserted iteratively into the matrix based on the total size of row and column overlap. In every iteration, the bicluster with the largest overlap to the last inserted bicluster is added to the current set of biclusters, and rows and columns are duplicated to ensure a contiguous representation. By default, the first bicluster chosen is the one with the largest total overlap (i.e., the sum of overlaps with all other biclusters).
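One plausible reading of this insertion strategy is the greedy procedure sketched below. It is an illustration only, not the tool's implementation: biclusters are assumed to be given as (row set, column set) pairs, the helper name order_biclusters is ours, and the actual row/column duplication step that makes each inserted bicluster contiguous is omitted.

```python
def order_biclusters(biclusters, seed=None):
    """Return an insertion order over bicluster ids.

    biclusters: dict mapping id -> (set_of_rows, set_of_columns).
    The first bicluster is a user-chosen seed, or by default the one with
    the largest total overlap with all others; each following bicluster is
    the one overlapping most with the previously inserted one.
    """
    def overlap(a, b):
        (ra, ca), (rb, cb) = a, b
        return len(ra & rb) + len(ca & cb)

    remaining = dict(biclusters)
    if seed is None:
        # bicluster with the largest total overlap with all other biclusters
        seed = max(remaining, key=lambda i: sum(
            overlap(remaining[i], remaining[j]) for j in remaining if j != i))
    order = [seed]
    del remaining[seed]
    while remaining:
        last = biclusters[order[-1]]
        nxt = max(remaining, key=lambda i: overlap(remaining[i], last))
        order.append(nxt)
        del remaining[nxt]
    return order
```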
3.2 Parallel-Coordinate Plots
We further use parallel-coordinate plots to display biclusters. Our visualization tool exploits linking and brushing techniques to link the heatmap and parallel-coordinate plots. This allows a user to explore the data from different points of view.
Fig. 2. Different representation modes for the heatmap: (a) Default view: each bicluster is represented by its major rectangle only. (b) All biclusters are represented. (c) Representation with three highlighted biclusters. (d) Representation with permanently highlighted biclusters.
In the parallel-coordinates plot, each polyline represents the expression of a gene over all conditions (which are represented by vertical axes). Genes belonging to a bicluster are rendered using the same color as the corresponding bicluster in the heatmap. The axes of the parallel-coordinates plot are arranged in the same order as the columns in the heatmap representation. In order to visualize the conditions (axes) of genes belonging to a bicluster, we compute the average vertical position of all lines of a bicluster halfway between adjacent conditions if at least one of the conditions is part of the bicluster. Then, the corresponding lines are forced to cross this point, which we call the centroid. Figures 4 (a)-(c) illustrate both visual metaphors in a heatmap representation and a parallel-coordinates plot for two biclusters.
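A rough sketch of this centroid computation is given below, assuming the expression values have already been normalized to vertical positions on their axes; the function name bundle_centroids is ours, not the paper's.

```python
import numpy as np

def bundle_centroids(positions, bicluster_rows, bicluster_cols):
    """Compute bundling centroids for one bicluster.

    positions: (genes x conditions) array of normalized vertical axis
    positions. Returns a dict mapping the gap index c (between axes c and
    c+1) to the centroid y-position through which the bicluster's
    polylines are routed.
    """
    rows = np.asarray(sorted(bicluster_rows))
    cols = set(bicluster_cols)
    centroids = {}
    for c in range(positions.shape[1] - 1):
        # bundle only if at least one adjacent condition is in the bicluster
        if c in cols or (c + 1) in cols:
            midpoints = (positions[rows, c] + positions[rows, c + 1]) / 2.0
            centroids[c] = float(midpoints.mean())
    return centroids
```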
3.3 Interactive Features
The BiCluster Viewer supports many interactive features. Figure 6 shows how the graphical user interface of the BiCluster Viewer is structured, and Figure 5 (a) shows a screenshot of the visualization tool in its biclustering mode.
Fig. 3. Column duplication to achieve bicluster connectivity
Fig. 4. Mapping biclusters from a heatmap representation to parallel-coordinate plots: (a) Two biclusters in a heatmap. (b) The same two biclusters in a parallel coordinates plot without centroids. (c) The same two biclusters in a parallel coordinates plot with centroids.
– Navigating the heatmap. In the upper part of the heatmap, column descriptions are displayed, and row descriptions are shown in the first column. A zooming function can be applied by using the plus (+) or minus (-) sign in the upper left part of the frame to select the zooming factor. The zooming factor can also be changed by pressing the Ctrl key on the keyboard and moving the mouse wheel. If the heatmap cannot be represented completely, scroll bars are displayed automatically.
– Selection of biclusters. By clicking the left mouse button within a highlighted bicluster rectangle, the corresponding biclusters are selected and highlighted with a transparent yellow box. Overlapping biclusters are represented using additive blending. The corresponding rows and columns are also highlighted in yellow. Additionally, information is shown about which biclusters are currently selected, see Figure 5 (b).
– Bicluster navigation list. The navigation list in the right part of the application serves as an overview of all selected biclusters in the heatmap. Furthermore,
Fig. 5. The BiCluster Viewer in its: (a) Biclustering mode. (b) Biclustering selection mode. (c) Navigation list mode. (d) Duplication mode. (e) Dashed rectangles mode.
it highlights selected biclusters in yellow. Additionally, an overview is shown with information about which bicluster is currently duplicated; this is colored in red. It is also possible to select a bicluster from the navigation list, and selections are linked to the heatmap (see Figure 5 (c)).
– Duplication of rows and columns. In some cases it is impossible to display all connected biclusters in a single view. For this reason we allow a duplication of the involved rows and columns of single biclusters. To show a certain bicluster as a connected entity, a context menu has to be opened by right-clicking in the corresponding rectangle. By selecting expand duplicated, the required rows and columns are displayed as duplicates. These rows and columns are then highlighted in a transparent red color, see Figure 5 (d).
– Menu entries. If not all biclusters can be represented in a connected way, the menu entry show all biclusters can be used to display rectangles with dashed lines, see Figure 5 (e). All cells that belong to a bicluster are represented by a rectangle with dashed lines in the corresponding bicluster color. The layout of the heatmap is generated automatically by the tool. It may happen that certain biclusters cannot be displayed as connected entities without row or column duplication. By selecting the menu entry setup fit policy, a dialog is opened in which the user can influence the layout, i.e., the order of rows and columns. The order of the heatmap is built by successively adding all biclusters into the heatmap representation.
4 Case Study
We conducted a small case study to show how the BiCluster Viewer can be used to visualize biclusters. We loaded the yeast data and biclusters from [15]
Fig. 6. The GUI of the Bicluster Viewer provides many interactive features to explore gene expression data and to navigate in it
into our tool and used the bicluster navigation list to sort all biclusters by score. After resorting the heatmap according to this ordering, we selected the highest-ranking bicluster B87. While the heatmap nicely shows the dimensions of the bicluster, its pattern cannot be seen from the heatmap. Highlighting the bicluster and displaying it in parallel coordinates resolves this issue. Figure 6 shows a screenshot of the tool with the heatmap in the background and parallel coordinates on top. As can be seen, B87 exhibits a pattern only for the five rightmost conditions, which is nicely illustrated in parallel coordinates.
5 Pilot Study
To show the usefulness of our technique, we conducted a small pilot study with 4 participants. All subjects were researchers from our institute. The pilot study also serves to demonstrate the interactive features supported by the visualization tool.
5.1 Study Design
The study was conducted on an Intel Core2 Duo notebook running at 2 GHz with 2 GB of RAM. The functionality of our visualization tool was explained to the participants, and they were given a short introduction to biclusters. They were shown printed tutorials to understand the tool and the individual steps in the study. After reading the tutorial, participants familiarized themselves with the interactive features of the visualization tool. In the training phase they could work with the tool as long as they wanted and were allowed to ask the experimenter questions. Finally, subjects were asked questions to determine whether they understood the visualization and whether they could navigate the visualization tool to answer the given tasks correctly.
The goal of the pilot study was to test the interactive visualization and analysis tool for the bicluster representation. For this reason, participants were presented with different data sets. The overall task for the participants was to explore previously defined biclusters based on these data sets. We outline four major categories of phenomena to be tested:
– How accurately are biclusters distinguished from each other?
– How accurately are overlaps between biclusters detected?
– How accurately are biclusters mapped to their rows and columns?
– How accurately are single biclusters analyzed?
5.2 Tasks
Participants had to perform six tasks, where each task consisted of a number of subtasks. Tasks 1 to 3 had to be solved without using the BiCluster Viewer and its interactive features, whereas tasks 4 to 6 had to be answered using the tool. The following tasks and subtasks were performed in the study:
– Task 1: Biclustering results are shown for three different data sets, see Figure 7 (a)-(c). 1. How many biclusters can be found in each of the three representations? 2. In which representation is the largest overlap, i.e., the most cells belonging to more than one bicluster?
Fig. 7. Biclustering results for three different data sets: (a) 15 biclusters. (b) 9 biclusters. (c) 10 biclusters. (d) 9 biclusters with option Show All Biclusters. (e) 10 biclusters with option Show All Biclusters. (f) 15 biclusters with option Show All Biclusters. (g) 10 biclusters with option Show All Biclusters not set but all clusters selected beforehand. (h) 15 biclusters with option Show All Biclusters not set but all clusters selected beforehand. (i) 9 biclusters with option Show All Biclusters not set but all clusters selected beforehand.
– Task 2: Biclustering results are shown for three different data sets. Now the option Show All Biclusters is set, see Figure 7 (d)-(f). 1. How many biclusters can be found in each of the three representations? 2. In which representation is the largest overlap, i.e., the most cells belonging to more than one bicluster?
– Task 3: Biclustering results are shown for three different data sets. Now the option Show All Biclusters is not set, but all biclusters were selected beforehand, see Figure 7 (g)-(i). 1. How many biclusters can be found in each of the three representations? 2. In which representation is the largest overlap, i.e., the most cells belonging to more than one bicluster?
– Task 4: A listing of all shown biclusters with the corresponding color codings is given. Write down the bicluster ID and the row and column number. 1. Which is the largest bicluster? 2. Are there overlapping biclusters? If so, which? 3. Which column contains the most biclusters? 4. Are there columns that do not contain any biclusters? If so, which?
– Task 5: A listing of all shown biclusters with the corresponding color codings is given. Write down the bicluster ID and the row and column number. 1. Which bicluster has the largest overlap with bicluster B8? 2. Find a cell (give row and column) that lies inside at least three biclusters. Write down the IDs of the biclusters in which the cell is located. 3. A certain bicluster has to be analyzed more closely. Determine all biclusters with which B9 has rows AND columns in common. Write down the corresponding row and column descriptions.
– Task 6: A listing of all shown biclusters with the corresponding color codings is given. Write down the bicluster ID and the row and column number. 1. Find the bicluster that has an overlap with the most other biclusters. Write down the IDs of the biclusters.
5.3 Study Results
The results of our pilot study are shown in Figure 8. The blue bars show the sum of the correct answers of all participants for each task; the maximum number of correct answers is 4 (one per participant). Partially correct answers were not counted as correct. The red line shows the average completion time for each task. Because of the limited number of participants we cannot draw any statistically significant conclusions from the study, but some trends across the different tasks are already visible. The additionally recorded comments of the participants are a good basis for a first analysis of the applicability of our interactive visualization tool. We found that the representation with the dashed lines is not suited to distinguishing or counting single biclusters (no correct answers). Surprisingly, the degree of overlap was interpreted incorrectly in most cases. The reason is that participants weighted the connected overlaps more than
Fig. 8. Correct answers and average completion times of the pilot study
those that are highlighted by the dashed rectangles. The representations from Tasks 1 and 3 (standard and select all) are well suited to distinguishing biclusters. The wrong answers stem from misinterpretations of the representations.
6 Conclusion and Future Work
In this paper we introduced the BiCluster Viewer, an interactive visualization tool for analyzing gene expression data. Biclusters are extracted from tabular data by applying a biclustering algorithm, and a layout is computed that allows a minimal number of row and column duplications. We use a heatmap representation in which all generated biclusters can be displayed simultaneously in a connected way by allowing these kinds of duplications. Parallel-coordinate plots are used to provide another view of the same data set, and linking and brushing techniques additionally support the viewer in linking both views, with the goal of gaining more insight into the data than a single visual metaphor alone would provide. The BiCluster Viewer contains many interactive features such as ordering functions, color codings, zooming, or details-on-demand to explore the data. We demonstrated the usefulness of the tool in a case study by showing insights from a yeast data set. A small user study with 4 participants illustrates which features of the tool are easily understood and used accurately and efficiently. In the future we plan a more sophisticated user study with a larger number of participants. Acknowledgment. The authors would like to thank the German Research Foundation (DFG) for financial support of the project within the Cluster of Excellence in Simulation Technology (EXC 310/1) at the University of Stuttgart.
References 1. Shneiderman, B.: The eyes have it: A task by data type taxonomy for information visualizations. In: Proceedings of the IEEE Symposium on Visual Languages, pp. 336–343 (1996) 2. Eisen, M., Spellman, P., Brown, P., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 14863–14868 (1998)
3. Inselberg, A., Dimsdale, B.: Parallel coordinates: A tool for visualizing multidimensional geometry. In: Proceedings of IEEE Visualization, pp. 361–378 (1990) 4. Grothaus, G., Mufti, A., Murali, T.: Automatic layout and visualization of biclusters. Algorithms for Molecular Biology 1 (2006) 5. Santamaria, R., Theron, R., Quintales, L.: A visual analytics approach for understanding biclustering results from microarray data. Bioinformatics 9 (2008) 6. Cheng, K., Law, N., Siu, W., Liew, A.C.: Biclusters visualization and detection using parallel coordinates plots. In: Proceedings of the International Symposium on Computational Models for Life Sciences (2007) 7. Cheng, K., Law, N., Siu, W., Lau, T.: BiVisu: Software tool for bicluster detection and visualization. BMC Bioinformatics 23, 2342–2344 (2007) 8. Cheng, K.O., Law, N.F., Siu, W.C., Liew, A.: Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization. BMC Bioinformatics 9, 210–238 (2008) 9. Rosenholtz, R., Li, Y., Mansfield, J., Jin, Z.: Feature Congestion: A Measure of Display Clutter. In: Proceedings of SIGCHI Conference on Human Factors in Computing Systems, pp. 761–770. ACM Press, New York (2005) 10. Dietzsch, J., Heinrich, J., Nieselt, K., Bartz, D.: SpRay: A visual analytics approach for gene expression data. In: IEEE Symposium on Visual Analytics Science and Technology, pp. 179–186 (2009) 11. Kaiser, S., Santamaria, R., Theron, R., Quintales, L., Leisch, F.: Bicluster algorithms (2009), http://cran.r-project.org/web/packages/biclust/biclust.pdf 12. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2011) ISBN 3-900051-07-0 13. Barkow, S., Bleuler, S., Zitzler, E., Prelic, A., Frick, D.: BicAT: Biclustering analysis toolbox, ETH Zürich (2010), http://www.tik.ethz.ch/sop/bicat/?page=bicat.php 14. Luscher, A.: ExpressionView (2010), http://www2.unil.ch/cbg/index.php?title=ExpressionView 15. Cheng, Y., Church, G.M.: Biclustering of expression data. In: Proceedings of International Conference on Intelligent Systems for Molecular Biology, pp. 93–103 (2000)
Visualizing Translation Variation: Shakespeare's Othello
Zhao Geng1, Robert S. Laramee1, Tom Cheesman2, Alison Ehrmann2, and David M. Berry2
1 Visual Computing Group, Computer Science Department, Swansea University, UK
{cszg,r.s.laramee}@swansea.ac.uk
2 College of Arts and Humanities, Swansea University, UK
[email protected], [email protected], [email protected]
Abstract. Recognized as great works of world literature, Shakespeare's poems and plays have been translated into dozens of languages for over 300 years. Also, there are many re-translations into the same language; for example, there are more than 60 translations of Othello into German. Every translation is a different interpretation of the play. These large quantities of translations reflect changing culture and express the individual thought of their authors. They demonstrate wide connections between different world regions today, and reveal a retrospective view of their cultural, intercultural, and linguistic histories. Researchers from Arts and Humanities at Swansea University are collecting a large number of translations of William Shakespeare's Othello. In this paper, we present an interactive visualization system for presenting, analyzing and exploring the variations among these different translations. Our system is composed of two parts: the structure-aware Treemap for document selection and metadata analysis, and Focus + Context parallel coordinates for in-depth document comparison and exploration. In particular, we want to learn more about which content varies highly with each translation, and which content remains stable. We also want to form hypotheses as to the implications behind these variations. Our visualization is evaluated by the domain experts from Arts and Humanities.
1 Introduction
William Shakespeare is widely regarded as one of the greatest writers, and his plays have been translated into every major living language. This is a historical and contemporary phenomenon. In German, the first translation of one play, Othello, was produced in 1766. By now there are over 60 translations, including 7 new translations of this play produced since the year 2000. Questions about these translations are seldom asked in the Anglophone world, because interpreting them is difficult without specialist linguistic and cultural knowledge. Shakespeare's original work in English is normally considered more important than any translation. But with increasing awareness of global cultural interconnections, more Arts and Humanities researchers recognize the significance of translations and are investigating them. The interpretation of Shakespeare's work in translation is always influenced by the translator's own culture, customs and conventions. Therefore, each translation is a product of changing culture as well as an expression of each translator's individual thought
within that culture. Also, each translation is a reply to received ideas about what Shakespeare's work means. Semantic and textual variations between translations in the corpus carry relational cultural significance. Normally, researchers from Arts and Humanities read and compare cultural texts in their raw form, which makes the analysis of multiple translations difficult. In addition, interesting patterns are often associated with text metadata, such as historical period, place, text genre or translator profession. Up until now, researchers from Arts and Humanities have collected more than 50 different German translations of Shakespeare's play, Othello. Our general goal is to identify similarities and differences among these translations. Compared to traditional text mining, text visualization incorporates visual metaphors and interactive design to facilitate in-depth exploratory data analysis. In this paper, we aim to develop an interactive visualization system to help the researchers from Arts and Humanities perceive and understand their collected German translations in new ways. In order to do so, we collect a large amount of metadata associated with the original documents and extract semantic features from the document contents. Based on such extracted information, various visualizations can be applied. We propose a structure-aware Treemap for metadata analysis and document selection. Once a group of documents is selected, they can be further analyzed by our Focus + Context parallel coordinates. The rest of this paper is organized as follows: In Section 2, we review previous work on text visualization. In Section 3, we describe our source data. In Section 4, we explain how the original documents are processed before being input to the visualization. In Section 5, we illustrate our structure-aware Treemap for metadata analysis. In Section 6, we present the Focus + Context parallel coordinates for translation variation exploration. In Section 7, we report the feedback from the domain experts. Section 8 concludes.
2 Related Work
Since 2005, a rapid increase in the number of text visualization prototypes can be observed at the major visualization conferences. As a result, various visual representations for text streams and documents have been proposed to effectively present and explore text features. A large number of visualizations have been developed for presenting the global patterns of an individual document or overviews of multiple documents. These visualizations are able to depict word or sentence frequencies, such as Tag Clouds [1], Wordle [2], WordTree [3], or relationships between different terms in a text, such as PhraseNet [4], TextArc [5] and DocuBurst [6]. The standard tag cloud [1] is a popular text visualization for depicting term frequencies. Tags are usually listed alphabetically and the importance of each tag is shown with font size or color. Wordle [2] is a more artistically arranged version of a text which can give a more personal feel to a document. ManiWordle [7] provides flexible control such that the user can directly manipulate the original Wordle to change the layout and color of the visualization. Word Tree [3] is a visualization of the traditional keyword-in-context method. It is a visual search tool for unstructured text. Phrase Nets [4] illustrates the relationships between different words
used in a text. It uses a simple form of pattern matching to provide multiple views of the concepts contained in a book, speech, or poem. A TextArc [5] is a visual representation of an entire text on a single page. It provides animation to keep track of variations in the relationship between different words, phrases and sentences. DocuBurst [6] uses a radial, space-filling layout to depict the document content by visualizing the structured text. The structured text in this visualization refers to the is-kind-of or is-type-of relationship. These visualizations offer an effective overview of individual document features, but they cannot provide a comparative analysis for multiple documents. In contrast to single-document visualizations, there are relatively few attempts to differentiate features among multiple documents. Notable exceptions include TagLine Generator [8], Parallel Tag Clouds [9], ThemeRiver [10] and SparkClouds [11]. Tagline Generator [8] generates chronological tag clouds from multiple documents without manual tagging of the data entries. Because the TagLine Generator can only display one document at a time, it is unable to reveal the relationships among multiple documents. A much better visualization for this purpose is Parallel Tag Clouds [9]. This visualization combines parallel coordinates and tag clouds to provide a rich overview of a document collection. Each vertical axis represents a document. The words in each document are summarized in the form of tag clouds along the vertical axis. When clicking on a word, the same word appearing in other vertical axes is connected. Several filters can be defined to reduce the amount of text displayed in each document. One disadvantage of this visualization is its inability to display groups of words which are missing in one document but frequently appear in the others. When we explore the variations among the Othello translations, the domain experts would like to know groups of words which a particular author never uses but which frequently appear in other authors' work. Also, brushing multiple words in different documents might introduce clutter due to the crossing lines in parallel tag clouds. We also observe some interesting visualizations which can depict time trends over different documents. SparkClouds [11] integrates sparklines into a tag cloud to convey trends between multiple tag clouds over time. Results of a controlled study that compares SparkClouds with traditional trend visualizations, such as multiple line graphs, stacked bar charts and Parallel Tag Clouds, show that SparkClouds is more effective at showing trends over time. The ThemeRiver [10] visualization depicts thematic variations over time within a large collection of documents. The thematic changes are shown in the context of a time line and corresponding external events. Ours is, to our knowledge, the first work that compares multiple translations of a single play.
3 Background Data Description
The domain experts from Arts and Humanities have collected 57 different German translations of Shakespeare's play, Othello. For each translation, the recorded metadata includes the author name, publication date, country, title of the play and impact index. The translations were written between 1766 and 2006 in seven different countries, defined as Germany (pre-1949), East Germany (1949-1989), West Germany (1949-1989), FRG (Germany since 1989), Austria, Switzerland and England. The impact index refers to each translator's productivity and reputation. It includes the re-publication figures of
Fig. 1. This image illustrates the distribution of our collected German Othello translations. The X-axis is mapped to the publication date and Y-axis to seven different countries. The dot size is mapped to the impact index. A larger radius depicts a translation with higher re-publishing figures.
each Othello translation. Figures were derived from the standard bibliography of Shakespeare in German [12]. The index has five levels ranging from 1 to 5, where 1 means that the translator is not listed in the bibliography and 5 means that more than 50 publications and re-publications by the translator are listed in the bibliography. Figure 1 shows the chronological distribution of our collected documents. The X-axis is mapped to the publication date and the Y-axis to the different countries. The ellipse radius is mapped to the impact index.
4 Text Preprocessing
Before the original translations can be analyzed within our visualizations, we need to generate various features from the textual information and transform them into numerical vectors. In this work, we process the original text in five steps, namely document standardization, tokenization, stemming, vector generation and similarity calculation. The major outputs are a concordance of each document and the similarities between documents. Since the Othello translations are collected from various sources (some PDF, some archival typescripts, mostly books), we first transform and integrate them into a standard XML format. Next, document tokenization breaks the stream of text into a list of individual words or tokens. During this process, common words carrying little meaning which are not of interest to domain experts, such as "der" (the), "da" (that), etc., are eliminated from the token list. Furthermore, stemming reduces all of the tokens to their root forms. Based on this cleaned and standardized token list, we are able to generate a concordance table for each document by counting the frequency of every unique token. For in-depth document comparison, we also need an objective document similarity measure. The domain experts from Arts and Humanities suggest a list of high-frequency keywords as a search query. This keyword list can be extracted from multiple interesting documents. The similarity between our collected translations can then be measured
using the LSI (Latent Semantic Indexing) model [13]. This model is widely used in information retrieval, where the list of terms associated with their weights is treated as the document vector. The weight of each term indicates its importance in a document and is given by tf x idf. We use tf (term frequency) to refer to the number of times a term occurs in a given document, which measures the importance of a word in that document. Idf (inverse document frequency), as its name implies, is the inverse of the document frequency. The document frequency is the number of documents in which a word occurs within the collection of documents. Thus the weight of a term i in document j can be defined as:
$$w_{i,j} = tf_{i,j} \times idf_i = tf_{i,j} \times \log \frac{N}{df_i}$$
where N is the total number of documents in the corpus, df_i is the document frequency of term i and idf_i is its inverse document frequency. A large value of w_{i,j} implies that term i is an important word in document j but not common across all N documents. A document j can then be represented as a vector whose dimensions are the term weights:
$$D_j = (w_{0,j}, w_{1,j}, \ldots, w_{n,j})^T$$
A large number of words in the search query might lead to an extremely high-dimensional document vector, so we use the SVD (Singular Value Decomposition) to perform a dimension reduction. The similarity between two documents j and k can then be measured by the angle between these two vectors:
$$\cos \mathrm{Sim}(D_j, D_k) = \frac{D_j \cdot D_k}{|D_j|\,|D_k|}$$
Such similarity measures are generated for all of our Othello translations. This information is featured in our treemap and parallel coordinates.
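The weighting and similarity computation above can be sketched as follows. This is a minimal illustration only: the SVD/LSI dimension reduction step is omitted, and the function names are ours, not taken from the system.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Build tf-idf document vectors, w_{i,j} = tf_{i,j} * log(N / df_i)."""
    N = len(token_lists)
    tfs = [Counter(tokens) for tokens in token_lists]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # count each term once per document
    vocab = sorted(df)
    vectors = [[tf[t] * math.log(N / df[t]) for t in vocab] for tf in tfs]
    return vectors, vocab

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```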
5 Structure-Aware Treemap
As discussed in Section 3, the metadata of each document includes the author name, play title, date, place of publication and impact index. The scatterplot in Figure 1 is able to present the overall historical distribution, but it cannot provide an aggregation of the data. For example, if the user wants to explore or rank the total number of translations, or the total number of re-publications in any century, decade or country in our document collection, the scatterplot is unable to convey an answer. In addition, we observe that the metadata can be arranged in a hierarchical structure. For example, each century breaks down into several decades. In each decade a few translations are published in several countries. In each country several authors published their work. For each author, his translations have an impact index. Given this structure, we are able to generate a Treemap [14,15] visualization. The traditional treemap is able to compare node values at any tree level, but it lacks the ability to show the entire tree structure intuitively. For tracing the treemap
Fig. 2. This image illustrates the interface of our structure-aware treemap. The left part shows the control panel, with which the user is able to manipulate the tree hierarchy, compare the values at each hierarchy level via a bar chart and set up the configuration for the visualization. The user is also able to select documents of interest from the spreadsheet. The right part shows the treemap and DOI tree. The area of a leaf node is mapped to the quantity. As we drill up and down through the tree levels, the DOI tree keeps track of the structure. The DOI tree can also initiate a search task.
hierarchy, it is necessary to list only the relevant substructure which shows the ancestors and descendants of the node of interest. The Degree-of-Interest tree [16] provides a clear hierarchy at a low cost of screen space by changing the viewpoint and filtering out the uninteresting tree nodes. In addition, it offers instant readability of the node labels. Therefore, we adopt linked views using both the DOI tree and the treemap to enable structure tracing. Our system is composed of two parts, namely the control panel and the structure-aware treemap. The control panel is shown on the left half of Figure 2. It extracts the ontological hierarchy information from the input data sets and sets up the configuration for the visualization. The user is able to change the order of the hierarchy or reduce the number of hierarchy levels by moving the graph nodes. The right half of Figure 2 is a structure-aware hierarchical visualization, containing the coordinated views of the squarified treemap and the DOI tree [16]. As we traverse back and forth between the intermediate levels of the treemap, the DOI tree view clearly keeps track of how each selected node is derived from its ancestors. The area of a leaf node can be mapped either to the impact index, the similarity measure or the quantity. In Figure 2, from the bar chart, we learn that most of our collected translations were published in the twentieth century. Within this century, most translations were published in the 1940s and 1970s. For the domain specialists, this raises questions about possible correlations with comparable datasets (translations of other or all Shakespeare plays), and about possible correlations between periods in German history and specific interest in Othello. By changing the hierarchy, we also learn that
Fig. 3. This image shows an overview of our visualization. The parallel coordinates illustrates a focus view of the term frequency. The text boxes below the parallel coordinates show the context views. They present the entire sentences from the original text where each keyword appears.
although the documents are all translations of Othello, they have different titles: the commonest titles of the translations are "Othello" or "Othello, der Mohr von Venedig", some authors use the title "Die Tragödie von Othello, dem Mohren von Venedig", two use "Othello, der Maure von Venedig" and one author uses the title "Othello, Venedigs Neger". These outliers are of particular interest to the domain experts. Our treemap system helps users manage their documents, for example by ranking the documents according to different criteria, analyzing the global features of the metadata and selecting interesting documents. It can be scaled up to include new datasets, such as translations of other works by Shakespeare, and enables users to explore common patterns in the metadata. The DOI tree can initiate a search task by which a user is able to search for terms at any hierarchy level. Since the collection of our German translations is still expanding, our treemap will play an increasingly important role in the metadata analysis.
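The century, decade, country and author hierarchy the treemap is built on can be assembled from the metadata records roughly as follows; the record fields and the sample entries here are purely hypothetical stand-ins for the real corpus.

```python
# hypothetical records; the real metadata has author name, publication date,
# country, title and impact index, as described in Section 3
records = [
    {"year": 1766, "country": "Germany (pre-1949)", "author": "Translator A", "impact": 4},
    {"year": 2003, "country": "FRG", "author": "Translator B", "impact": 1},
]

def build_hierarchy(records):
    """Aggregate translations into century -> decade -> country -> author,
    storing the impact index at the leaves (usable as the leaf-node area)."""
    tree = {}
    for r in records:
        century = (r["year"] // 100) * 100
        decade = (r["year"] // 10) * 10
        node = (tree.setdefault(century, {})
                    .setdefault(decade, {})
                    .setdefault(r["country"], {}))
        node[r["author"]] = r["impact"]
    return tree
```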
6 Focus+Context Parallel Coordinates
Parallel coordinates, introduced by Inselberg and Dimsdale [17,18], are a widely used visualization technique for exploring large, multidimensional data sets. They are powerful in revealing a wide range of data characteristics such as different data distributions and functional dependencies [19]. As discussed in Section 4, the textual information of each document can be transformed into a vector. In our parallel coordinates, we encode the document dimensions as term frequencies.
Fig. 4. In this image, we obtain five keywords which only appear once in all documents
Domain experts from Arts and Humanities selected eight interesting translations according to their similarity score. For the initial analysis, we chose a significant passage from the play, Othello's big speech to the Venetian Senate in Act 1, Scene 3: the longest single speech in the play (about 300 words in Shakespeare's text). Figure 3 shows an overview of our visualization. The column on the far left displays a list of selected keywords: these are the most frequently occurring significant words in the document corpus. The parallel coordinates present a focused view of keyword frequencies. Each document is represented by a vertical axis. In order to maintain a unified scale, the height of each vertical axis is made proportional to the range between each document's minimal and maximal word frequencies. Zero frequency simply means that a keyword does not occur in that document. The thickness of each vertical axis is mapped to the document's similarity with the others in terms of the LSI score: a thicker line means a higher similarity value. The number of occurrences of each keyword in each document is connected by a polyline. Each polyline is rendered in a different color to enable visual discrimination. The text boxes below the parallel coordinates provide context views for keywords selected by the user. Each text box represents an individual document and shows the entire sentences from the original text in which each selected keyword occurs. We also apply edge bundling to enhance the visual clustering, and the user is able to control the curvature of the edges [20]; curves with the least curvature become straight lines. We provide various interaction techniques, such as selection, brushing and linking. As the user selects individual or multiple keywords, the corresponding polylines are rendered. The user can also select various frequency levels in any document, and the corresponding keywords having that frequency are displayed. Along with the selection and brushing, the text boxes showing the context views are updated accordingly.
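The unified axis scale can be expressed as a simple linear mapping; the sketch below is our own reading of the description above, and the names and pixel scale are assumptions.

```python
PIXELS_PER_OCCURRENCE = 4.0  # assumed unified scale shared by all axes

def axis_height(freq_min, freq_max):
    """Axis height proportional to the document's frequency range."""
    return (freq_max - freq_min) * PIXELS_PER_OCCURRENCE

def axis_position(freq, freq_min):
    """Vertical position of one keyword frequency on its document axis."""
    return (freq - freq_min) * PIXELS_PER_OCCURRENCE
```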
Fig. 5. In this image, there are two keywords showing a strong correlation
Our system also supports composite brushing, such as an AND-brush or OR-brush [21]. We can use the AND-brush to obtain all keywords which occur in every document: words used by all translators regardless of the translators' reputations and impact. If we brush the keywords which do not appear in the document "Baudissin 1958", we learn that this document contains all the keywords except "fand". This helps to explain why this document has the highest similarity score. The domain experts indicated that this finding is surprising and interesting. As shown in Figure 4, we observe five keywords which appear just once in all the documents. From the context views, the sentences containing these words are almost the same in every translation. As shown in Figure 5, there are two keywords showing a strong correlation. Both findings raise interesting questions for the domain experts.
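In set terms, these composite brushes amount to intersections and unions over per-document keyword sets, as sketched below. This is our own set-based reading, not the system's implementation.

```python
def and_brush(occurrences):
    """Keywords occurring in every document.
    occurrences: dict mapping document name -> set of keywords it contains."""
    sets = list(occurrences.values())
    return set.intersection(*sets) if sets else set()

def or_brush(occurrences):
    """Keywords occurring in at least one document."""
    return set().union(*occurrences.values())

def never_used_by(doc, occurrences):
    """Keywords used by other translators but never by `doc`."""
    others = or_brush({d: k for d, k in occurrences.items() if d != doc})
    return others - occurrences[doc]
```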
7 Domain Expert Reviews
The focus+context parallel coordinates permit comparative visualization and exploration of concordances. A concordance is normally displayed as a simple list of words in a vertical column (in order of frequency or alphabetically). Standard concordance software also offers the option to display contexts of use for a particular word (i.e., the different word strings in which a word appears). This tool successfully combines a concordance-derived keyword list and context views with a display of the frequencies of words across multiple, comparable versions, in the form of parallel coordinates. This is a promising way of exploring texts through their different uses of meaningful words. In the display of parallel coordinates, the composite brushing enables us to filter for any correlations between word uses, positive or negative: pairs or groups of words which appear
together, or never appear together. The similarity score of each document gives us an objective measure of how similar each document is to the keyword lists. In this particular case, the visualization tells us that Baudissin's translation, which is the standard, most often republished and performed German translation of the play, contains the most keywords in this speech which are common to most of the other translations. Since other translations are produced and marketed as "alternatives" to Baudissin, this high degree of apparent dependency on the standard translation is surprising, and it demands further investigation. Our current corpus of German Othello translations is relatively small (under 60 documents), but we envisage it growing: in respect of other works (Shakespeare's many other plays and poems, and potentially works by other writers) and also in respect of other languages of translation (at least one of Shakespeare's works exists in about 100 languages). Hence, the flexible metadata overview offered by the structure-aware Treemap visualization will become increasingly valuable in managing the dataset, exploring its various dimensions and selecting subsets of translations for further analysis.
8 Conclusion
In this paper, we describe an interactive visualization system for presenting, analyzing and exploring the variation among different German translations of Shakespeare's play, Othello. A structure-aware treemap is developed for metadata analysis, and the focus + context parallel coordinates are developed to investigate the variations among the translations. Our parallel coordinates incorporate an objective similarity measure for each document using the LSI model. Also, various interaction techniques are provided to support the information-seeking mantra: overview first, zoom and filter, then details on demand. Our visualization was evaluated by the domain experts from Arts and Humanities. Since this is just the beginning of our project, in the future we would like to add more advanced features to the parallel coordinates, such as visual clustering. Also, we will continue collecting more translations. Further statistical and linguistic analysis will be implemented. Acknowledgments. This study was funded by Swansea University's Research Institute for Arts and Humanities (Research Initiatives Fund). The conference trip is supported by the Computer Science Department of Swansea University. We are grateful to ABBYY Ltd for allowing us to use their unique Optical Character Recognition package which can handle the old German Fraktur font, used in many of the Othello books.
References 1. Scott, B., Carl, G., Miguel, N.: Seeing Things in the Clouds: The Effect of Visual Features on Tag Cloud Selections. In: HT 2008: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia, pp. 193–202. ACM, New York (2008) 2. Viegas, F.B., Wattenberg, M., Feinberg, J.: Participatory Visualization with Wordle. IEEE Transactions on Visualization and Computer Graphics 15, 1137–1144 (2009) 3. Wattenberg, M., Viegas, F.B.: The Word Tree, an Interactive Visual Concordance. IEEE Transactions on Visualization and Computer Graphics 14, 1221–1228 (2008)
4. van Ham, F., Wattenberg, M., Viégas, F.B.: Mapping Text with Phrase Nets. IEEE Transactions on Visualization and Computer Graphics 15, 1169–1176 (2009) 5. Paley, W.B.: TextArc: An Alternative Way to View Text (2002), http://www.textarc.org/ (last access date: 2011-2-18) 6. Collins, C., Carpendale, M.S.T., Penn, G.: DocuBurst: Visualizing Document Content using Language Structure. Computer Graphics Forum 28, 1039–1046 (2009) 7. Koh, K., Lee, B., Kim, B.H., Seo, J.: ManiWordle: Providing Flexible Control over Wordle. IEEE Transactions on Visualization and Computer Graphics 16, 1190–1197 (2010) 8. Mehta, C.: Tagline Generator - Timeline-based Tag Clouds (2006), http://chir.ag/projects/tagline/ (last access date: 2011-2-18) 9. Collins, C., Viegas, F.B., Wattenberg, M.: Parallel Tag Clouds to Explore and Analyze Faceted Text Corpora. In: IEEE Symposium on Visual Analytics Science and Technology, pp. 91–98. IEEE Computer Society, Los Alamitos (2009) 10. Havre, S., Hetzler, E., Whitney, P., Nowell, L.: ThemeRiver: Visualizing Thematic Changes in Large Document Collections. IEEE Transactions on Visualization and Computer Graphics 8, 9–20 (2002) 11. Lee, B., Riche, N.H., Karlson, A.K., Carpendale, M.S.T.: SparkClouds: Visualizing Trends in Tag Clouds. IEEE Transactions on Visualization and Computer Graphics 16, 1182–1189 (2010) 12. Blinn, H., Schmidt, W.G.: Shakespeare - deutsch: Bibliographie der Übersetzungen und Bearbeitungen (2003) 13. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41 (1990) 14. Johnson, B., Shneiderman, B.: Tree Maps: A Space-Filling Approach to the Visualization of Hierarchical Information Structures. In: IEEE Visualization, pp. 284–291 (1991) 15. Shneiderman, B.: Tree Visualization With Treemaps: a 2-d Space-filling Approach. ACM Transactions on Graphics 11, 92–99 (1992) 16. Card, S.K., Nation, D.: Degree-of-Interest Trees: A Component of an Attention-Reactive User Interface. In: Working Conference on Advanced Visual Interfaces (AVI), pp. 231–245 (2002) 17. Inselberg, A., Dimsdale, B.: Parallel Coordinates: A Tool for Visualizing Multi-dimensional Geometry. In: Proceedings of IEEE Visualization, pp. 361–378 (1990) 18. Inselberg, A.: Parallel Coordinates: Visual Multidimensional Geometry and Its Applications. Springer, Heidelberg (2009) 19. Keim, D.A., Kriegel, H.P.: Visualization Techniques for Mining Large Databases: A Comparison. IEEE Transactions on Knowledge and Data Engineering 8, 923–938 (1996) 20. Zhou, H., Yuan, X., Qu, H., Cui, W., Chen, B.: Visual Clustering in Parallel Coordinates. Computer Graphics Forum 27, 1047–1054 (2008) 21. Hauser, H., Ledermann, F., Doleisch, H.: Angular Brushing of Extended Parallel Coordinates. In: Proceedings of IEEE Symposium on Information Visualization, pp. 127–130. IEEE Computer Society, Los Alamitos (2002)
3D Object Modeling with Graphics Hardware Acceleration and Unsupervised Neural Networks
Felipe Montoya–Franco¹, Andrés F. Serna–Morales², and Flavio Prieto¹
¹ Department of Mechanical and Mechatronics Engineering, Universidad Nacional de Colombia, Sede Bogotá, Carrera 30 No 45–03, Bogotá, Colombia, Tel.: +57 (1) 316 5000 Ext. 14103, {lfmontoyaf,faprietoo}@unal.edu.co
² Department of Electrical, Electronic and Computer Engineering, Universidad Nacional de Colombia, Sede Manizales, Carrera 27 No. 64–60, Manizales, Colombia, Tel.: +57 (6) 8879300 Ext. 55798, [email protected]
Abstract. This paper presents a methodology for achieving higher performance when modeling 3D virtualized reality objects using Self-Organizing Maps (SOM) and Neural Gas Networks (NGN). Our aim is to improve the training speed of unsupervised neural networks when modeling 3D objects by means of a parallel implementation on a Graphics Processing Unit (GPU). Experimental tests were performed on several virtualized reality objects such as phantom brain tumors, archaeological items, faces and fruits. In this research, the classic SOM and NGN algorithms were adapted to the data-parallel GPU and compared to a similar implementation on a CPU-only platform. We present evidence that rates NGN as a better neural architecture than SOM, in quality terms, for the task of 3D object modeling. In order to combine the accuracy of NGN with the faster training of SOM, we propose and implement a hybrid neural network based on NGN that uses SOM as a seed. Our experimental results show a considerable reduction in training time without affecting the representation accuracy.
1 Introduction
In three-dimensional (3D) modeling there are two main approaches. The first one, called Virtual Reality, refers to the artificial construction of models of real or imaginary objects for applications in video games, visual and industrial design, prototype development, among others. Typically, these scenarios are built using Computer Aided Design (CAD) software. The second approach, called Virtualized Reality, refers to developing models of real-world objects from information obtained by a sensor. This work focuses on the three-dimensional modeling of virtualized reality objects from our previous works [11,12], such as phantom brain tumors, archaeological items, faces and fruits. In the context of computer graphics, 3D modeling is the process of giving a faithful representation of the objects in a scenario in order to carry out studies of their properties. One of the greatest challenges is to make simple models and reduce complexity [3]. There are several aspects to be taken into account when building a 3D model: acquisition, modeling purpose, implementation complexity, computational cost, adequate visualization and easy manipulation. All these aspects influence the choice, design and
performance of the model. This paper presents a 3D modeling methodology using unsupervised learning architectures such as Self-Organizing Maps (SOM) and Neural Gas Networks (NGN). We present experimental comparisons in terms of performance, computational time and training epochs. Nowadays, Graphics Processing Units (GPU) have emerged as a possible desktop parallel computing solution for large-scale scientific computation [15]. In this project, we explored their abilities and limitations with the proposed algorithms on a set of virtualized reality objects for 3D modeling. Here, training is the most expensive stage, because an exhaustive search is applied to find and update the winning neurons. This procedure can be implemented on a GPU in a parallel way, drastically reducing the computational cost. In several papers, the potential of GPU technology has been demonstrated for problems like non-linear optimization [16], pattern search [15], quadratic assignment problems [6] and 3D object retrieval [14]. For this research, the algorithms are implemented in the Compute Unified Device Architecture (CUDA) environment using the Thrust library [4], on an NVIDIA GeForce GTX 460 SE GPU. The main contribution of this paper is the parallelization and analysis of the NGN, SOM and hybrid SOM-NGN algorithms using GPU acceleration. The paper is organized as follows: Section 2 describes the use of SOM and NGN for the 3D modeling task. Section 3 explains the graphics hardware acceleration of the NGN algorithm and reports experimental comparisons between GPU and CPU implementations. Section 4 describes 3D modeling using SOM as a seed for NGN and presents results using different configurations of the algorithm. Finally, Section 5 concludes the work and outlines future research.
2 Modeling Using Self-Organizing Architectures
The self-organizing architecture is a type of neural network based on unsupervised learning which can be used for the 3D modeling of virtualized reality objects. Two of the most popular self-organizing architectures are Self-Organizing Maps (SOM) and Neural Gas Networks (NGN) [7]. In both SOM and NGN, each synaptic weight vector encodes the 3D coordinates of a point cluster on the object surface. During training, the model falls asymptotically towards the points in the input space according to a density function, and therefore it takes the shape of the object coded in the input point cloud.
2.1 Self-Organizing Maps (SOM)
A SOM is a set of neurons organized on a regular grid with rectangular or hexagonal connections. The input data is a set of coordinates (x, y, z) of a point cloud representing the object. Therefore, each neuron is represented by a weight vector m_i ∈ R^3. The aim of the SOM algorithm is to approximate the input space ξ by prototypes or pointers in the form of synaptic weights m_i, whereby the characteristic map Φ gives a faithful representation of the input vectors [5].
SOM Training. The SOM is trained iteratively as explained below. At each training stage, a random vector x is selected from the input dataset. The distance between x and all SOM weight vectors is calculated. The neuron whose weight vector is closest
to the input vector x is called the Best Match Unit (BMU), denoted as c. The BMU and its topological neighbors are updated using the adaptation rule presented in Equation 1, where α(t) is the learning rate and h_ci(t) is a neighborhood kernel around the BMU, which defines the region of influence of the input vectors on the map. We use a Gaussian neighborhood function h_ci(t) = exp(−d_ci^2 / 2σ_t^2), where σ_t is a linearly decreasing function and d_ci is the distance between neurons c and i through the grid connections.

m_i(t + 1) = m_i(t) + α(t) h_ci(t) [x(t) − m_i(t)]   (1)
Before training, the input data are normalized and the synaptic weights are initialized randomly along the two Principal Components (PC) of the data distribution, extracted by PCA. Neighborhoods are defined by a rectangular grid.
2.2 Neural Gas Networks (NGN)
A NGN is composed of neurons which move in the input space during training. The positions of the BMU and its neighbors are updated at each training epoch. Unlike SOM, NGN does not use a neuron ranking based on grid connections; it uses a neighborhood ranking based on the closest neurons in the entire neural set. In this sense, NGN neurons can move freely through the input space and their topographic error is zero [13]. Compared with SOM, NGN converges faster, achieves lower distortion errors, has a higher computational cost (which increases with the computational complexity of the sorting algorithm) and achieves better performance using smaller training sets [7,10,9].
NGN Training. When each vector x is presented, we compute the neighborhood ranking (w_i0, w_i1, ..., w_i(N−1)) of the synaptic weight vectors, where w_i0 and w_i(N−1) are the closest and farthest vectors to x, respectively. The weight adaptation rule is given by Equation 2, where ε is the learning rate, h_λ(k_i(x, w)) = exp(−k_i(x, w)/λ) is the neighborhood function, and k_i(x, w) denotes the position of neuron i in the neighborhood ranking.

Δw_i = ε · h_λ(k_i(x, w)) · (x − w_i)   (2)
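To make the two update rules concrete, the following minimal CPU sketch implements one adaptation step for both SOM (Eq. 1) and NGN (Eq. 2) on a 3D point cloud. It is only an illustration under our own assumptions (a rectangular SOM grid indexed row by row, Euclidean grid distance for d_ci, and ad-hoc function and parameter names); it is not the implementation used in the paper.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

struct Vec3 { float x, y, z; };

static float dist2(const Vec3& a, const Vec3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// SOM step (Eq. 1): find the BMU and pull every neuron towards the sample x,
// weighted by a Gaussian over its grid distance to the BMU.
void som_step(std::vector<Vec3>& w, int grid_w, const Vec3& x,
              float alpha, float sigma) {
    int bmu = 0;
    for (int i = 1; i < (int)w.size(); ++i)
        if (dist2(w[i], x) < dist2(w[bmu], x)) bmu = i;
    for (int i = 0; i < (int)w.size(); ++i) {
        int du = i % grid_w - bmu % grid_w;      // grid offset to the BMU
        int dv = i / grid_w - bmu / grid_w;
        float d = std::sqrt(float(du * du + dv * dv));
        float h = std::exp(-d * d / (2.0f * sigma * sigma));
        w[i].x += alpha * h * (x.x - w[i].x);
        w[i].y += alpha * h * (x.y - w[i].y);
        w[i].z += alpha * h * (x.z - w[i].z);
    }
}

// NGN step (Eq. 2): rank all neurons by distance to x and update each one
// according to its rank k_i in the neighborhood ranking.
void ngn_step(std::vector<Vec3>& w, const Vec3& x, float eps, float lambda) {
    std::vector<int> rank(w.size());
    std::iota(rank.begin(), rank.end(), 0);
    std::sort(rank.begin(), rank.end(), [&](int a, int b) {
        return dist2(w[a], x) < dist2(w[b], x);
    });
    for (int k = 0; k < (int)rank.size(); ++k) {
        Vec3& wi = w[rank[k]];
        float h = eps * std::exp(-(float)k / lambda);
        wi.x += h * (x.x - wi.x);
        wi.y += h * (x.y - wi.y);
        wi.z += h * (x.z - wi.z);
    }
}
```

The sort over all neurons in the NGN step is exactly the extra cost discussed in Section 2.2, which motivates the GPU acceleration in the next section.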
2.3 Comparison between SOM and NGN
To test the performance of the two neural architectures in 3D modeling, virtualized reality objects from our previous works were used [11,12,2]. With these virtualized reality objects, we compute two error metrics: the quantization errors and the Hausdorff distances. The quantization errors are computed as the absolute average Euclidean distance from each input point to its corresponding BMU. The Hausdorff distance measures how far two subsets of a metric space are from each other [1]. Table 1 presents error comparisons between SOM and NGN. Note that, in all cases, NGN reaches lower errors than SOM, which implies better adaptation. Figure 1 presents the 3D modeling of an archaeological item. Subfigures 1(b) and 1(c) show the SOM and NGN modeling results, respectively. Both SOM and NGN are trained during 15 epochs using rough tuning (8 ≥ σ_t > 3) and 50 epochs using fine tuning (3 ≥ σ_t ≥ 1). The numerical performance from this test is presented in Table 1 under the item named Cat.
Table 1. SOM and NGN errors

Model         Absolute Average Euclidean distance    Hausdorff distance
              SOM        NGN                         SOM        NGN
Jar           0.0248     0.0149                      0.1700     0.0934
Bottle        0.0176     0.0079                      0.1714     0.0520
Snail shell   0.0152     0.0077                      0.2022     0.0469
Bell          0.0027     0.0019                      0.0292     0.0155
Cat           0.0111     0.0053                      0.0865     0.0245
Head          0.0144     0.0082                      0.1309     0.0503
Face          0.0030     0.0018                      0.0218     0.0113

(a) Input point cloud   (b) SOM model   (c) NGN model
Fig. 1. Example of 3D modeling from one of the archaeological items used and the unsupervised neural networks obtained
3 Graphics Hardware Acceleration
Given that the NGN algorithm proved to achieve better accuracy in the model representation, but lacked the speed of the SOM, the next approach was to find a faster processor. This led to the use of a GPU. The three main steps in both training algorithms are: calculating the Euclidean distance from the neurons to every new given input vector, defining the neighborhood, and updating it with the corresponding adaptation rule (Equations 1 and 2). This sequence is repeated until the training length is reached. The training length can be measured in either training epochs or adaptation periods [7], where an epoch corresponds to n adaptation steps, n being the number of input vectors. Since the original algorithms are iterative and sequential, the process requires each task to be completed before the next is started, which is not suitable for full parallelization. The only solution is to have the parallelization happen within each single step. Nevertheless, due to the large time required for those individual steps and the large extent of the vector representing the neural network, the training can still be run efficiently on a GPU. Figure 2 presents the schema of the learning process running in a serialized way, composed of parallel routines with a specific multi-processor thread running for each neuron in each single step. Using Thrust as a template library allows a fast implementation of the transformations needed to perform the Euclidean distance calculation and the adaptation steps on the high-performance GPU. The NGN ranking determination requires a fast sorting that performs well on large datasets with more than 10^6 points. The back40computing [8] algorithms are appropriate for this task, and they are included in the Thrust Library
Fig. 2. Schema of the training algorithm on a GPU
[4]. In the implementation, the sorting algorithm template is set to use sorting by key with a built-in data type (float, due to the device's 1.3 Compute Capability). This ensures that the radix sort algorithm handles the sorting task. The overall training time depends on both the training length and the performance of the neighborhood function. Additionally, this performance depends on the number of neurons n. For SOM, the neighborhood function has linear complexity O(n). On the other hand, for NGN, this function has the same complexity as the sorting algorithm used (radix sort), that is O(n + k), which asymptotically converges to O(n) [8]. The training length l = i × q depends on the input dataset size i and the expected quality q. Once a given training starts, the training length is set and the three variables are constant. Therefore, the durations of NGN and SOM in a GPU implementation should scale according to O(n + k) and O(n), respectively.
3.1 Comparison between GPU and CPU Implementations
In order to compare the performance of the GPU and CPU implementations, different tests were run on the NGN algorithm. The first test compared the processing time between GPU and CPU when varying the neural network size, since it is the only parameter that scales differently in the two implementations. The training length was kept constant at 42,000 adaptation steps. Figure 3 shows the training time comparison between both implementations. In the second test, both the optimal training length and the optimal neural network size were searched for using two datasets with 3 × 10^4 and 11 × 10^4 points, respectively. Figure 4 shows the NGN errors when varying these parameters. Errors were computed as the average distance from the input points to the BMU and the Hausdorff distance from the input points to the NGN neurons. Figure 4(a) suggests that the training reaches a stable quality after 30 epochs. Meanwhile, Figure 4(b) naturally shows that the quality rises with the network size up to a size a little larger than the input data itself. However, this cannot always be pursued, because then the data size would not be reduced. Regarding training time, the parallelization of each step of the NGN algorithm reaches a speed-up of up to 5 times the CPU speed. Results for different network sizes are presented in Table 2. The CPU time is estimated as t_CPU = n × a_CPU, where n is the
Fig. 3. Training time comparison between CPU and GPU implementations
Fig. 4. Optimal training length and network size curves
training length and a_CPU is the time of an individual adaptation step in a training with the same network size. These data are taken from the measurements used for Figure 4(b). Note that a 1.0x speed-up indicates that both GPU and CPU implementations consumed the same training time.
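As an illustration of the data-parallel step described at the beginning of this section, the following Thrust/CUDA sketch computes the distances of all neurons to one input sample in parallel and derives either the BMU (SOM) or the full neighborhood ranking via sort-by-key (NGN, a radix sort on float keys). The packed weight layout, the function names and the surrounding host code are our own assumptions and are not taken from the paper's implementation; the snippet has to be compiled with nvcc.

```cpp
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>

struct SquaredDist {
    const float* w;   // packed neuron weights: 3 floats per neuron
    float x, y, z;    // current input sample
    __host__ __device__ float operator()(int i) const {
        float dx = w[3 * i]     - x;
        float dy = w[3 * i + 1] - y;
        float dz = w[3 * i + 2] - z;
        return dx * dx + dy * dy + dz * dz;
    }
};

void rank_neurons(thrust::device_vector<float>& weights,  // size 3*n
                  thrust::device_vector<float>& dist,     // size n
                  thrust::device_vector<int>&   rank,     // size n
                  float x, float y, float z) {
    int n = (int)rank.size();
    SquaredDist op = { thrust::raw_pointer_cast(weights.data()), x, y, z };

    // one parallel thread per neuron computes its distance to the sample
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(n),
                      dist.begin(), op);

    // SOM would only need the BMU, i.e. the minimum distance:
    // int bmu = thrust::min_element(dist.begin(), dist.end()) - dist.begin();

    // NGN needs the full neighborhood ranking, obtained by sorting the
    // neuron indices by their distance keys.
    thrust::sequence(rank.begin(), rank.end());
    thrust::sort_by_key(dist.begin(), dist.end(), rank.begin());
    // the adaptation rule (Eq. 2) can then be applied with a further transform
}
```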
4 Using SOM as Seed for NGN
Both the SOM and NGN algorithms carry out similar steps during their learning process. However, the main difference lies in the selection of the neighborhood. In the SOM training, the calculation of the distance from each neuron to the BMU is simple and fast, since it involves only the use of a metric over the connection grid. This is computationally simple and implies fast training, but it lacks accuracy due to topological errors [12]. On the other hand, in the NGN algorithm there are no grid connections between neurons, and the neurons can move freely through the input space. In this sense, this algorithm achieves better adaptation than the SOM. This advantage implies paying the computational cost associated with the sorting process when the neighborhood ranking is calculated. This cost is especially noticeable when training large neural networks with
Table 2. Training time for different virtualized objects

Neurons   Training Length   GPU Time (seconds)   CPU Time (Estimated)   Speed up
10000     882630            3780.5               2013.5                 0.5x
15000     3385800           11035.4              10551.3                1.0x
40000     882630            4125.5               8181.0                 2.0x
50000     882630            4348.7               13209.2                3.0x
80000     882630            4868.3               21679.7                4.5x
100000    882630            5500.4               27444.6                5.0x
Fig. 5. Intermediate step that takes place when the neural connections break in the process of changing from SOM 5(a) to NGN 5(b) mode at 50% of the training. Finished training and original data 5(c). Input network was a point cloud randomly distributed over the bounding cube.
large datasets (more than 10^4 points). This situation can be improved using a hybrid neural network that combines the accuracy of NGN with the faster training of SOM. For this, we use the SOM model as a seed for the NGN model. That means using the SOM neighborhood function h_ci(t) during a fraction of the training, and then breaking the neural connections in order to use the NGN neighborhood ranking (w_i0, w_i1, ..., w_i(N−1)) for the rest of the training. This hybrid model reaches a good representation quality while employing less training time than the traditional NGN algorithm. The transition between SOM and NGN training is depicted in Figure 5. To evaluate the performance of this hybrid model, several tests were carried out varying the number of training epochs used for SOM and NGN (the SOM/NGN ratio). Training times are shown in Table 3 and later in Figure 6(a), while quantization errors are shown in Table 4. For this test, 4 datasets with sizes ranging from 2.5 × 10^3 to 8 × 10^4 points were used. The training length was set to 30 epochs due to the results of the previous section. Since the SOM training only initializes the model, the accuracy depends mostly on the NGN. We also tested whether the SOM training could be done with a randomly downsampled dataset, further reducing the training time. This data reduction is done only for the SOM; when switching to NGN, the whole dataset is used. Therefore, it should not affect the quality of the representation. A second test was run to analyze the
Table 3. Training time (m) using different ratios of SOM/NGN

Dataset length   NGN       SOM      20% SOM   50% SOM   80% SOM   50% SOM (Reduced)
2 × 10^3         7.012     1.729    5.840     4.253     2.537     3.466
5.5 × 10^3       13.925    3.296    11.628    8.405     4.941     6.222
3 × 10^4         79.485    18.349   66.504    47.423    29.653    33.286
8 × 10^4         223.45    58.979   186.36    132.79    94.162    111.21
Table 4. Errors using different ratios of SOM/NGN

Dataset length   NGN      SOM      20% SOM   50% SOM   80% SOM   50% SOM (Reduced)
2 × 10^3         0.232    0.354    0.243     0.259     0.286     0.262
5.5 × 10^3       1.554    1.332    0.791     0.850     0.935     0.852
3 × 10^4         0.478    0.866    0.488     0.502     0.526     0.496
8 × 10^4         0.377    0.866    0.374     0.380     0.409     0.380
Fig. 6. Differences in time and quality when using downsampled data in SOM training
effect of the downsampled SOM with NGN training against the regular SOM with NGN training. Results of this comparison are depicted in Figure 6 and in the last column of Tables 3 and 4.
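The hybrid schedule described in this section can be summarized by the following rough sketch, which reuses the som_step/ngn_step helpers and the Vec3 type from the earlier sketch in Section 2; the SOM/NGN ratio, the downsampling factor and the learning-rate decays are illustrative values, not those used in the experiments.

```cpp
#include <vector>

// Run the first som_ratio fraction of the epochs with the SOM grid
// neighborhood (optionally on a downsampled copy of the data), then break
// the grid connections and finish with NGN ranking updates.
void hybrid_training(std::vector<Vec3>& w, int grid_w,
                     const std::vector<Vec3>& data,
                     int epochs, float som_ratio /* e.g. 0.5f */) {
    int switch_epoch = (int)(som_ratio * epochs);
    for (int e = 0; e < epochs; ++e) {
        float t = (float)e / epochs;                 // 0 .. 1, for decay
        if (e < switch_epoch) {
            const int k = 4;                         // assumed downsampling factor
            for (size_t s = 0; s < data.size(); s += k)
                som_step(w, grid_w, data[s],
                         0.5f * (1.0f - t), 8.0f * (1.0f - t) + 1.0f);
        } else {
            for (const Vec3& x : data)               // NGN phase: full dataset
                ngn_step(w, x, 0.3f * (1.0f - t), 10.0f * (1.0f - t) + 0.5f);
        }
    }
}
```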
5 Conclusions
We implemented a 3D modeling methodology based on neural networks using parallel programming techniques. The input patterns were point clouds (x, y, z) of virtualized reality objects. We used Neural Gas Networks (NGN) trained on an NVIDIA GPU, and we presented experimental comparisons against a CPU implementation in terms of performance, computational cost, training time and quality. The GPU implementation performed better than the CPU when comparing the training time. There is always a particular network-size value at which the CPU and GPU perform equally. This network size
may change on different hardware, but given the complexity of the sorting algorithm, the GPU always reaches a better performance situation. On the hardware used for the tests, the GPU was faster as long as the network size exceeded 2 × 10^4 neurons. We presented evidence that rates NGN as a better neural architecture than SOM, in quality terms, for the task of 3D object modeling. In order to combine the accuracy of NGN with the faster training of SOM, we proposed a training algorithm based on NGN using SOM as a seed. There was an improvement in time performance of up to half the training time, at a low error cost, when the SOM/NGN ratio was set to 50%. This time could be further optimized by using a reduced dataset in the SOM stages. In general, this did not affect the accuracy of the modeling. Future research will focus on the use of this algorithm for on-/off-line training with datasets which change over time. Acknowledgments. We want to thank the Universidad Nacional de Colombia at Manizales (DIMA Project No. 20201006025), and Colciencias-Colombia and CONACYT-Mexico (Calls No. 460 and 483 of 2009) for their financial support.
References 1. Dubuisson, M.-P., Jain, A.K.: A modified hausdorff distance for object matching. In: Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 1, pp. 566– 568 (October 1994) 2. Figueroa, P., Londoño, E., Prieto, F., Boulanger, P., Borda, J., Restrepo, D.: Experiencias virtuales con piezas del museo del oro de colombia (2006), http://www.renata.edu.co/index.php/ciencias-sociales/ (last accessed: December 20, 2010) 3. Hall, R.: Supporting complexity and conceptual design in modeling tools. In: Rogers, D.F., Earnshaw, R.A. (eds.) State of the Art in Computer Graphics: Visualization and Modeling, SIGGRAPH (1991) 4. Hoberock, J., Bell, N.: Thrust: A parallel template library. version 1.3.0 (2010), http://www.meganewtons.com/ (last accessed: April 10, 2011) 5. Kohonen, T., Nieminen, I., Honkela, T.: On the quantization error in som vs. vq: A critical and systematic study. In: Príncipe, J., Miikkulainen, R. (eds.) WSOM 2009. LNCS, vol. 5629, pp. 133–144. Springer, Heidelberg (2009) 6. Van Luong, T., Loukil, L., Melab, N., Talbi, E.-G.: A gpu-based iterated tabu search for solving the quadratic 3-dimensional assignment problem. In: IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2010, pp. 1–8 (May 2010) 7. Martinetz, T.M., Berkovich, S.G., Schulten, K.J.: Neural-gas network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks 4(4), 558–569 (1993) 8. Merrill, D., Grimshaw, A.: Revisiting sorting for gpgpu stream architectures. Technical Report CS2010-03, University of Virginia, Department of Computer Science, Charlottesville, VA, USA (2010) 9. Na, S., Xumin, L., Yong, G.: Research on k–means clustering algorithm: An improved kmeans clustering algorithm. In: Third International Symposium on Intelligent Information Technology and Security Informatics (IITSI), pp. 63–67 (April 2010) 10. Rivera-Rovelo, J., Bayro-Corrochano, E., Dillmann, R.: Geometric neural computing for 2d contour and 3d surface reconstruction. In: Bayro-Corrochano, E., Scheuermann, G. (eds.) Geometric Algebra Computing, pp. 191–209. Springer, London (2010)
11. Serna-Morales, A.F., Prieto, F., Bayro-Corrochano, E.: Spatio–temporal image tracking based on optical flow and clustering: An endoneurosonographic application. In: Sidorov, G., Hernández-Aguirre, A., Reyes-García, C. (eds.) MICAI 2010. LNCS, vol. 6437, pp. 290– 300. Springer, Heidelberg (2010) 12. Serna-Morales, A.F., Prieto, F., Bayro-Corrochano, E., Sánchez, E.N.: 3d modeling of virtualized reality objects using neural computing. In: International Joint Conference on Neural Networks, IJCNN, Universidad Nacional de Colombia. IEEE, Los Alamitos (in press, 2011) 13. Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J.: SOM Toolbox for MATLAB. Helsinki University of Technology, P. O. Box 5400, FIN–02015 HUT, Finland (April 2000), http://www.cis.hut.fi/projects/somtoolbox/ (last accessed: October 5, 2010) 14. Zhang, Q., Jia, J., Li, H.: A gpu based 3d object retrieval approach using spatial shape information. In: IEEE International Symposium on Multimedia, ISM 2010, pp. 212–219 (December 2010) 15. Zhu, W., Curry, J.: Parallel ant colony for nonlinear function optimization with graphics hardware acceleration. In: IEEE International Conference on Systems, Man and Cybernetics, SMC 2009, pp. 1803–1808 (October 2009) 16. Zhu, W., Curry, J.: Particle swarm with graphics hardware acceleration and local pattern search on bound constrained problems. In: IEEE Swarm Intelligence Symposium, SIS 2009, pp. 1–8 (April 2009)
Event-Based Stereo Matching Approaches for Frameless Address Event Stereo Data
Jürgen Kogler, Martin Humenberger, and Christoph Sulzbachner
AIT Austrian Institute of Technology GmbH, Donau-City-Strasse 1, 1220 Vienna, Austria
{juergen.kogler.fl,martin.humenberger,christoph.sulzbachner}@ait.ac.at
Abstract. In this paper we present different approaches to 3D stereo matching for bio-inspired image sensors. In contrast to conventional digital cameras, this image sensor, called Silicon Retina, delivers asynchronous events instead of synchronous intensity or color images. The events represent either an increase (on-event) or a decrease (off-event) of a pixel's intensity. The sensor can provide events with a time resolution of up to 1ms and it operates in a dynamic range of up to 120dB. In this work we use two silicon retina cameras as a stereo sensor setup for 3D reconstruction of the observed scene, as already known from conventional cameras. The polarity, the timestamp, and a history of the events are used for stereo matching. Due to the different information content and data type of the events, in comparison to conventional pixels, standard stereo matching approaches cannot be used directly. Thus, we developed an area-based, an event-image-based, and a time-based approach and evaluated them, achieving promising results for event-based stereo matching.
1 Introduction
Different industrial, home, and automotive applications use 3D information of the observed scene for reliable operation. Such applications include, e.g., driver assistance, home care, or industrial production. Laser range finders (LIDAR, light detection and ranging), time-of-flight (TOF) cameras, and ultrasonic sound sensors are commonly used for this purpose. All of these sensors are embedded, which means that a processing unit is integrated on-board, which processes the raw sensor data and produces the output information directly on the device. Thus, no additional processing power is needed for depth measurement. This is advantageous in applications where the processing power is limited but the processing effort is high. Additional benefits of embedded systems are low power consumption and a small form factor. An alternative approach for real 3D sensing, which means not only depth measurement, is stereo vision. Classic stereo vision uses two conventional digital cameras [1], which are mounted side-by-side, separated by the baseline, to capture the same scene from two different viewpoints. The exact geometry between the cameras and the correspondences between the images are used to reconstruct the 3D data of the observed scene. A
different approach to stereo vision, using two silicon retina sensors instead of conventional digital cameras, will be discussed in this work. The silicon retina is an optical sensor which has a time resolution of 1ms and a dynamic range of 120dB. Both are advantageous for applications in uncontrolled environments with varying lighting conditions and where fast movements occur. In contrast to conventional image sensors, such as CCD or CMOS, the silicon retina generates so-called address events instead of intensity values. Due to the fact that events represent intensity changes in the observed scene, only moving objects or objects with changing color are recognized by the sensor. The data transmission is asynchronous, which means that events are transmitted only once after they occur, without a fixed frame rate. With the maximum possible data rate, the system can be arbitrarily dimensioned, e.g., in terms of processing power. The remainder of the paper is organized as follows. First, Section 2 gives an overview of related work on stereo vision algorithms for conventional cameras and silicon retina sensors. Section 3 describes the silicon retina sensor and its basic characteristics. Then, Section 4 presents the three introduced stereo matching approaches for the silicon retina. Finally, Section 5 shows the results and gives an outlook on our future research.
2 Related Work
Classic stereo matching algorithms can be subdivided into area- and feature-based approaches. Area-based algorithms use each pixel for correspondence analysis, independent of the scene content. An overview and comparison of different, mostly area-based, techniques is presented in the work of Scharstein and Szeliski [2], Brown et al. [3], and Banks et al. [4]. Feature-based approaches calculate the correspondences of certain features in the images. Such features are introduced in the work of Shi and Tomasi [5]. In the work of Tang [6] an example of feature-based matching is shown where extracted feature points are connected to chains and further used for the matching step. An interesting point is to analyze whether such algorithms are suitable for silicon retina stereo vision. Schraml et al. [7] have evaluated several area-based cost functions for grayscale images produced with a silicon retina stereo system (aggregation of events over time), including Normalized Cross-Correlation (NCC), Normalized Sum of Absolute Differences (NSAD), Sum of Squared Differences (SSD) and the Census Transform. More detailed information about grayscale image generation from silicon retina data will be given in Section 4.3. The results show that the best method, and thus the one chosen for further investigation, is the NSAD approach, where the depth measurement has an average error of ∼10% at a distance of 3m. In our previous work [8] we compared an area-based, a feature-based, and a new time-based approach, which uses the characteristics of the silicon retina, with each other. The outcome of the evaluation was that the area-based algorithm achieved nearly the same results as Schraml. The best result of the feature-based algorithm had an average depth error of ∼18%. For the newly introduced time-based algorithm we achieved first promising results, but a comparison with the other approaches was not possible
because we had to use different test data sets. Both papers prove that existing area-based matching approaches can be used for silicon retina stereo vision as long as images are created. Furthermore, the new time-based approach delivers the most promising results because it directly uses the elementary characteristics of the sensor, event polarity and time, without preprocessing. Therefore, the next step is to compare these algorithms directly in order to find approaches which strongly exploit all the sensor characteristics.
3 Silicon Retina Stereo Sensor
In 1988, Mead and Mahowald [9] developed an electronic silicon model which reproduced the basic steps of human visual processing. One year later, Mahowald and Mead [10] implemented the first retina sensor based on silicon and established the name Silicon Retina. As a reminder, the silicon retina generates events instead of intensity values. The polarity of the produced events can either be on, for an increase of the intensity, or off, for a decrease. An event encodes the pixel location on chip, the time (with a time resolution of 1ms) when the event occurred, and the polarity (on or off). The polarity and the time information will further be used to deploy new techniques for correspondence analysis. The data format is called the Address-Event Representation (AER) protocol, which was introduced by Sivilotti [11] and Mahowald [12] in order to model the transmission of neural information within biological systems. It has to be mentioned that the description of the silicon retina is restricted to its functional behavior due to the algorithmic content of this work. Technical details can be found in the work of Lichtsteiner et al. [13,14]. The new generation of this sensor with a higher resolution in time and space as well as a higher dynamic range is described in the work of Posch et al. [15].
4 Silicon Retina Stereo Matching
In this paper we introduce three different stereo matching algorithms for silicon retina data. The first matching algorithm, described in Section 4.3, is an area-based SAD (sum of absolute differences) approach which is derived from conventional stereo matching with grayscale or color images. To do this, grayscale images have to be generated out of the fired events. In Section 4.4 an event-image-based algorithm is presented which uses images, generated out of aggregated events in a different way, for stereo matching. The third approach, presented in Section 4.5, is time-based, and the correspondence search is applied directly to the events. In the first and the second approach a pre-processing step is needed to collect the events over time to generate the grayscale or event image. The third method exploits the additional time information and does not need any frame creation. For the stereo camera geometry setup, calibration, and rectification, we used existing methods which we adapted for the silicon retina sensors.
4.1 Framework
Figure 1 shows the basic work flow which all three approaches have in common. Step 1 is the data acquisition, where timed address events are generated by the sensor and then rectified and undistorted with our software. The three stereo matching approaches need three different types of data representation, which are explained in detail in the next section. Basically, the mathematical background of silicon retina calibration and rectification is the same as for conventional camera calibration and can be found in [16]. The key difference is the data acquisition, because static calibration patterns cannot be used here. To overcome this, we use a flashing checkerboard pattern to generate events. Step 2 is the stereo matching itself and, thus, the main part of the work flow. It consists of the matching cost calculation and optimization. Due to the previous rectification, the search is carried out along a horizontal line within the disparity range. The generated costs represent the probability of a correct match for each possible candidate. Step 3 is the disparity optimization, which is currently a minimum or maximum search through the matching costs of all matching candidates for each pixel. The result is a disparity map, ready for further processing such as 3D reconstruction.
4.2 Data Representation
Before the stereo matching approaches are explained in detail, we want to introduce the data representation steps used, illustrated in Figure 2. First, timed address events fire over a certain time period. They can be described with

$AE(u,v,t) = \begin{cases} +1, & I(u,v,t) - I(u,v,t-\Delta t) > \Delta I_{on} \\ -1, & I(u,v,t) - I(u,v,t-\Delta t) < \Delta I_{off} \\ 0, & \text{background - no activity} \end{cases}$   (1)

$\forall u \in [0, \dots, H_{res}-1] \wedge \forall v \in [0, \dots, V_{res}-1]$,

where I(u, v, t) is the intensity of the pixel at position (u, v) (H_res and V_res represent the horizontal and vertical resolution of the sensor) and time t. The time is the absolute time from the sensor and the timestamp resolution is Δt = 1ms. If the difference between the current intensity value at the time t and the previous
Fig. 1. The work flow of the proposed algorithms
Fig. 2. Data representation types: From left to right: Single events, event collection, transformed grayscale image and event-image
intensity value at the time t − Δt (Δt = 1ms) exceeds a positive threshold ΔI_on, an on-event occurs. The difference can also exceed a negative threshold ΔI_off, which fires an off-event. Second, if needed, the events are collected and further transformed to proper image formats. In this work, we used event images and grayscale images. The following sections introduce the three different matching approaches, where each is based on one of the data representation types. Each approach benefits from the space and time data in a different way.
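For illustration, a minimal sketch of how such timed address events and the thresholding of Eq. (1) might be represented in code is given below; the field names, types and threshold handling are our own assumptions and are not part of the sensor interface.

```cpp
#include <cstdint>

struct AddressEvent {
    uint8_t  u, v;        // pixel coordinates (128x128 sensor)
    uint32_t timestamp;   // in units of dt = 1 ms
    int8_t   polarity;    // +1 on-event, -1 off-event
};

// Polarity according to Eq. (1); returns 0 if no event is generated.
inline int8_t event_polarity(float I_now, float I_prev,
                             float dI_on, float dI_off /* dI_off < 0 */) {
    float diff = I_now - I_prev;
    if (diff > dI_on)  return +1;   // intensity increased: on-event
    if (diff < dI_off) return -1;   // intensity decreased: off-event
    return 0;                       // background, no activity
}
```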
4.3 Area-Based Approach
The first approach is dedicated to conventional stereo matching. This means that the events are collected over a certain period of time and a grayscale image, which we call an address event frame (AEF_{t,L/R} at time t for the left and right sensor), is generated out of them with

$AEF_{t,L/R}(u,v) = 128 + \sum_{i=0}^{h_{max}} AE_{L/R}(u, v, t - i\Delta t),$   (2)
where h_max is the time history within which the events are considered for the generation of the AEF. The optimum duration of the time history strongly depends on the movement and the speed of the movement in the observed scene. The longer the history, the more details of movements get lost. This address event frame has to be calculated for the left and the right silicon retina camera for each time step. As can be seen in (2), the intensity values of the resulting grayscale image depend on the number of on- and off-events at a position (u, v) during the time period nΔt (history). This grayscale image can then further be used for stereo matching algorithms known from conventional stereo vision. Figure 2 shows an example of such a frame generation. The top right image is the resulting grayscale image. Due to the little information such an image contains, we decided to use a sum of absolute differences (SAD) metric for the cost calculation with

$DSI\text{-}AB_t(u,v,d) = \sum_{i}^{n} \sum_{j}^{m} |AEF_{t,R}(u+i, v+j) - AEF_{t,L}(u-d+i, v+j)|,$   (3)
where n and m define the block for an additional cost aggregation. The calculated costs are stored in the so-called disparity space image (DSI), which is a three-dimensional data structure of size disparities × width × height at time t and is further used to search for the best matching candidate.
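A possible CPU realization of Eqs. (2) and (3) is sketched below: it accumulates the events of the last h_max timestamps into a grayscale address event frame and evaluates the SAD cost of one matching candidate over a square block. It builds on the AddressEvent struct sketched in Section 4.2 and uses an assumed 128x128 frame layout; it is meant as an illustration, not as the authors' implementation.

```cpp
#include <cstdlib>
#include <vector>

const int W = 128, H = 128;

// Eq. (2): grayscale frame centered around 128, one value per pixel.
std::vector<int> build_aef(const std::vector<AddressEvent>& events,
                           uint32_t t, uint32_t h_max) {
    std::vector<int> aef(W * H, 128);
    for (const AddressEvent& e : events)
        if (e.timestamp <= t && t - e.timestamp <= h_max)
            aef[e.v * W + e.u] += e.polarity;   // on-events add, off-events subtract
    return aef;
}

// Eq. (3): SAD cost of matching pixel (u,v) in the right frame against
// (u-d,v) in the left frame, aggregated over a (2r+1)x(2r+1) block.
int sad_cost(const std::vector<int>& aef_r, const std::vector<int>& aef_l,
             int u, int v, int d, int r) {
    int cost = 0;
    for (int j = -r; j <= r; ++j)
        for (int i = -r; i <= r; ++i) {
            int ur = u + i, ul = u - d + i, vv = v + j;
            if (ur < 0 || ur >= W || ul < 0 || ul >= W || vv < 0 || vv >= H)
                continue;                       // skip pixels outside the frame
            cost += std::abs(aef_r[vv * W + ur] - aef_l[vv * W + ul]);
        }
    return cost;
}
```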
4.4 Event-Image-Based Approach
The second approach generates frames out of the events as well, with the difference that no grayscale image is built. Here, the events firing during the time history are collected only; therefore each occurred event is stored in the address event image (AEI_{t,L/R} at time t for the left and right sensor), which is built with

$AEI_{t,L/R}(u,v) = AE_{L/R}(u, v, t - n\Delta t)$   (4)

$\forall n \in [0, \dots, h_{max}] \wedge AE_{L/R}(u, v, t - n\Delta t) \neq 0.$
If a new event fired at the same position, the old event will be overwritten. As with the area-based grayscale image generation, the event-image-based approach collects events during the time period nΔt. The main difference is that the events are then used directly, without grayscale conversion, and the last event that occurred at a coordinate within the time history is valid. That is the reason why we call the collection frame an address event image. For the correspondence search, vectors with a tri-state logic are generated. The generation of the tri-state vector image AEI-Tri is done with

$AEI\text{-}Tri_{t,L/R}(u,v) = \otimes(AEI_{t,L/R}(u+i, v+j))$   (5)

$\forall AEI_{t,L/R}(u,v) \neq 0 \wedge i \in [-m, \dots, m] \wedge j \in [-n, \dots, n],$
where the tri-state function ⊗ concatenates the m × n neighborhood of a pixel AEI_{t,L/R}(u, v) into a vector. After all vectors for an event image have been generated, the cost calculation takes place. For the tri-state logic a function ξ is used to compare the values of the
Fig. 3. Event-image-based matching: The neighborhood of the matching candidate is encoded in a bitvector (tri-state logic) which is used for calculating the matching costs
vectors between left and right. The DSI is calculated with

$DSI\text{-}EB_t(u,v,d) = \xi(AEI\text{-}Tri_{t,L}(u,v),\ AEI\text{-}Tri_{t,R}(u-d,v))$   (6)

where

$\xi(p_1, p_2) = \sum_{i}^{m \times n} \begin{cases} 1, & p_1(i) \neq p_2(i) \\ 0, & \text{otherwise.} \end{cases}$   (7)
Figure 3 illustrates with an example how the matching with the tri-state logic works.
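The tri-state encoding and comparison of Eqs. (5)-(7) could look as follows; this sketch reuses the frame constants W and H from the previous sketch, and the vector layout is our own simplification.

```cpp
#include <cstdint>
#include <vector>

// Eq. (5): concatenate the (2r+1)x(2r+1) neighborhood of (u,v) in the event
// image into a tri-state vector (+1 / -1 / 0); pixels outside the frame are
// treated as background.
std::vector<int8_t> tri_state(const std::vector<int8_t>& aei,
                              int u, int v, int r) {
    std::vector<int8_t> vec;
    for (int j = -r; j <= r; ++j)
        for (int i = -r; i <= r; ++i) {
            int uu = u + i, vv = v + j;
            vec.push_back((uu < 0 || uu >= W || vv < 0 || vv >= H)
                              ? 0 : aei[vv * W + uu]);
        }
    return vec;
}

// Eqs. (6)/(7): the cost is the number of positions where the two tri-state
// vectors disagree.
int tri_state_cost(const std::vector<int8_t>& p1, const std::vector<int8_t>& p2) {
    int cost = 0;
    for (size_t i = 0; i < p1.size(); ++i)
        if (p1[i] != p2[i]) ++cost;
    return cost;
}
```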
4.5 Time-Based Approach
The use of the aforementioned area-based and event-image-based stereo algorithms reduces the advantages of the asynchronous data interface and the high temporal resolution of the silicon retina cameras. Due to this fact, a frame-less and therefore purely event-based stereo matching approach is presented here. It fully exploits the characteristics of the silicon retina technology. The proposed algorithm uses the time difference between events as the primary matching cost. Both silicon retina cameras are perfectly synchronized, which means that correctly matching events have exactly the same timestamp and polarity. Obviously, there are a few problems known from conventional stereo matching. For example, a series of events may fire at the same time (with exactly the same timestamp) when movement is detected in the observed area in front of the camera. Even if the contrast between fore- and background is high enough so that, e.g. for a moving object, at one vertical border only on-events and at the other only off-events fire, there may be more than one matching candidate. This is similar to conventional stereo matching when more than one matching candidate with the same matching costs exists. To overcome this problem, as in the approaches mentioned above, a neighborhood of the actual pixel is taken into consideration. For silicon retina stereo vision, this neighborhood has two dimensions: space, as mentioned above, and time, as will be explained here. The time information for each event gets completely lost in the previous approaches. Here, the cost calculation is first done with

$DSI\text{-}TB_t(u,v,d) = \begin{cases} wf(\tau(AE_L(u,v,t)) - \tau(AE_R(u-d,v,t_{last}))), & AE_L(u,v,t) = AE_R(u-d,v,t_{last}) \\ 0, & \text{no matching candidate} \end{cases}$   (8)

with

$\tau(AE(u,v,t)) = t,$   (9)
where the time difference of each possible matching candidate within the disparity range is analyzed. The event time t_last of the right side describes the last valid time of the event with respect to the maximum time history.
Fig. 4. Time-based matching: the time differences between events are weighted and used as matching costs
The weighting function is defined as

$wf(\Delta t_{LR}) = \begin{cases} h_{max}\Delta t - \Delta t_{LR}, & \text{inverse linear} \\ h_{max}\Delta t \cdot e^{-\Delta t_{LR}/a}, & \text{Gaussian,} \end{cases}$   (10)
where it has to be distinguished between the inverse linear and the Gaussian variant. Figure 4 shows an example which clarifies (8) and (9). In the first step, the matching of the events is carried out, where for each generated event a corresponding event on the opposite side is searched for within the disparity search range. For this search, all events of the current timestamp, as well as events from the past, are used. This means that previous events within a certain time history nΔt and with the same polarity are also considered during the correspondence search. In this example, the off-event on the left side is searched for on the right side. As can be seen, assuming the time to be a sufficient metric for cost calculation, the best matching candidate is the off-event with disparity 3 and timestamp 10. If this event had not been generated, the best match would be the event with disparity 6 and timestamp 5. In this case this event cannot be correct because it fired at an earlier time than the event we are searching for. Thus, a weighting function is used to adapt the cost value (10) in a way that events near the actual event (in time) are more likely to be correct than events further in the past. After that, an aggregation step uses a defined neighborhood to include local information in the matching. If the aggregation step is omitted for any reason, the weighting function is unnecessary.
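A compact sketch of the time-based cost of Eqs. (8)-(10) is given below for one left event; it assumes a "latest event per pixel" map for the right sensor (reusing the frame width W from the earlier sketches) and shows only the inverse-linear weighting. All names and the map layout are our own illustration.

```cpp
#include <cstdint>
#include <vector>

struct StoredEvent { int8_t polarity = 0; uint32_t timestamp = 0; };

// Inverse-linear weighting from Eq. (10); a larger weight marks a candidate
// that is closer in time.
inline float weight(float dt_lr, float h_max, float dt = 1.0f) {
    return h_max * dt - dt_lr;
}

// Costs for one left event at (u,v,t) against the right event map; index d
// of the returned vector holds the cost for disparity d (0 = no candidate).
std::vector<float> time_based_costs(const std::vector<StoredEvent>& right_map,
                                    int u, int v, uint32_t t, int8_t polarity,
                                    int d_max, uint32_t h_max) {
    std::vector<float> costs(d_max + 1, 0.0f);
    for (int d = 0; d <= d_max && u - d >= 0; ++d) {
        const StoredEvent& e = right_map[v * W + (u - d)];
        if (e.polarity != polarity) continue;                     // same polarity only
        if (e.timestamp > t || t - e.timestamp > h_max) continue; // outside history
        costs[d] = weight((float)(t - e.timestamp), (float)h_max);
    }
    return costs;
}
```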
5 Experimental Results
For the evaluation of the different event-based stereo matching approaches a silicon retina stereo camera setup is used. The system consists of two silicon retina sensors with a baseline of 250mm, a resolution of 128×128 pixels, a pixel pitch of 40μm, and 12mm lenses. The stereo matching algorithm generates a disparity
Fig. 5. Rotating disk, (a) monochrome image, silicon retina event-image with a time history of (b) 1ms, (c) 10ms and (d) 100ms
map which is further used for distance calculation of the events. Therefore, a moving object which causes intensity changes and, thus, creates events is placed in front of the stereo sensor system at known distances. In detail, it is a rotating disk with a black and white pattern mounted at distances between 1.5m and 5.0m. All distances are measured with a point laser to have an exact ground truth value. Figure 5 shows the monochrome image (a) of the rotating disk; (b)-(d) show the rotating disk with collected events for time histories of 1ms, 10ms and 100ms. The rotating disk is placed at 0.5m intervals between 1.5m and 5.0m in front of the cameras. Before the stereo matching a calibration step is applied. In the first test the calculated disparities from the stereo matching algorithms are compared with the measured ground truth distances of the rotating disk. Table 1 lists the results of the stereo matching using the three introduced stereo matching approaches. For the evaluation the average disparity of all pixels within a time history of 5ms is used. The table shows the average disparity, the calculated distance, and the average distance error between ground truth distance and processed distance. Additionally, an average error over all distances summarizes the results for each algorithm approach.

Table 1. Calculated distances and average distance errors of all three different stereo matching approaches

            Area-based                    Event-image-based             Time-based
distance    disp (px)  z (m)  err. (%)    disp (px)  z (m)  err. (%)    disp (px)  z (m)  err. (%)
1.5m        45.7       1.63   8.67        44.4       1.68   12.0        48.0       1.55   3.33
2.0m        35.7       2.09   4.50        35.3       2.11   5.50        36.6       2.04   2.00
2.5m        28.4       2.63   5.20        28.4       2.63   5.20        29.0       2.57   2.80
3.0m        23.6       3.16   5.33        23.8       3.13   4.33        24.0       3.11   3.67
3.5m        20.1       3.71   6.00        20.4       3.66   4.57        20.5       3.64   4.00
4.0m        17.7       4.21   5.25        18.0       4.14   3.50        17.9       4.17   4.25
4.5m        16.1       4.63   2.89        16.0       4.66   3.56        15.9       4.69   4.22
5.0m        14.7       5.07   1.40        14.8       5.04   0.81        14.4       5.18   3.60
average                       4.91                          4.93                          3.48
Fig. 6. Average detection rate of all matching algorithms within a time period of 2.5s
The results show an average distance error of 4.91% if the area-based approach is used, followed by the event-image-based approach, which achieves an average error of 4.93%. The time-based stereo matching algorithm achieves the best result, with an average error of 3.48%. In the second test we determined the detection rate of all three stereo matching approaches. For this test the disparity maps at the distances 1.5m, 2.5m and 5.0m are calculated with all stereo matching algorithms. Figure 6 shows the average detection rate calculated for a period of 2.5s. A disparity value within a range of -1 to +1 of the ground truth value is counted as correct. Additionally, the detection rate is determined with three different time histories (1ms, 5ms and 10ms). Figure 6 shows that the area-based algorithm has the best average detection rate, up to ∼55%, for each time history and at the distance of 2.5m. Figure 7 shows the best detection rate measured for one timestamp within an analysis time of 2.5s. These results represent the best possible outcome of the stereo matching algorithms. Figure 7 shows that the detection rate achieved at a distance of 5.0m is up to 100%. This can be attributed to the fact that at a distance of 5.0m the rotating disk produces only a few events, and it can easily happen that these few events are perfectly matched. In Figures 6 and 7 the area-based approach generally achieves the better results. The area-based approach has the best average detection rates at the middle distance and lower time histories for the image generation. Taking the best detection rate into consideration, the area-based method delivers improved results with a marginal influence of the time history. The time-based algorithm shows a poor detection rate with low time histories, but it increases significantly for a rising time history. The result of the event-image-based approach lies between the detection rates of the other matching algorithms.
Fig. 7. Best detection rate measured for one timestamp within a time period of 2.5s
Summarizing all results, the time-based approach has the best accuracy in terms of the absolute distance estimation. In contrast, the area-based approach has a better detection rate. This leads to the conclusion that a combination of both approaches could result in a better stereo matching algorithm for silicon retina stereo camera systems.
6 Conclusion and Future Work
In this paper we presented the development and evaluation of different stereo matching algorithms for a silicon retina stereo vision system. The silicon retina technology differs from conventional image sensors, and therefore existing algorithms can only be used with proper adaptations. Furthermore, we showed that silicon retina data can be used directly for stereo matching as well. All introduced algorithm approaches were tested with real-world data, which delivered promising results. As a next step we will refine the algorithms and combine them in different ways to benefit from the strengths of each single approach. Acknowledgment. This work is supported by the AAL JP project Grant CARE "aal-2008-1-078". The authors would like to thank all CARE participants working on the success of the project.
References 1. Belbachir, A.: Smart Cameras. Springer, Heidelberg (2010) 2. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision 47, 7–42 (2002) 3. Brown, M.Z., Burschka, D., Hager, G.D.: Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 25, 993–1008 (2003)
4. Banks, J., Bennamoun, M., Corke, P.: Non-parametric techniques for fast and robust stereo matching. In: Proceedings of the IEEE Region 10th Annual Conference on Speech and Image Technologies for Computing and Telecommunications, Brisbane/Australia, pp. 365–368 (1997) 5. Shi, J., Tomasi, C.: Good features to track. In: Proceedings of the IEEE Computer Vision and Pattern Recognition Conference, Seattle/USA, pp. 593–600 (1994) 6. Tang, B., AitBoudaoud, D., Matuszewski, B., Shark, L.: An efficient feature based matching algorithm for stereo images. In: Proceedings of the IEEE Geometric Modeling and Imaging Conference, London/UK, pp. 195–202 (2006) 7. Schraml, S., Schön, P., Milosevic, N.: Smartcam for real-time stereo vision address-event based embedded system. In: Ranchordas, A., Araújo, H., Vitrià, J. (eds.) VISAPP 2007 - Proceedings of the Second International Conference on Computer Vision Theory and Applications, Barcelona/Spain, vol. 2, pp. 466–471 (2007) 8. Kogler, J., Sulzbachner, C., Humenberger, M., Eibensteiner, F.: Address-event based stereo vision with bio-inspired silicon retina imagers. In: Bhatti, A. (ed.) Advances in Theory and Applications of Stereo Vision, InTech Open Books (2011) 9. Mead, C., Mahowald, M.: A silicon model of early visual processing. Neural Networks Journal 1, 91–97 (1988) 10. Mahowald, M., Mead, C.: Silicon retina. Analog VLSI and Neural Systems, 257–278 (1989) 11. Sivilotti, M.: Wiring consideration in analog VLSI systems with application to field programmable networks. PhD thesis, California Institute of Technology (1991) 12. Mahowald, M.: VLSI analogs of neuronal visual processing: a synthesis of form and function. PhD thesis, California Institute of Technology (1992) 13. Lichtsteiner, P., Posch, C., Delbruck, T.: A 128x128 120dB 30mW asynchronous vision sensor that responds to relative intensity change. In: Proceedings of the IEEE International Solid-State Circuits Conference, San Francisco/USA (2006) 14. Lichtsteiner, P., Posch, C., Delbruck, T.: A 128x128 120 dB 15 µs latency asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits 43 (2008) 15. Posch, C., Matolin, D., Wohlgenannt, R.: A QVGA 143 dB Dynamic Range Frame-Free PWM Image Sensor With Lossless Pixel-Level Video Compression and Time-Domain CDS. IEEE Journal of Solid-State Circuits 46, 259–275 (2011) 16. Zhang, Z.: A flexible new technique for camera calibration. Technical Report MSR-TR-98-71, Microsoft Research (2002)
A Variational Model for the Restoration of MR Images Corrupted by Blur and Rician Noise

Pascal Getreuer¹, Melissa Tong², and Luminita A. Vese²

¹ Centre de Mathématiques et de Leurs Applications, École Normale Supérieure de Cachan
² Department of Mathematics, University of California, Los Angeles
Abstract. In this paper, we propose a variational model to restore images degraded by blur and Rician noise. This model uses total variation regularization with a fidelity term involving the Rician probability distribution. For its numerical solution, we apply and compare the L² and Sobolev (H¹) gradient descents, and the iterative method called split Bregman (with a convexified fidelity term). Numerical results are shown on synthetic magnetic resonance imaging (MRI) data corrupted with Rician noise and Gaussian blur, both with known standard deviations. Theoretical analysis of the proposed model is briefly discussed.
1 Introduction
In this paper, we propose a variational model to denoise and deblur magnetic resonance imaging (MRI) data corrupted with Rician noise and Gaussian blur. This variational model consists of a total variation (TV) regularization with a fidelity term involving the Rician probability distribution. Basu, Fletcher and Whitaker [5] applied anisotropic diffusion to diffusion tensor MRI data, using a correction term derived from a maximum a posteriori (MAP) estimate. Descoteaux, Wiest-Daesslé et al. [7] and Wiest-Daesslé, Prima et al. [8] applied a non-local means filter to Rician denoising. Wang and Zhou [6] performed MRI denoising with a combination of TV and wavelet-based regularization and a Gaussian noise model. Here we consider the MAP maximization problem with the Rician noise model (as in [5]) combined with total variation regularization. Our proposed model addresses both Rician denoising and deblurring. We investigate three numerical methods: two based on the L² and Sobolev H¹ gradient descents applied to the original model, and a third one based on the split Bregman method [9] applied to a convexified version. Numerical results and comparisons on an MRI phantom are shown.
Work supported by National Science Foundation grants DMS-0714945, CCF/ITR Expeditions 0926127 (UCLA Center for Domain-Specific Computing), and by an NSF postdoctoral fellowship.
1.1 Rudin-Osher-Fatemi Model (ROF) as a MAP Estimate
Let f be a known degraded image and u the underlying clean image. The maximum a posteriori (MAP) estimate of u is the most likely value of u given f (treating both as random variables): û = arg max_u P(u|f). Applying Bayes' theorem obtains
\[
\max_u P(u|f) \;\Leftrightarrow\; \max_u P(u)P(f|u) \;\Leftrightarrow\; \min_u \big\{ -\log P(u) - \log P(f|u) \big\}. \tag{1}
\]
The first term −log P(u) is called the prior on u; this term acts as a regularization or assumption on what u is likely to be. The second term −log P(f|u) describes the degradation process that produced f from u. The Rudin-Osher-Fatemi (ROF) restoration model [2] with blur is
\[
\min_{u \in BV(\Omega)} \int_\Omega |Du| + \frac{\lambda}{2}\int_\Omega (Ku - f)^2, \tag{2}
\]
where f is assumed to be related to u by f = Ku + n, K is a linear blur operator, and n is white Gaussian noise. ROF can be seen to be a MAP estimate using the prior P(u) = \exp(-\alpha\int_\Omega |Du|) and
\[
-\log P(f|u) = -\int_\Omega \log P\big(f(x)\,|\,u(x)\big)\,dx = \frac{1}{2\sigma^2}\int_\Omega \big(f(x) - Ku(x)\big)^2 dx + \frac{|\Omega|}{2}\log(2\pi\sigma^2),
\]
so with λ = 1/(σ²α) we recover (2). Here, \(\int_\Omega |Du|\) is the total variation of u. Through this connection between ROF and MAP estimates, ROF can be reformulated to use other degradation models. Suppose that
\[
P\big(f(x)\,|\,u(x)\big) = \exp\big(-H(Ku(x); f(x))\big) \tag{3}
\]
and that the f(x) are mutually independent over x. Then we obtain
\[
\min_{u \in BV(\Omega)} \alpha \int_\Omega |Du| + \int_\Omega H(Ku; f)\, dx. \tag{4}
\]

1.2 ROF with Rician Noise
The Rice or Rician distribution has probability density
\[
P(r;\nu,\sigma) = \frac{r}{\sigma^2}\exp\!\Big(\frac{-(r^2+\nu^2)}{2\sigma^2}\Big)\, I_0\!\Big(\frac{r\nu}{\sigma^2}\Big), \tag{5}
\]
where r, ν, σ > 0 and I₀ is the modified Bessel function of the first kind with order zero. If X ∼ N(ν cos θ, σ²) and Y ∼ N(ν sin θ, σ²) are independent normal random variables for any θ ∈ ℝ, then R = √(X² + Y²) has Rician distribution, R ∼ Rician(ν, σ). Thus by (3) we have
\[
H_\sigma\big(Ku(x); f(x), \sigma\big) = \frac{f(x)^2 + \big(Ku(x)\big)^2}{2\sigma^2} - \log I_0\!\Big(\frac{f(x)\,Ku(x)}{\sigma^2}\Big) - \log\frac{f(x)}{\sigma^2}.
\]
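(For readers who wish to experiment with this fidelity term, the following is a minimal NumPy/SciPy sketch, not the authors' code; it evaluates H_σ pointwise and uses the exponentially scaled Bessel function i0e to avoid overflow of I₀ for large arguments. The function name and example values are illustrative.)

```python
import numpy as np
from scipy.special import i0e  # exponentially scaled I_0, avoids overflow

def rician_fidelity(Ku, f, sigma):
    """Pointwise Rician data term
    H_sigma = (f^2 + (Ku)^2)/(2 sigma^2) - log I_0(f*Ku/sigma^2) - log(f/sigma^2).
    Uses log I_0(t) = log(i0e(t)) + t (for t >= 0) for numerical stability."""
    t = f * Ku / sigma**2
    log_I0 = np.log(i0e(t)) + t
    return (f**2 + Ku**2) / (2 * sigma**2) - log_I0 - np.log(f / sigma**2)

# Example: fidelity of candidate restored values against observed intensities (K = I)
f = np.array([0.3, 0.5, 0.8])          # observed intensities
u = np.linspace(0.01, 1.0, 5)          # candidate restored values
print(rician_fidelity(u[:, None], f[None, :], sigma=0.08).shape)  # (5, 3)
```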
Fig. 1. Modified Bessel functions: I_n(t) and K_n(t) for n = 0, ..., 4.
1.3 Bessel Functions
The probability density (5) involves the modified Bessel functions, so we briefly review their properties. The modified Bessel functions are the solutions of
\[
t^2 y''(t) + t y'(t) - (t^2 + n^2)\,y(t) = 0, \qquad t \ge 0,\ n \text{ a nonnegative integer}. \tag{6}
\]
This equation has two linearly independent solutions I_n(t) and K_n(t), which are respectively the modified Bessel functions of the first and second kind. The functions I_n(t) are exponentially increasing while the K_n(t) are exponentially decreasing, see Fig. 1. Some properties of I_n(t) are [1]
\[
\frac{d}{dt}I_0(t) = I_1(t), \qquad \frac{d}{dt}I_1(t) = I_0(t) - \frac{1}{t}I_1(t),
\]
\[
I_n(t) = i^{-n} J_n(it) = \frac{1}{\pi}\int_0^\pi e^{t\cos\theta}\cos(n\theta)\,d\theta, \qquad I_n(t) \sim \frac{(\tfrac{1}{2}t)^n}{n!} \ \text{ as } t \to 0.
\]
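(These identities can be verified numerically with SciPy; the short check below, with illustrative step sizes and tolerances, is not part of the original text.)

```python
import numpy as np
from scipy.special import iv           # modified Bessel function of the first kind I_n
from scipy.integrate import trapezoid

t = np.linspace(0.1, 5.0, 50)
h = 1e-6

# d/dt I_0(t) = I_1(t)
lhs = (iv(0, t + h) - iv(0, t - h)) / (2 * h)
assert np.allclose(lhs, iv(1, t), atol=1e-5)

# d/dt I_1(t) = I_0(t) - (1/t) I_1(t)
lhs = (iv(1, t + h) - iv(1, t - h)) / (2 * h)
assert np.allclose(lhs, iv(0, t) - iv(1, t) / t, atol=1e-5)

# Integral representation I_n(t) = (1/pi) int_0^pi exp(t cos(theta)) cos(n theta) dtheta
theta = np.linspace(0.0, np.pi, 2001)
for n in (0, 1, 2):
    integrand = np.exp(t[:, None] * np.cos(theta)) * np.cos(n * theta)
    approx = trapezoid(integrand, theta, axis=1) / np.pi
    assert np.allclose(approx, iv(n, t), rtol=1e-4)
```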
2 Proposed Restoration Model and Implementation
We propose the following minimization problem:
\[
\inf_{u \in BV(\Omega)} F(u) = \int_\Omega |Du| + \lambda \int_\Omega \Big( -\log\frac{f}{\sigma^2} - \log I_0\!\Big(\frac{f\,(Ku)}{\sigma^2}\Big) + \frac{f^2 + (Ku)^2}{2\sigma^2} \Big) dx. \tag{7}
\]
Although the minimization is not convex, the following existence and comparison results hold for the case K = I. We omit the proofs due to space constraints.

Theorem 1. Suppose inf_Ω f(x) = α > 0 and f ∈ L^∞(Ω). Then the minimization problem (7) for K = I admits at least one solution u ∈ BV(Ω) satisfying
\[
0 \le u \le \sup_\Omega f. \tag{8}
\]
Theorem 2. Let f₁ and f₂ be L^∞(Ω) functions such that 0 < α₁ ≤ f₁ ≤ β₁ < ∞ and 0 < α₂ ≤ f₂ ≤ β₂ < ∞. If f₁ < f₂, then u₁ ≤ u₂, where u₁ and u₂ are solutions to (7) with K = I corresponding respectively to f = f₁ and f = f₂.

The denoising-deblurring minimization can also be analyzed theoretically under modified assumptions. To numerically solve the minimization problem, we use and compare the L² and Sobolev (H¹) gradient descent methods, together with a split Bregman method. For the gradient descent implementations, we consider an approximation of F(u) to remove the singularity at |∇u| = 0 in the Euler-Lagrange equations (using now ∇u for the gradient):
\[
F(u) = \int_\Omega \sqrt{\varepsilon^2 + |\nabla u|^2}\, dx + \lambda \int_\Omega \Big( -\log\frac{f}{\sigma^2} - \log I_0\!\Big(\frac{f\,(Ku)}{\sigma^2}\Big) + \frac{f^2 + (Ku)^2}{2\sigma^2} \Big) dx.
\]
In general, gradient descent methods evolve a diffusion of the following form to steady state,
\[
\frac{\partial u(x,t)}{\partial t} = -\nabla F(u), \tag{9}
\]
where the gradient ∇F(u) depends on the function space considered. For the L² and Sobolev (H¹) spaces, these gradients are
\[
F'(u)v = \langle \nabla_{L^2} F(u), v \rangle_{L^2}, \quad \forall v \in L^2, \tag{10}
\]
\[
F'(u)h = \langle \nabla_{H^1} F(u), h \rangle_{H^1}, \quad \forall h \in H^1, \tag{11}
\]
where F'(u)v and F'(u)h are the directional derivatives of F at u in the directions v ∈ L² and h ∈ H¹. They are related by [4]
\[
\nabla_{H^1} F(u) = (I - \Delta)^{-1} \nabla_{L^2} F(u). \tag{12}
\]

2.1 L² Gradient Descent Implementation
For our application, the L2 gradient descent is
\[
\frac{\partial u}{\partial t} = \lambda\Big( -\frac{K^*Ku}{\sigma^2} + K^*\Big( \frac{I_1\!\big(\frac{fKu}{\sigma^2}\big)}{I_0\!\big(\frac{fKu}{\sigma^2}\big)} \cdot \frac{f}{\sigma^2} \Big)\Big) + \nabla\cdot\Big( \frac{\nabla u}{\sqrt{\varepsilon^2 + |\nabla u|^2}} \Big). \tag{13}
\]
Defining
\[
w^n_{i,j,k} = \Big( \varepsilon^2 + (u^n_{i+1,j,k} - u^n_{i,j,k})^2 + (u^n_{i,j+1,k} - u^n_{i,j,k})^2 + (u^n_{i,j,k+1} - u^n_{i,j,k})^2 \Big)^{-1/2},
\]
we discretize the L² gradient descent as
\[
\begin{aligned}
\frac{u^{n+1}_{i,j,k} - u^n_{i,j,k}}{dt} = {} & \lambda\Big( -\frac{(K^*Ku^n)_{i,j,k}}{\sigma^2} + K^*\Big( \frac{I_1\!\big(\frac{fKu^n_{i,j,k}}{\sigma^2}\big)}{I_0\!\big(\frac{fKu^n_{i,j,k}}{\sigma^2}\big)} \cdot \frac{f}{\sigma^2} \Big)_{i,j,k} \Big) - \frac{\lambda}{\sigma^2}\big(u^{n+1}_{i,j,k} - u^n_{i,j,k}\big) \\
& + w^n_{i,j,k}\big(u^n_{i+1,j,k} - u^{n+1}_{i,j,k}\big) - w^n_{i-1,j,k}\big(u^{n+1}_{i,j,k} - u^n_{i-1,j,k}\big) \\
& + w^n_{i,j,k}\big(u^n_{i,j+1,k} - u^{n+1}_{i,j,k}\big) - w^n_{i,j-1,k}\big(u^{n+1}_{i,j,k} - u^n_{i,j-1,k}\big) \\
& + w^n_{i,j,k}\big(u^n_{i,j,k+1} - u^{n+1}_{i,j,k}\big) - w^n_{i,j,k-1}\big(u^{n+1}_{i,j,k} - u^n_{i,j,k-1}\big),
\end{aligned} \tag{14}
\]
with initial condition u⁰ = f and a Neumann boundary condition ∂u/∂n|_∂Ω = 0, where n is the unit normal to the boundary ∂Ω. Note that the second term on the right-hand side rescales the timestep and is added to improve numerical stability.
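(A compact sketch of one explicit step of this descent for the pure denoising case K = I, written in 2-D for brevity; it omits the semi-implicit stabilization term of (14), and the regularization parameter eps and the helper name are our own choices rather than the authors' code.)

```python
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled I_0, I_1

def l2_descent_step(u, f, lam, sigma, dt, eps=1e-3):
    """One explicit step of du/dt = lam*(-u/s^2 + (I1/I0)(f*u/s^2) * f/s^2)
    + div(grad u / sqrt(eps^2 + |grad u|^2)), for K = I (2-D, Neumann BC)."""
    s2 = sigma**2
    t = f * u / s2
    ratio = i1e(t) / i0e(t)                    # I_1(t)/I_0(t); the scaling cancels
    fidelity = lam * (-u / s2 + ratio * f / s2)

    up = np.pad(u, 1, mode='edge')             # replicated border = Neumann BC
    ux = up[1:-1, 2:] - up[1:-1, 1:-1]         # forward differences
    uy = up[2:, 1:-1] - up[1:-1, 1:-1]
    w = 1.0 / np.sqrt(eps**2 + ux**2 + uy**2)
    # divergence of (w*ux, w*uy) with backward differences and zero flux
    px = np.pad(w * ux, 1, mode='constant')
    py = np.pad(w * uy, 1, mode='constant')
    div = (px[1:-1, 1:-1] - px[1:-1, :-2]) + (py[1:-1, 1:-1] - py[:-2, 1:-1])
    return u + dt * (fidelity + div)
```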
2.2 Sobolev Gradient Descent Implementation
For the Sobolev gradient descent method, we consider
\[
\frac{\partial u}{\partial t} = (I - c\Delta)^{-1}\Big[ \lambda\Big( -\frac{K^*Ku}{\sigma^2} + K^*\Big( \frac{I_1\!\big(\frac{fKu}{\sigma^2}\big)}{I_0\!\big(\frac{fKu}{\sigma^2}\big)} \cdot \frac{f}{\sigma^2} \Big)\Big) + \nabla\cdot\Big( \frac{\nabla u}{\sqrt{\varepsilon^2 + |\nabla u|^2}} \Big) \Big], \tag{15}
\]
for some c > 0. For c = 1, the right-hand side is the negative of the Sobolev gradient. The addition of c > 0 may lead to better results than when fixing c = 1, and for this reason we add this parameter. We implement the method by
\[
\begin{aligned}
G^n_{i,j,k} = {} & \lambda\Big( -\frac{(K^*Ku^n)_{i,j,k}}{\sigma^2} + K^*\Big( \frac{I_1\!\big(\frac{fKu^n_{i,j,k}}{\sigma^2}\big)}{I_0\!\big(\frac{fKu^n_{i,j,k}}{\sigma^2}\big)} \cdot \frac{f}{\sigma^2} \Big)_{i,j,k} \Big) \\
& + w^n_{i,j,k}\big(u^n_{i+1,j,k} - u^n_{i,j,k}\big) - w^n_{i-1,j,k}\big(u^n_{i,j,k} - u^n_{i-1,j,k}\big) \\
& + w^n_{i,j,k}\big(u^n_{i,j+1,k} - u^n_{i,j,k}\big) - w^n_{i,j-1,k}\big(u^n_{i,j,k} - u^n_{i,j-1,k}\big) \\
& + w^n_{i,j,k}\big(u^n_{i,j,k+1} - u^n_{i,j,k}\big) - w^n_{i,j,k-1}\big(u^n_{i,j,k} - u^n_{i,j,k-1}\big)
\end{aligned} \tag{16}
\]
and
\[
\frac{u^{n+1}_{i,j,k} - u^n_{i,j,k}}{dt} = W_{i,j,k}, \tag{17}
\]
where W_{i,j,k} is the steady-state solution of the semi-implicit scheme
\[
W_{i,j,k} - c\Big[ \big(W_{i+1,j,k} - 2W_{i,j,k} + W_{i-1,j,k}\big) + \big(W_{i,j+1,k} - 2W_{i,j,k} + W_{i,j-1,k}\big) + \big(W_{i,j,k+1} - 2W_{i,j,k} + W_{i,j,k-1}\big) \Big] = G^n_{i,j,k}, \tag{18}
\]
in which the center value W_{i,j,k} is taken at the new inner iteration and its neighbors at the previous one. The scheme becomes u^{n+1}_{i,j,k} = u^n_{i,j,k} + dt · W_{i,j,k}, with initial conditions u⁰ = f, and W⁰ = 0 for the first iteration and W⁰ equal to the previous W for all other iterations. We apply Neumann boundary conditions ∂W/∂n|_∂Ω = 0.
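(The preconditioning (I − cΔ)^{-1} can alternatively be applied directly in the Fourier domain under a periodic-boundary assumption, as in the illustrative sketch below; the paper's implementation uses the semi-implicit sweeps above instead.)

```python
import numpy as np

def sobolev_smooth(g, c):
    """Apply (I - c*Laplacian)^(-1) to g via FFT, assuming periodic boundaries.
    The symbol of the 1-D discrete Laplacian is 2*(cos(2*pi*f) - 1)."""
    lap = np.zeros(g.shape)
    for axis, n in enumerate(g.shape):
        freq = np.fft.fftfreq(n)                       # frequencies k/n
        lap_1d = 2.0 * (np.cos(2.0 * np.pi * freq) - 1.0)
        shape = [1] * g.ndim
        shape[axis] = n
        lap = lap + lap_1d.reshape(shape)
    return np.real(np.fft.ifftn(np.fft.fftn(g) / (1.0 - c * lap)))

# Example: precondition an L2 gradient field G before the descent update
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 64, 16))
W = sobolev_smooth(G, c=1.5)
```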
2.3 Split Bregman Implementation
The split Bregman method [9] solves a convex minimization problem by operator splitting and then applying Bregman iteration to solve the split problem.

Convex Approximation. The split Bregman method is only justified for convex objectives, so we must first approximate our objective (7) by a convex one. Let z = Ku and consider f as a fixed parameter. Then
\[
H_\sigma(z) = \frac{f^2 + z^2}{2\sigma^2} - \log I_0\!\Big(\frac{fz}{\sigma^2}\Big), \qquad H'_\sigma(z) = \frac{z}{\sigma^2} - \frac{f}{\sigma^2}\,\frac{I_1\!\big(\frac{fz}{\sigma^2}\big)}{I_0\!\big(\frac{fz}{\sigma^2}\big)}. \tag{19}
\]
Fig. 2. Plot of H₁(z; f) for different values of f (f = 1, √2, 2, 4).
Fig. 3. Inflection points c and c̃ vs. f.
For efficiency, we approximate I₁/I₀ by a cubic rational polynomial,
\[
\frac{I_1(t)}{I_0(t)} \approx \frac{t^3 + 0.950037\,t^2 + 2.38944\,t}{t^3 + 1.48937\,t^2 + 2.57541\,t + 4.65314} \equiv A(t). \tag{20}
\]
The selected coefficients minimize the L^∞ error. We approximate H'_σ(z) by
\[
\tilde H'_\sigma(z) = \frac{z}{\sigma^2} - \frac{f}{\sigma^2}\,A\!\Big(\frac{fz}{\sigma^2}\Big). \tag{21}
\]
For the minimization, it is important to approximate H_σ(z) (and H̃_σ(z)) by a convex function. Since H_σ(z; f) = H₁(z/σ; f/σ), we may focus without loss of generality on σ = 1. H₁''(0) is negative, and hence so is H₁''(z) for z sufficiently small, if f > √2 (see Fig. 2). It seems H₁'' increases monotonically, which suggests that H₁ is nonconvex if and only if f > √2, and similarly H̃₁ is nonconvex if and only if f ≳ 1.3955. Let c be the inflection point of H₁ such that H₁''(c) = 0. Similarly, let c̃ be the inflection point of H̃₁ (see Fig. 3). The maximum values are approximately max_f c ≈ 0.8246 and max_f c̃ ≈ 0.8224. Let c̄ := 0.8246, such that c and c̃ are always less than c̄. Then we can make a convex approximation of H̃_σ as
\[
\tilde G_\sigma(z) = \begin{cases} \tilde H_\sigma(z) & \text{if } z \ge \bar c\,\sigma, \\ \tilde H_\sigma(\bar c\,\sigma) + \tilde H'_\sigma(\bar c\,\sigma)\,(z - \bar c\,\sigma) & \text{if } z \le \bar c\,\sigma. \end{cases} \tag{22}
\]
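(A small illustrative sketch of the rational approximation A(t) and of the convexified derivative used in the z update, assuming f is a fixed positive scalar; the constant C_BAR is the bound c̄ above and the function names are ours.)

```python
import numpy as np
from scipy.special import i0e, i1e

C_BAR = 0.8246  # upper bound on the inflection points c, c~ (Fig. 3)

def A(t):
    """Cubic rational approximation of I_1(t)/I_0(t), eq. (20)."""
    num = t**3 + 0.950037 * t**2 + 2.38944 * t
    den = t**3 + 1.48937 * t**2 + 2.57541 * t + 4.65314
    return num / den

def H_tilde_prime(z, f, sigma):
    """Approximate derivative of the Rician data term, eq. (21)."""
    return z / sigma**2 - (f / sigma**2) * A(f * z / sigma**2)

def G_tilde_prime(z, f, sigma):
    """Derivative of the convex approximation (22): equal to H~' for
    z >= C_BAR*sigma, constant (its value at C_BAR*sigma) below."""
    z_clamped = np.maximum(z, C_BAR * sigma)
    return H_tilde_prime(z_clamped, f, sigma)

# Sanity check of A(t) against the true ratio I_1/I_0
t = np.linspace(0.0, 20.0, 200)
print("max |A(t) - I1/I0| on [0,20]:", np.max(np.abs(A(t) - i1e(t) / i0e(t))))
```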
Implementing Split Bregman. In our case, we consider the splitting
\[
\min_{d,z,u} \int_\Omega |d(x)|\,dx + \lambda \int_\Omega \tilde G_\sigma\big(z(x), f(x)\big)\,dx \quad \text{subject to } d = \nabla u,\ z = Ku. \tag{23}
\]
Bregman iteration solves (23) by a sequence of unconstrained problems
\[
\min_{d,z,u} \frac{1}{\lambda}\int_\Omega |d|\,dx + \int_\Omega \tilde G_\sigma(z, f)\,dx + \frac{\gamma_1}{2}\|d - \nabla u - b_1\|_2^2 + \frac{\gamma_2}{2}\|z - Ku - b_2\|_2^2, \tag{24}
\]
where b₁ and b₂ are variables related to the Bregman iteration algorithm. The joint minimization over d, z, u is approximated by alternatingly minimizing one variable at a time: fixing z and u and minimizing over d, fixing d and u and minimizing over z, and so on. The split Bregman algorithm for solving (23) is
\[
\left\{\begin{array}{l}
\text{Initialize } u = f,\ d = b_1 = 0,\ z = b_2 = 0 \\
\text{while ``not converged''} \\
\quad \text{Solve the } d \text{ subproblem} \\
\quad \text{Solve the } z \text{ subproblem} \\
\quad \text{Solve the } u \text{ subproblem} \\
\quad b_1 := b_1 + \nabla u - d \\
\quad b_2 := b_2 + Ku - z.
\end{array}\right. \tag{25}
\]
We now discuss the solution of the three variable subproblems.

– The d subproblem, with z and u fixed, is
\[
\min_d \int_\Omega |d|\,dx + \frac{\lambda\gamma_1}{2}\|d - \nabla u - b_1\|_2^2.
\]
Its solution is a vectorial shrinkage,
\[
d(x) = \frac{\nabla u(x) + b_1(x)}{|\nabla u(x) + b_1(x)|}\,\Big( |\nabla u(x) + b_1(x)| - \frac{1}{\lambda\gamma_1} \Big)^{\!+}, \tag{26}
\]
where (s)⁺ := max{s, 0}.

– The z subproblem, with d and u fixed, decouples over x to
\[
\min_{z(x) \ge 0} \tilde G_\sigma\big(z(x), f(x)\big) + \frac{\gamma_2}{2}\big(z(x) - Ku(x) - b_2(x)\big)^2.
\]
The subproblem objective is strictly convex, thus it has exactly one minimizer z*(x), and its derivative \(\tilde G'_\sigma(z; f) + \gamma_2(z - Ku - b_2)\) is strictly monotone increasing. Thus the sign at z = c̄σ implies
\[
\begin{cases}
z^* = \big( y - \tfrac{1}{\gamma_2}\tilde G'_\sigma(\bar c\sigma; f) \big)^{+} & \text{if } \tilde G'_\sigma(\bar c\sigma; f) + \gamma_2(\bar c\sigma - y) \ge 0, \\
z^* > \bar c\sigma & \text{if } \tilde G'_\sigma(\bar c\sigma; f) + \gamma_2(\bar c\sigma - y) < 0,
\end{cases} \tag{27}
\]
where y = Ku + b₂. In the latter case, we must solve
\[
\Big(\frac{1}{\sigma^2} + \gamma_2\Big) z - \frac{f}{\sigma^2}\,A\!\Big(\frac{fz}{\sigma^2}\Big) = \gamma_2\, y, \qquad z > \bar c\sigma. \tag{28}
\]
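(An illustrative sketch of the two pointwise updates: the vectorial shrinkage (26) and the piecewise z update (27)-(28) with a single numerical Newton step. For brevity it uses the exact Bessel ratio I₁/I₀ rather than the rational approximation A(t), and all names and the finite-difference derivative are our own choices.)

```python
import numpy as np
from scipy.special import i0e, i1e

def shrink_d(grad_u, b1, lam, gamma1):
    """Vectorial shrinkage (26); grad_u and b1 carry the gradient components
    in their last axis."""
    v = grad_u + b1
    norm = np.sqrt(np.sum(v**2, axis=-1, keepdims=True))
    scale = np.maximum(norm - 1.0 / (lam * gamma1), 0.0) / np.maximum(norm, 1e-12)
    return scale * v

def H_prime(z, f, sigma):
    """Derivative of the Rician data term, here with the exact ratio I_1/I_0."""
    t = f * z / sigma**2
    return z / sigma**2 - (f / sigma**2) * (i1e(t) / i0e(t))

def update_z(z_prev, Ku, b2, f, sigma, gamma2, c_bar=0.8246):
    """Piecewise z update (27): closed form on the linearized branch,
    one (numerical) Newton step on the nonlinear branch (28)."""
    y = Ku + b2
    g_c = H_prime(c_bar * sigma, f, sigma)
    linear = g_c + gamma2 * (c_bar * sigma - y) >= 0
    z_lin = np.maximum(y - g_c / gamma2, 0.0)

    z0 = np.maximum(z_prev, c_bar * sigma)
    h = 1e-6
    phi = lambda z: H_prime(z, f, sigma) + gamma2 * (z - y)
    z_newton = z0 - phi(z0) / ((phi(z0 + h) - phi(z0)) / h)
    return np.where(linear, z_lin, z_newton)
```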
We approximate z*(x) by one iteration of Newton's method, using the value of z(x) from the previous Bregman iteration as the initialization.

– The u subproblem, with d and z fixed, is
\[
\min_u \frac{\gamma_1}{2}\|\nabla u - d + b_1\|_2^2 + \frac{\gamma_2}{2}\|Ku - z + b_2\|_2^2.
\]
The optimal u is obtained in the Fourier domain as
\[
\hat u = \frac{ \frac{\gamma_2}{\gamma_1}\,\bar{\hat\varphi}\cdot\widehat{(z - b_2)} - \widehat{\nabla\cdot(d - b_1)} }{ \frac{\gamma_2}{\gamma_1}\,\bar{\hat\varphi}\cdot\hat\varphi - \hat\Delta },
\]
where multiplications and divisions are pointwise. To avoid boundary artifacts, the volume should be extended symmetrically. For symmetric φ (the convolution kernel defining K), the computational costs can be reduced by using discrete cosine transforms [3].
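(A sketch of the Fourier-domain u update under a periodic-boundary assumption, for illustration only; as noted above, the paper extends the volume symmetrically and uses DCTs instead. The kernel is assumed to be padded to the volume size, centered, and normalized so that its DC component is nonzero.)

```python
import numpy as np

def solve_u(d, b1, z, b2, phi, gamma1, gamma2):
    """Solve min_u (g1/2)||grad u - (d - b1)||^2 + (g2/2)||phi*u - (z - b2)||^2
    in the Fourier domain (periodic boundaries). d and b1 carry one
    forward-difference component per axis in their last dimension."""
    shape = z.shape
    r = gamma2 / gamma1
    phi_hat = np.fft.fftn(np.fft.ifftshift(phi))     # kernel moved to the origin
    num = r * np.conj(phi_hat) * np.fft.fftn(z - b2)
    den = r * np.abs(phi_hat) ** 2
    for axis, n in enumerate(shape):
        sym = np.exp(2j * np.pi * np.fft.fftfreq(n)) - 1.0   # forward-difference symbol
        s = [1] * len(shape)
        s[axis] = n
        sym = sym.reshape(s)
        num = num + np.conj(sym) * np.fft.fftn(d[..., axis] - b1[..., axis])
        den = den + np.abs(sym) ** 2                 # accumulates -Laplacian symbol
    return np.real(np.fft.ifftn(num / den))
```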
3 Numerical Examples
We perform restoration experiments on a synthetic T1 MRI volume (see Fig. 4) obtained from BrainWeb (http://mouldy.bic.mni.mcgill.ca/brainweb). The L² and Sobolev gradient methods were implemented in Matlab using the stopping condition
\[
|F(u^{k+1}) - F(u^k)| < 10^{-4}\,|F(u^k)|. \tag{29}
\]
Split Bregman was implemented in C using 40 Bregman iterations. The data is restored with the same K and σ used to produce the input data. The experiment computation times and RMSE (root mean square error) results are summarized in Table 1. The noise coefficient σ could be estimated from the input data f , but we assume here that this is known.
Fig. 4. Three slices of the clean synthetic T1 MRI volume
Table 1. Comparison of computation time and RMSE values

Experiment   Method                               Time (s)   RMSE
Fig. 6       L² Gradient Descent (Matlab)         1091       0.039935
Fig. 6       Sobolev Gradient Descent (Matlab)     833       0.034541
Fig. 6       Split Bregman (C)                     171       0.029299
Fig. 7       L² Gradient Descent (Matlab)          791       0.037068
Fig. 7       Sobolev Gradient Descent (Matlab)     929       0.032020
Fig. 7       Split Bregman (C)                     201       0.030904
Fig. 8       L² Gradient Descent (Matlab)          950       0.024279
Fig. 8       Sobolev Gradient Descent (Matlab)    1419       0.029659
Fig. 8       Split Bregman (C)                     212       0.028305

First, we demonstrate the methods when the given image is corrupted with Rician noise but not blurred, K = I and σ = 0.08. L² gradient descent was performed with λ = 0.1, fixed timestep dt = 0.1, and required 33 iterations to satisfy (29). Sobolev gradient descent was performed with λ = 0.15, c = 1.5, fixed
timestep dt = 0.05, and required 16 iterations. Split Bregman was performed with λ = 0.2, γ1 = 5, γ2 = 2, and 40 iterations. Fig. 5 shows the decrease of the energy functionals. Note that because different values of λ were used, energy values are not directly comparable. Fig. 6 shows the results.
Fig. 5. Energy versus iteration for the denoising experiments: L² gradient descent (energy ×10⁵), Sobolev gradient descent (energy ×10⁶), and split Bregman (energy ×10⁶).
The next two experiments are joint denoising and deblurring problems, where the data is both blurred with a Gaussian kernel and corrupted with Rician noise. In Fig. 7, σ = 0.08 and K is Gaussian blur with standard deviation 0.6 voxels. L2 gradient descent was performed with λ = 0.2, fixed timestep dt = 0.1, and required 21 iterations. Sobolev gradient descent was performed with λ = 0.25, c = 1.25, fixed timestep dt = 0.05, and required 12 iterations. Split Bregman was performed with λ = 0.4, γ1 = 2, γ2 = 2, and 40 iterations. Fig. 8 shows an experiment with heavier blur, where σ = 0.02 and K is Gaussian blur with standard deviation 1.5 voxels. L2 gradient descent was performed with λ = 0.4, fixed timestep dt = 0.1, and required 28 iterations. Sobolev gradient descent was performed with λ = 1.1, c = 5, fixed timestep dt = 0.001, and required 30 iterations. Split Bregman was performed with λ = 9, γ1 = 2, γ2 = 2, and 40 iterations.
Fig. 6. Denoising experiment with σ = 0.08 and no blurring. Shown are the input data f (RMSE 0.046872) and the results of L² gradient descent (RMSE 0.039935), Sobolev gradient descent (RMSE 0.034541), and split Bregman (RMSE 0.029299), each with an error histogram of u_exact − u over [−0.25, 0.25].
Fig. 7. Restoration experiment with σ = 0.08 and blur of standard deviation 0.6. Shown are the input data f (RMSE 0.094718) and the results of L² gradient descent (RMSE 0.037068), Sobolev gradient descent (RMSE 0.032020), and split Bregman (RMSE 0.030904), each with an error histogram of u_exact − u over [−0.25, 0.25].
Fig. 8. Restoration experiment with σ = 0.02 and blur of standard deviation 1.5. Shown are the input data f (RMSE 0.046872) and the results of L² gradient descent (RMSE 0.024279), Sobolev gradient descent (RMSE 0.029659), and split Bregman (RMSE 0.028305), each with an error histogram of u_exact − u over [−0.25, 0.25].
4 Conclusions
We proposed total variation-based minimizations for restoring MR images corrupted by Rician noise and blur. We solved the TV-Rician minimization problem using L² and H¹ gradient descent and split Bregman methods. Numerical experiments were performed on synthetic MRI data corrupted with blur and Rician noise, and the restored images as well as timing and RMSE values were given, illustrating the efficiency of the proposed algorithms. Such preprocessing steps can improve segmentation, an essential step in medical image analysis.
References

1. Abramowitz, M., Stegun, I.A.: Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, pp. 355–379. Dover, New York (1965)
2. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
3. Martucci, S.: Symmetric convolution and the discrete sine and cosine transforms. IEEE Trans. Signal Processing SP-42, 1038–1051 (1994)
4. Neuberger, J.W.: Sobolev Gradients and Differential Equations. Springer Lecture Notes in Mathematics, vol. 1670 (1997)
5. Basu, S., Fletcher, T., Whitaker, R.T.: Rician noise removal in diffusion tensor MRI. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4190, pp. 117–125. Springer, Heidelberg (2006)
6. Wang, Y., Zhou, H.: Total variation wavelet-based medical image denoising. International Journal of Biomedical Imaging 2006, 1–6 (2006)
7. Descoteaux, M., Wiest-Daesslé, N., Prima, S., Barillot, C., Deriche, R.: Impact of Rician adapted non-local means filtering on HARDI. In: Metaxas, D., Axel, L., Fichtinger, G., Székely, G. (eds.) MICCAI 2008, Part II. LNCS, vol. 5242, pp. 122–130. Springer, Heidelberg (2008)
8. Wiest-Daesslé, N., Prima, S., Coupé, P., Morrissey, S.P., Barillot, C.: Rician noise removal by non-local means filtering for low signal-to-noise ratio MRI: Applications to DT-MRI. In: Metaxas, D., Axel, L., Fichtinger, G., Székely, G. (eds.) MICCAI 2008, Part II. LNCS, vol. 5242, pp. 171–179. Springer, Heidelberg (2008)
9. Goldstein, T., Osher, S.: The split Bregman method for L1 regularized problems. SIAM J. Imaging Sci. 2(2), 323–343 (2009)
Robust Classification of Curvilinear and Surface-Like Structures in 3d Point Cloud Data

Mahsa Kamali¹, Matei Stroila², Jason Cho¹, Eric Shaffer¹, and John C. Hart¹

¹ University of Illinois Urbana-Champaign
² NAVTEQ
Abstract. The classification of 3d point cloud data is an important component of applications such as map generation and architectural modeling. However, the complexity of the scenes together with the level of noise in the data acquired through mobile laser range-scanning make this task quite difficult. We propose a novel classification method that relies on a combination of edge, node, and relative density information within an Associative Markov Network framework. The main application of our work is the classification of the structures within a point cloud into curvilinear, surface-like, and noise components. We are able to robustly extract complicated structures such as tree branches. The measures taken to ensure the robustness of our method generalize and can be leveraged in noise reduction applications as well. We compare our work with another state of the art classification technique, namely Directional Associative Markov Network, and show that our method can achieve significantly higher accuracy in the classification of the 3d point clouds.
1 Introduction
In the past decade, robust classification of point cloud data has been a topic of great interest, particularly in navigation environments and forest inventories, due to the wider availability of laser scanners [1], [2]. Classification of data in such scenarios is a very challenging task for two key reasons. Firstly, trees and many architectural structures lack a well-defined global geometry, making it difficult to learn a model or prototype from given data. Secondly, diverse geometrical structures stand out at their proper local scales. Hence, the classification algorithm needs to take into consideration the scale-hierarchical nature of the local arrangements of these structures. One typical approach for fast data labeling is applying a local feature space clustering method such as Mean-Shift, K-Means, or X-Means to the data. These techniques obviously suffer from the limitation of using only local statistics, which can easily lead to incorrect inference. In our early experiments, such techniques produced entirely inadequate classifications. A more refined approach consists in using both local features and neighborhood consistency for adequate point cloud classification. A representative tool,
widely used for this purpose, is a graphical model called the Associative Markov Network (AMN) [3], [4], and [5]. This is the framework used in this paper. In general, an AMN is a graphical model that is applied to a point cloud in order to associate each of its nodes (points) with a label. This classification process not only considers features associated with each node, but also reflects how well the guessed label of the node is consistent with its neighbors. As of today, there are relatively few studies on the construction of a suitable graphical model for accurate point cloud data classification. This is the focus of our paper. We examined several features of the point cloud data and found their best combination for robust classification of connected structures within this data using AMNs. The paper outline is as follows. We review relevant previous work in Section 2. In Section 3, we state the classification problem and the AMN solution framework. Section 4 describes our novel solution. Several results of our method are presented in Section 5. We finally conclude in Section 6.
2 Previous Work
Boykov et al. [4] survey the use of min-cut/max-flow algorithms for solving graph based classification problems. While the surveyed methods focus on solving the min-cut problem itself, they do not consider which features are most suitable for the particular purpose of point cloud structure classification. Local features that are most commonly used for classification of point cloud data are point intensity, point coordinates, neighborhood density, and principal component analysis (PCA) on a local neighborhood [6]. All the AMN based methods mentioned in this paper use one or a combination of these features as local information in order to perform the classification. Needless to say, if a local feature is computed inaccurately, these algorithms fail to generate a viable classification. Another challenging problem that AMN based methods face is choosing a correct neighborhood size for local computations. Unnikrishnan et al. [1] and Lalonde et al. [7] describe a method for choosing the best neighborhood size for calculating the normal at each point on a surface. They indicate that classification based on local features is more accurate when using PCA information extracted at the proper scale. Munoz et al. [3] use Directional Associative Markov Networks (DAMN) in order to extract structures that have a finite set of well-defined directions. This approach cannot be used for the extraction of the branches of a tree, since these structures do not follow a specific finite set of directions. Xiong et al. [8] and Munoz et al. [9] create newer subcategories of AMN which hierarchically take advantage of contextual information to further enhance the classification. However, these techniques fail when dealing with a twisted linear or curved surface which does not follow a specified global orientation. Shapovalov et al. [10] use a method that partially relies on AMNs for extracting connected segments in point clouds. While their work shows promising results, it is not able to classify data on arbitrary bases.
As discussed in this section, the existing AMN based techniques do not focus on choosing good features for extracting connected structures, such as tree branches or other curved natural structures, from point clouds. In this paper, we investigate several different combinations of local features to leverage the AMN technique for robust classification. Furthermore, we enhance the numerical stability of this technique. We present our method in detail in the following sections.
3 Data Classification Using Markov Networks
In this section, we formally state the classification problem and show how an AMN based classifier can be used to generate a solution.

3.1 Problem Description
As described in [3], we have a set of n random variables Y = {Y1, ..., Yn} and a scoring function. The goal is to find the assignment of values y = {y1, ..., yn} to Y that maximizes our scoring function. In our problem, the Yi are the 3d points and the values they can take correspond to their labels. Each point has a set of corresponding features, x, which are extracted from the data (e.g. geometry, color, intensity, PCA). The effect of the observed features on the labels is modeled by the conditional distribution Pw(y|x), which is parameterized by a vector w. Note that y is a vector of length n, thus it is the joint distribution of all node labels that is conditioned on all the features. In the following subsections, we explain how the distribution Pw(y|x) is modeled for our task and how w is estimated. Estimation of w can be formulated in terms of a supervised learning problem. Therefore the classification will have two main steps: 1) learning w from previously labeled data, and 2) inference for a new non-labeled scene based on the learned model Pw(y|x).

3.2 Associative Markov Networks
An Associative Markov Network (AMN) defines the joint distribution of the random variables Yi conditioned on the observed features x. The problem is associated with a weighted graph, where the random variables Yi (each variable is a 3d point in our problem) correspond to nodes, and each edge defines the interaction of a variable with its surrounding nodes. Each node has potential φi(yi) and each edge has potential φij(yi, yj). The potentials convert the features to numerical scores indicating their affinity to the labels. Construction of this graph (including the way potentials are defined) is task dependent and non-trivial. For our task of interest, we propose to build this graph by searching an ε-ball neighborhood of each node and creating an edge between that node and each node that falls inside the ball. The dependence of the potentials on their features x = {xi, xij} is modeled to be log-linear, where xi ∈ R^{d_n} is a node feature (such as 3d position, intensity, color, eigenvectors) and xij ∈ R^{d_e} is an edge feature (for example, the angle
between the dominant eigenvectors of the two nodes). Therefore, when a node is assigned a label k, the log of the node's potential is modeled as
\[
\log \phi_i(k) = w_n^k \cdot x_i.
\]
In the AMN framework, a variant of the Potts model is used, where the assignment of differently labeled nodes that share an edge is penalized, and the assignment of nodes with the same label is accepted. Therefore,
\[
\forall k:\ \log \phi_{ij}(k, k) \ge 0, \qquad \forall k \ne l:\ \log \phi_{ij}(k, l) = w_e^{k,l} \cdot x_{ij} = 0.
\]
We also mention that the weights and features are constrained to be non-negative. The log of the joint conditional probability function is
\[
\log P_w(y \mid x) = \sum_{i=1}^{N}\sum_{k=1}^{K} (w_n^k \cdot x_i)\, y_i^k + \sum_{(ij)\in E}\sum_{k=1}^{K} (w_e^{k,k} \cdot x_{ij})\, y_i^k y_j^k - \log Z_w(x),
\]
where Z_w(x) = \sum_y \prod_{i=1}^{N} \phi_i(y_i) \prod_{(ij)\in E} \phi_{ij}(y_i, y_j) is the partition function. Finding the assignment which maximizes this probability is, in general, an NP-hard problem [3]. However, for tasks involving only two labels, it is possible to compute the solution efficiently [4]. While this paper works with a 2-label problem (for instance, branches versus leaves), we mention that extending the problem to cover more labels is still possible if approximate solutions are tolerated. One example to achieve such an approximation is α-expansion [11]. For details on estimating the parameter vector w, please refer to [12]. We follow the sub-gradient method to maximize the score function, as explained in [3].
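(As an illustration of the graph construction step described above, the sketch below builds the ε-ball edge set with SciPy's KD-tree; the ε value, data, and output format are our own choices.)

```python
import numpy as np
from scipy.spatial import cKDTree

def build_amn_edges(points, epsilon):
    """Return the undirected edge list (i, j), i < j, connecting every pair of
    3-D points whose distance is at most epsilon."""
    tree = cKDTree(points)
    pairs = np.array(sorted(tree.query_pairs(r=epsilon)))  # shape (num_edges, 2)
    return pairs

# Example: random cloud, epsilon chosen to give a sparse neighborhood graph
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 10.0, size=(5000, 3))
edges = build_amn_edges(pts, epsilon=0.5)
print(edges.shape)
```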
4 Geometry Driven Associative Markov Networks
Munoz et al. [3] assert that regular AMNs do not produce promising classifications because AMNs do not consider global coordinate directions. The authors use position and principal directions as features of each point in the data. They claim that, although direction is a crucial piece of information for classification of the point data, it should not be directly used in the AMN. They suggest that, when there is no dominant direction in the neighborhood of a point, the AMN only relies on a spatial coherence assumption and labels all the nodes in that neighborhood as the same class. The assumption of a clean, outstanding direction in a neighborhood is not met in most LiDAR datasets due to noise. In order to cope with this problem, [3] modifies the original AMN to reduce its sensitivity to local directions and leverages global directional information instead. This is achieved by first quantizing the set of all possible directions and then considering the points from the entire cloud that have the same discretized direction to be in the same class. This is useful when the structure of the objects consists of a few dominant directions.
Fig. 1. Comparing edge features between DAMN and RDAMN on an artificially generated point set. (Left) Input data: structured points (white) and noise (grey). (Top middle) D_A: global direction A; D_ij: edge direction. (Bottom middle) n: eigenvector corresponding to the smallest eigenvalue of the local PCA; r: eigenvector corresponding to the largest eigenvalue. (Right) DAMN generalization (top) and RDAMN generalization (bottom).
Examples given in [3] include tree trunks and poles, which are mainly vertical structures, or power lines, which are mainly horizontal structures. However, the assumption of having a few dominant directions does not apply to objects like trees (except for their trunk). The branches and leaves of a tree are quite complex in shape and grow in arbitrary directions. Therefore, in this work, we leverage the continuity of directions to aid the classification process.

4.1 Relative Directional Associative Markov Networks (RDAMN)
In this paper, we propose to use edge-based directional information, as opposed to DAMN. As visualized in Figure 1, we use the angular difference between the points at the two sides of an edge as the edge feature for the AMN. Figures 1 and 2 show how RDAMN outperforms DAMN in separating noise from structure.

4.2 Adequate Feature Combination
There are many possible local properties that one can consider within an AMN-based classification scheme. We have explored the ones listed below:
Fig. 2. Comparing DAMN with RDAMN outcome for extracting regular structure (grey) vs. noise (white); left and right: Velodyne HDL-64E dataset; middle: artificial dataset.

– Position: 3d coordinates p_i = (x_i, y_i, z_i) for i = 1, 2, ..., n.
– Intensity: the amount of light reflected back.
– Local density ρ_i: the relative density of an ε-ball around the i-th point. Given that I is an indicator function that has value 1 if the condition is met and 0 otherwise, the relative density is defined as
\[
\rho_i = \frac{N_i}{\frac{1}{n}\sum_{k=1}^{n} N_k}, \qquad \text{where } N_i = \sum_{k=1}^{n} I_{\|p_i - p_k\| \le \varepsilon}.
\]
– Orientation: eigenvectors v_i ∈ R³ and eigenvalues λ_i ∈ R of the locally computed covariance matrix C_i: C_i v_i = λ_i v_i. Here
\[
\bar p_i = \frac{1}{N_i}\sum_{k=1}^{n} I_{\|p_i - p_k\| \le \varepsilon}\, p_k, \qquad C_i = \frac{1}{N_i}\sum_{k=1}^{n} I_{\|p_i - p_k\| \le \varepsilon}\,(p_k - \bar p_i)(p_k - \bar p_i)^T.
\]
Since v_i ∈ R³, we have three eigenvalues λ_{i0} ≤ λ_{i1} ≤ λ_{i2}. We use E_i = (λ_{i0}, λ_{i1} − λ_{i0}, λ_{i2} − λ_{i0})/λ_{i0} as (sphereness, surfaceness, cylinderness).

Although there exist several outstanding properties that can be used as features, we need to choose the most relevant subset to avoid over-fitting. Below is the list of the seven combinations that we used on our inputs:
Robust Classification of Curvilinear and Surface-Like Structures
705
Fig. 3. Classification results on an Optech LYNX dataset. Grey: points with surface or sphere-like neighborhoods; White: points with curvilinear or cylinder-like neighborhoods.
4.3
Logarithmic Barrier
One condition in Markov Random Fields probability model is that the weights vector w must be positive. In [3], this is achieved by replacing negative weights with zero throughout the optimization iterations. However, such sharp change in w affects the numerical stability of the optimization. In fact, we observed convergence difficulty with that implementation. We propose to use the logarithmic barrier method [13] that penalizes weights close to zero. This is a smooth way of maintaining positive weights.
5
Experimental Results
In order to test our algorithm on a wide variety of scenarios, we use input data gathered by Velodyne HDL-64E and Optech LYNX mobile LiDAR sensors, and also by Leica ScanStation2 stationary LiDAR sensor. Each system has distinct properties. The HDL-64E LiDAR data is dense but noisy, while the LYNX data is sparser but much less noisy. We use a KD-Tree data structure to extract -ball neighborhoods of points, computed in parallel and reasonably fast. For principal component analysis, we use GNU GSL1 ’s optimized mathematics library for efficient singular value decomposition. Regarding the performance, for labeling roughly between 50,000 to 100,000 points on an Intel core i7 Q720 Windows 7 system, it takes approximately 3 to 5 minutes for learning and 15 to 30 seconds for inference. For our quantitative experimental comparison, we used sub-sections of data from the 2011 3DIMPVT challenge dataset2 , which was acquired by Leica 1 2
http://www.gnu.org/software/gsl http://www.3dimpvt.org/challenge.php
706
M. Kamali et al.
Fig. 4. (left) Data 2 (right) Data 4 training input (sections from Leica ScanStation2 dataset), viewed from two different angles
Fig. 5. (left) Algorithm labeling based on learned inputs from Figure 4, on Leica ScanStation2 dataset sections. (Right) white: correct labeling, grey: incorrect labeling. (Top two rows) Data 2 RDAMN on top row and DAMN at the bottom row, (Bottom two rows) Data 4 RDAMN on top row and DAMN at the bottom row.
Robust Classification of Curvilinear and Surface-Like Structures
707
Table 1. Three way cross validation correct label count RDAMN vs. DAMN Training Data Test on Points Count Correct Node Count RDAMN DAMN Data 2 12291 9819 3852 Data 1 Data 3 39503 36277 24318 Data 1 32945 21078 13560 Data 2 Data 3 39503 24430 15339 Data 1 32945 28217 25273 Data 3 Data 2 12291 10157 5733
ScanStation2. It contains a number of accurate range scans in New York City. The dataset contains human-made structures as well as some points caused by sensor noise. We chose a goal of removing noise from these structures. We define structures as all the structures that have reasonably a smooth surface. This automatically excludes all the visually scattered vegetation points. We used one dataset as reference data for training both our RDAMN algorithm and the DAMN algorithm [3], and three other datasets as test data. To ensure fairness, we ran a 3 way cross validation by alternating the dataset for which the training occurred. The 3 data sets are hand labeled and hence the learning can still vary a bit from set to set. Table reftbl:table1 lists the quantitative comparison between DAMN and our method, RDAMN. Figures 4 and 5 show two samples from our test data.
6
Conclusion
In this paper we introduced a technique for classification of LiDAR point cloud data, using the AMN framework. Namely, we examined several local features of the point cloud data and found their best combination for robust classification of connected structures within this data. This technique has many real applications; in particularl noise removal from LiDAR data for modeling or other industrial purposes. This work can be extended in several ways. Regarding the local features, the use of robust principal component analysis [14] looks promising when dealing with noisy data. Additionally, more recent LiDAR datasets have fairly accurate color information as a result of using cameras and advanced sensor fusion algorithms. Since diverse regular structures tend to have locally constant color, this feature could significantly improve the classification.
References 1. Unnikrishnan, R., Lalonde, J.F., Vandapel, N., Hebert, M.: Scale selection for the analysis of point-sampled curves. In: Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT 2006), pp. 1026–1033. IEEE Computer Society, Washington, DC, USA (2006)
708
M. Kamali et al.
2. Golovinskiy, A., Kim, V.G., Funkhouser, T.: Shape-based recognition of 3D point clouds in urban environments. In: International Conference on Computer Vision, ICCV (2009) 3. Munoz, D., Vandapel, N., Hebert, M.: Directional associative markov network for 3-d point cloud classification. In: Fourth International Symposium on 3D Data Processing, Visualization and Transmission (2008) 4. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1124–1137 (2004) 5. Munoz, D., Vandapel, N., Hebert, M.: Onboard contextual classification of 3-d point clouds with learned high-order markov random fields. In: Proceedings of the 2009 IEEE international conference on Robotics and Automation, ICRA 2009, pp. 4273–4280. IEEE Press, Piscataway (2009) 6. Jolliffe, I.: Principal component analysis. Springer series in statistics. Springer, Heidelberg (2002) 7. Lalonde, J.F., Unnikrishnan, R., Vandapel, N., Hebert, M.: Scale selection for classification of point-sampled 3-d surfaces. In: Proceedings of the Fifth International Conference on 3-D Digital Imaging and Modeling, pp. 285–292. IEEE Computer Society, Washington, DC, USA (2005) 8. Xiong, X., Munoz, D., Bagnell, J.A.D., Hebert, M.: 3-d scene analysis via sequenced predictions over points and regions. In: IEEE International Conference on Robotics and Automation, ICRA (2011) 9. Munoz, D., Bagnell, J.A.D., Vandapel, N., Hebert, M.: Contextual classification with functional max-margin markov networks. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR (2009) 10. Shapovalov, R., Velizhev, A., Barinova, O.: Non-associative markov networks for 3d point cloud classification. Networks XXXVIII, 103–108 (2010) 11. Boykov, Y., Veksler, O., Zabih, R.: Efficient approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (2001) 12. Taskar, B., Chatalbashev, V., Koller, D.: Learning associative markov networks. In: International Conference on Machine Learning 13. Bertsekas, D., Nedi´c, A., Ozdaglar, A.: Convex analysis and optimization. Athena Scientific optimization and computation series. Athena Scientific, Belmont (2003) 14. Cand`es, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58, 11:1–11:37 (2011)
Orthographic Stereo Correlator on the Terrain Model for Apollo Metric Images Taemin Kim1, Kyle Husmann1, Zachary Moratto1 and Ara V. Nefian1,2 1
NASA Ames Research Center, Moffett Field, CA, 94035 2 Carnegie Mellon University
Abstract. A stereo correlation method on the object domain is proposed to generate the accurate and dense Digital Elevation Models (DEMs) from lunar orbital imagery. The NASA Ames Intelligent Robotics Group (IRG) aims to produce high-quality terrain reconstructions of the Moon from Apollo Metric Camera (AMC) data. In particular, IRG makes use of a stereo vision process, the Ames Stereo Pipeline (ASP), to automatically generate DEMs from consecutive AMC image pairs. Given camera parameters of an image pair from bundle adjustment in ASP, a correlation window is defined on the terrain with the predefined surface normal of a post rather than image domain. The squared error of back-projected images on the local terrain is minimized with respect to the post elevation. This single dimensional optimization is solved efficiently and improves the accuracy of the elevation estimate.
1
Introduction
Topographical maps are an essential tool for scientists interested in exploring and learning more about planetary bodies like the moon or mars. These maps allow scientists do everything from identifying geological phenomena to identifying potential landing sites for probes or spacecraft. Satellites and other spacecraft that visit planetary bodies of interest are usually equipped with a variety of sensors, some of which can be used to recover the topography of the planetary surface. LiDAR (Light Detection And Ranging) sensors give sparse (but highly accurate) measurements at periodic points called “posts.” Raw images that are captured as the satellite orbits the planetary body can also be processed to create highly detailed topographical maps. By registering and aligning these two data sets, maps can be created that are both dense and accurate. Given two images of the same scene taken from slightly different perspectives, the relative shift of objects from frame to frame (known as “disparity”) is related to distance of the object: far objects appear to move less than close objects. This phenomenon should be familiar, since the human brain uses this relationship to create depth perception from the differences in perspective seen by both eyes. Similarly, depth information from orbital imagery can be recovered in areas where the images overlap by matching points in the images that correspond to the same 3D location and measuring their disparity. By using the disparity along with the location and orientation of the spacecraft as well as a mathematical model for the lens of the camera, the precise G. Bebis et al. (Eds.): ISVC 2011, Part I, LNCS 6938, pp. 709–717, 2011. © Springer-Verlag Berlin Heidelberg 2011
710
T. Kim et al.
location of the 3D point can be calculated. This technique of using matching points between images to calculate 3D locations is known as “stereophotogrammetry” [6]. Before computers automated this task, points in orbital images were manually matched using mechanical stereoplotters. Now, computers can perform this once long and arduous process with minimal human interaction. For example, the Ames Stereo Pipeline (ASP) [2, 5] is a collection of cartographic and stereogrammetric tools for automatically producing digital elevation maps (DEMs) from orbital images acquired with the Apollo Metric Camera (AMC) during Apollo 15-17 (Figure 1). A new stereo correlation method is proposed on the terrain model to improve the accuracy of the DEM. This paper will address this problem by finding robust elevation from an image pair that minimizes the discrepancy between two back-projected patches on the local planar terrain. The proposed method determines the elevation by minimizing the squared error of the back-projected image patches. The accuracy of DEMs produced by IRG will thus be improved. The reconstructed DEMs from lunar orbital imagery are presented. The paper is outlined as follows: ASP is introduced briefly in section 2. Orthographic stereo correlation method is proposed in section 3. Experimental results are shown to validate the proposed method. The paper concludes in section 5.
(a) The Mapping Cameras System
(b) Apollo 15 Orbit 33
Fig. 1. AMC Data System. (a) The AMC captures a series of pictures of the Moon's surface. (b) Satellite station positions for Apollo Orbit 33 visualized in Google Moon.
2
Ames Stereo Pipeline
The Ames Stereo Pipeline (ASP) is the stereogrammetric platform that was designed to process stereo imagery captured by NASA spacecraft and produce cartographic products since the majority of the AMC images have stereo companions [10]. The entire stereo correlation process, from an image pair to DEM, can be viewed as a multistage pipeline (Figure 2). Preprocessing includes the registration to align image pairs and filtering to enhance the images for better matching. Triangulation is used at the last step
Orthographic Stereo Correlator on the Terrain Model for Apollo Metric Images
711
to generate a DEM from the correspondences. Bundle adjustment and stereo correlation are reviewed briefly here for the context although they are introduced in [18].
2.1
Bundle Adjustment
Bundle Adjustment (BA) corrects the three-dimensional postures of cameras and the locations of the objects simultaneously to minimize the error between the estimated location of the objects and their actual location in the images. Camera position and orientation errors affect the accuracy of DEMs produced by the ASP. If they are not corrected, these uncertainties will result in systematic errors in the overall position and slope of the DEMs. BA ensures that observations in multiple different images of a single ground feature are self-consistent. In BA the position and orientation of each camera station are determined jointly with the three-dimensional position of image tie-points chosen in the overlapping regions between images. Tie-points are automatically extracted using the SURF robust feature extraction algorithm [14]. Outliers are rejected using the random sample consensus (RANSAC) method [15]. The BA in ASP determines the best camera parameters that minimize the reprojection error [16]. The Levenberg-Marquardt algorithm is used to optimize the cost funciton [17].
Fig. 2. Dataflow of the Ames Stereo Pipeline. Preprocessing includes the registration and Laplacian of Gaussian (LoG) filtering of the image pair. A stereo correlator (disparity initialization and sub-pixel refinement) constructs the disparity map based on normalized cross correlation. DEMs are generated by a triangulation method in which corrected camera poses are obtained by bundle adjustment.
2.2
Stereo Correlation
Stereo correlation computes pixel correspondences of the image pair (Figure 3). The map of these correspondences is called a disparity map. The best match is determined by minimizing a cost function that compares the two windows in the image pair. The normalized cross correlation is robust to slight lighting and contrast variation in between a pair of images [11]. For large images, this is computationally very expensive,
712
T. Kim et al.
so the correlation process is split into two stages. (1) In the initialization step coarse correspondences are obtained by a multi-scale search that is highly optimized for speed (Figure 3c). (2) Correlation itself is carried out by sliding a small, square template window from the left image over the specified search region of the right image (Figure 3d-f).
(a) Left Image
(b) Right Image
(c) Initialized Disparity Map
(d) Parabola Refined Map
(e) Cauchy Refined Map
(f) Bayesian Refined Map
Fig. 3. Stereo Correlation. (a-b) An image pair. (c-f) Horizontal disparity maps. (c) The fast discrete correlator constructs the coarse disparity map. (d-f) Refined disparity maps from the initialized map. (f) Bayesian sub-pixel correlator generates a smoother map than the others.
Refining the initialized disparity map to sub-pixel accuracy is crucial and necessary for processing real-world data sets [13]. The Bayesian expectation maximization (EM) weighted affine adaptive window correlator was developed to produce high quality stereo matches that exhibit a high degree of immunity to image noise (Figure 3f). The Bayesian EM sub-pixel correlator also features a deformable template window that can be rotated, scaled, and translated as it zeros in on the correct match. This affine-adaptive window is essential for computing accurate matches on crater or canyon walls, and on other areas with significant perspective distortion due to foreshortening. A Bayesian model that treats the parameters as random variables was developed in an EM framework [21]. This statistical model includes a Gaussian mixture component to
Orthographic Stereo Correlator on the Terrain Model for Apollo Metric Images
(a) Left Image
(b) Right Image
713
(c) Rendered DEM
Fig. 4. Generation of DEM. (a) and (b) Apollo Metric Camera image pair of Apollo 15 site. (c) A DEM of Hadley Rille is rendered from an image pair.
model image noise that is the basis for the robustness of the algorithm. The resulting DEM is obtained by triangulation method (Figure 4).
3
Orthographic Stereo Correlation
To improve the DEMs produced by IRG, we proposed to define error measure in object domain rather than image domain for lunar orbital images. We propose to use a planar approximation of lunar terrain, which define the relationship of two views by a homography representation. To model lunar reflectance, we propose to use a linear reflectance approximation. The statistical behavior of the photons is model by the normal distribution to derive the cost function that compares the two view windows. The proposed method will replace the pair-wise sub-pixel refinement and triangulation currently used in ASP. A linear approximation of the lunar terrain and reflectance simplifies the correlation function. The lunar surface is smooth, but not flat (Figure 5a) and has its own reflectance (Figure 5c). The terrain is assumed to be locally planar because the correlation patches are determined to be small enough to cover the planar region (Figure 5b). Similarly, the lunar reflectance is assumed to be locally linear because geometric and photometric changes are small enough to form a linear relationship (Figure 5d). A planar approximation of the terrain provides a homography representation of two-view correspondences. Suppose we have a textured plane that approximates the local terrain (Figure 6). Let π be the surface plane with the normal vector n and distance d from the origin, i.e., π : nT x = d . Let I be the orbital image and f be the orthographic image of I onto π . The surface albedo having linear relationship with I , i.e., f (z ) = b + cI ( z ) for all z ∈ π . Suppose we have an image pair viewing the same scene on π with projection matrices P1 and P2 .
714
T. Kim et al.
P1
P2
P1
P2
ʌ (a) Smooth Terrain
(b) Planar Terrain
I
(c) Lunar reflectance
a b c
(d) Linear reflectance
Fig. 5. Linear Approximation of Lunar Terrain and Reflectance. (a) Smooth local terrain in a small field of view is approximated by (b) planar one. (c) The lunar reflectance model is approximated by a linear function. (d) The linear reflectance is a linear function of surface albedo (A).
H1
n
H2 f
I1 I2
Fig. 6. Homography Representation of Stere Correspondences. Homographies of the image planes induced by the local plane π, H i : f i a I i for i = 1, 2 , determine correspondences of the image pair.
To determine the geometric and reflectance parameters, we propose to minimize the squared error of all corresponding patches on the terrain. From the fact that the quantity of incident photons from the scene follows the normal distribution, the negative log-likelihood of scene radiance is derived:
L(b, c, d ; τ) = ∫ g (z − τ) { f1 (z) − f 2 (z)} dz . 2
(1)
where f i (z ) = bi + ci I i (z ) , b and c are coefficient vectors of linear reflectance, and
〈•〉 τ = g (τ)* • is the local Gaussian convolution operator. The optimizer of (1) may not exist in a closed form, but some parameters are simply optimized from the other parameters. To minimize the squared error, we propose to employ an alternating
Orthographic Stereo Correlator on the Terrain Model for Apollo Metric Images
715
optimization scheme. First, we solve for b and c because there is a closed-form solution given d in (1). Then we solve for d in (1), while keeping surface albedo constant with the solutions from (1).The initialized disparity maps from ASP can be used to calculate the initial value of d .
4
Experimental Results
The orthographic stereo correlator is implemented based on the NASA Vision Workbench (VW). The NASA VW is a general purpose image processing and computer vision library developed by the IRG at the NASA Ames Research Center. The linear reflectance coefficients are estimated in the nested optimization process. Since this is a single dimensional optimization with respect to the elevation, we adopted a hybrid golden section and parabolic interpolation method. Figure 7 shows an image pair from Apollo metric images. A Gaussian window with scaling factor 5 is used for correlation window by 33*33 pixels. Back-projected images are generated from an image pair with bicubic interpolation. The surface normal is determined by the radial direction of the local terrain because the lunar terrain is smooth enough. Figure 8a shows a stereo DEM constructed from integer disparity map. We can observe the pixel locking effect in the initial DEM with many holes. With this initial DEM, a refined DEM is reconstructed in Figure 8b. As you can see in the figure, the refined DEM gets rid of pixel locking artifact completely. There are small bumps and holes because the initial DEM has the large error and optimizer falls in local minima. Even though the surface normal is assumed roughly to have the radial direction of each post, the reconstructed terrain is reconstructed smooth enough and robust.
(a) AS15-M-1090
(b) AS15-M-1091
Fig. 7. A Stereo Image Pair. Two images from Apollo 15 Orbit 33 imagery have different illumination and noises such as dust and missing data.
716
T. Kim et al.
(a) Stereo Initial DEM
(a) Refined Dense DEM
Fig. 8. Dense DEM Reconstruction. Color-mapped, hill-shaded DEM was reconstructed from from an image pair in Figure 7.
5
Conclusion
An orthographic stereo correlation method was proposed to reconstruct the accurate and dense Digital Elevation Models (DEMs) from lunar orbital imagery. The proposed method addresses this problem by making use of linearity in geometry and photometry to improve both the accuracy and robustness of the stereo correlation process. Given camera parameters of an image pair and an initial DEM, the DEM is refined to acquire the sub-pixel accuracy. A correlation window on the terrain with the predefined surface normal of a post is used to define the squared error of back-projected images on the local terrain. The squared error is minimized with respect to the post elevation. This forms a single dimensional optimization by nesting the linear reflectance optimization. For the future work, the surface normal can be estimated together with the elevation.
Acknowledgement. This research was supported by an appointment to the NASA Postdoctoral Program at the Ames Research Center, administered by Oak Ridge Associated Universities through a contract with NASA.
Collaborative Track Analysis, Data Cleansing, and Labeling
George Kamberov¹, Gerda Kamberova², Matt Burlick¹, Lazaros Karydas¹, and Bart Luczynski¹
¹ Department of Computer Science, Stevens Institute of Technology, Hoboken, NJ 07030, USA
{gkambero,mburlick,lkarydas,bluczyns}@stevens.edu
² Department of Computer Science, Hofstra University, Hempstead, NY 11549, USA
[email protected]
Abstract. Tracking output is a very attractive source of labeled data sets that, in turn, could be used to train other systems for tracking, detection, recognition, and categorization. In this context, long tracking sequences are of particular importance because they provide richer information: multiple views and a wider range of appearances. This paper addresses two obstacles to the use of tracking data for training: noise in the tracking data, and the unreliability and slow pace of hand labeling. The paper introduces a criterion for detecting inconsistencies (noise) in large data collections and a method for selecting typical representatives of consistent collections. These are used to build a pipeline which cleanses the tracking data and employs instantaneous (shotgun) labeling of vast numbers of images. The shotgun-labeled data is shown to compare favorably with hand-labeled data when used in classification tasks. The framework is collaborative – it involves a human-in-the-loop but it is designed to minimize the burden on the human.
1 Introduction
In this paper: (i) We propose a framework for detection and cleansing of tracking output noise due to data association errors and to spurious track births or premature expirations. The framework is based on clustering with respect to a space-time distance and the notions of initial and terminal appearances of a tracked object; (ii) We describe a procedure for labeling the tracking output with minimal human intervention. The procedure uses an information-theoretic approach to detect inconsistencies in the tracker output and to extract an iconic representation for each consistent track; and (iii) We apply the framework and the labeling procedure in the design and construction of a pipeline for massive (shotgun) labeling of objects from video-based trackers. The pipeline has three principal stages. The first stage, described in Section 3.1, is designed to correct association errors as much as possible. We use a track segmentation method based on a space (features)-time metric. The second stage, described in Section 3.2, is designed to address the tracker noise caused by the births of spurious targets produced when the trackers "lose their targets" and generate new tracks midstream. This leads to an explosion in the number of tracks and thus can have a very detrimental effect on the performance of human labelers. In the final stage, we test the consistency of the consolidated tracks and then select automatically iconic representatives from the consistent
Fig. 1. (Left) Shotgun Pipeline data flow: from raw tracks to consolidated consistent tracks – note that several frames were deleted because there was a mismatch between the final and initial appearances of the constituent tracklets. (Right) A diagram of the Shotgun Pipeline architecture.
tracks. The iconic representatives are labeled manually and their labels are propagated automatically throughout the consolidated tracks. See Figure 1 for a pipeline overview. We use tracking data from a three-hour surveillance video of vessels in a busy estuary, captured by a distant on-shore camera, to test the framework and the pipeline for data cleansing and analysis and target labeling in a cluttered environment. The experimental results are reported in Section 4.

1.1 Related Work
There has been considerable interest in detecting and cleaning mislabeled data for training in the AI and data mining communities, [2], [1], [6], [8], and for face detection and recognition, [17]. These approaches boil down to taking an already labeled data set and applying cross validation of some categorization/recognition method based on a pre-selected set of classifiers or by using case-based learning. Semi-supervised learning (SSL) can be used to minimize the tedious hand-labeling task [3,4,11,10,19] by starting with a small kernel of reliably hand-labeled data and a large collection of unlabeled data. The initial labeled data is used to learn a classifier; the classifier is applied to label part of the unlabeled data; then a new classifier is learned from all the currently labeled data. The process is iterative. The success of an SSL approach depends very much on the selection of the initial training sets, the features, and suitable similarity functions and kernels. In contrast, we present a method to detect the data categories and to label large
sets of data with minimal human intervention. To select the categories represented in the data and to assign labels, we develop an automatic method for extraction of a small number of most informative exemplars – called iconics. The exemplar-based analysis of data is a staple of active learning (AL). In a typical AL framework, the machine learning algorithm uses a small initial set of labeled data and a dialogue with an expert to select (and label) the data from which it learns and to set by hand appropriate tuning parameters [13,7,14,5,15,18,12]. Our method does not have tuning parameters and does not rely on an initial correctly labeled set.
2 Track Consistency Criterion and Iconics Selection

2.1 Background
For noise detection, track segmentation, and selection of iconic representatives, we exploit mutual and disjoint information together with a space-time metric built from the feature metric used by the tracker. The principal idea is to search for noise by measuring when two feature vectors in a sequence of tracking data are more different than alike. We use the term random variable to denote both scalar- and vector-valued random variables. Since we work only with discrete variables and finite sample spaces, the sample space could be enumerated, the problem re-mapped, and a univariate notation used as appropriate. We follow standard conventions: we use P to denote a probability measure; random variables are denoted by capital letters and their values (samples) by lower-case letters. For a discrete random variable X with a sample space X, the point mass function is denoted by p_X, p_X(x) = P[X = x], x ∈ X, and the entropy H(X) is given by H(X) = −\sum_{x ∈ X} p_X(x) log p_X(x). Since we deal only with discrete random variables, without danger of misunderstanding we use the term "random variable" to mean a "discrete random variable" and "probability distribution" to refer to the point mass function. The joint probability distribution of two random variables X and Y, with sample spaces X and Y, respectively, is defined by p_{X,Y}(x, y) = P[X = x, Y = y], for x ∈ X, y ∈ Y. When there is no confusion, we will omit the subscripts, and thus use p(x, y) instead of p_{X,Y}(x, y). Next, we summarize the relevant definitions from information theory. Let X and Y be two random variables with finite sample spaces X and Y, respectively, and m = |X|, n = |Y|. We enumerate the elements of the sample spaces, X = {x_1, ..., x_m}, Y = {y_1, ..., y_n}. The joint Shannon entropy H(X, Y), the mutual information I(X, Y), and the disjoint information D(X, Y) are defined as follows:
H(X, Y) = −\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log p(x_i, y_j),   (1)

I(X, Y) = H(X) + H(Y) − H(X, Y),   (2)

D(X, Y) = H(X, Y) − I(X, Y).   (3)
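For concreteness, a small sketch of these quantities for a pair of feature vectors follows. The paper does not spell out how the joint distribution is inferred from two vectors, so the sketch simply bins the paired entries into a 2-D histogram; the bin count and this binning choice are illustrative assumptions.

```python
import numpy as np

def entropies(x, y, bins=16):
    """H(X), H(Y), and H(X, Y) estimated by histogramming the paired entries
    of the two feature vectors (an assumption made for this sketch)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return H(p_x), H(p_y), H(p_xy.ravel())

def mutual_disjoint(x, y, bins=16):
    """Mutual information I(X, Y) (Eq. 2) and disjoint information D(X, Y) (Eq. 3)."""
    hx, hy, hxy = entropies(x, y, bins)
    I = hx + hy - hxy
    return I, hxy - I
```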
2.2 Detection of Inconsistent Instances in Data Collections
Given a feature vector x = (x_1, x_2, ..., x_n), for example an image or a histogram, it is interpreted as a sample of size n of a random variable X, and one can infer the probability distribution p_X from the sample. In the case of two feature vectors x and y, by inferring the probability distributions of the corresponding random variables X and Y, one can answer the question whether the collections x and y are more different than they are alike by comparing the disjoint and mutual information of X and Y. When there is no danger of confusion, we say, for short, "the disjoint and mutual information of the feature vectors" and write D(x, y) and I(x, y), instead of "the mutual and disjoint information, D(X, Y) and I(X, Y), of the corresponding random variables X and Y." More generally, we say that the collection S of feature vectors is possibly inconsistent if there exist feature vectors x and y in S for which the disjoint information is sufficiently higher than the mutual, i.e., D(x, y) − I(x, y) > α, for some threshold α ≥ 0. To detect possible class noise (misclassifications) in the collection S, one can attempt to search for appropriate threshold levels. Instead, we exploit a metric d over all possible feature vectors, d : X × X → [0, ∞), and consider the collection inconsistent if a most distant pair of features with respect to the metric is more different than alike in terms of the disjoint/mutual information.

Definition 1: We will say that the collection of feature vectors S is potentially inconsistent (contains class noise or misclassifications) if a most distant pair of feature vectors x̂ and ŷ, defined by

(x̂, ŷ) = \arg\max_{(x, y) \in S \times S} d(x, y),   (4)

satisfies

D(x̂, ŷ) > I(x̂, ŷ).   (5)

In particular, a track τ = {x_{t_1}, ..., x_{t_n}} is a time-stamped collection (a stream) of feature vectors endowed with a natural metric d(x_{t_i}, x_{t_j}) = g_f(x_{t_i}, x_{t_j}) + c^2 |t_i − t_j|^2. Here g_f is a metric in feature space and

c = \max_{t_i < t_j} \frac{1}{|t_i - t_j|}\, g_f(x_{t_i}, x_{t_j}).   (6)
Fig. 2. The evolution of the appearance of an object in a track has three stages: initial, B(τ), mature, M(τ), and terminal, T(τ). Furthermore, the initial and terminal stages are partitioned into unstable and stable clusters of feature vectors (appearances), respectively.
We introduce a criterion for track consistency. The criterion may require a human-in-the-loop to inspect no more than two frames in the track. Let x_{t'} and x_{t''} be as far apart as possible in the track τ, i.e.,

(x_{t'}, x_{t''}) = \arg\max_{(x, y) \in \tau \times \tau} d(x, y).   (7)

A track τ is automatically consistent if I(x_{t'}, x_{t''}) > D(x_{t'}, x_{t''}), where (x_{t'}, x_{t''}) satisfies (7). If a track is not automatically consistent, then the final decision whether it is consistent or not is made by a human-in-the-loop. The human inspects only the two images (track frames) corresponding to x_{t'} and x_{t''}. Thus inconsistent tracks are diagnosed by a human-in-the-loop (or another external decision maker), but this intervention is kept to a minimum. With this criterion at hand, we cleanse tracking data by simply discarding inconsistent tracks.

2.3 Iconic Images and Track Consistency
To label in one shot all the images corresponding to a consistent track, we extract automatically a single iconic image (IIM) from each consistent track. The IIM has to be informative and not overwhelming for the human. To assure that the representative of a noise-free collection is optimal in an information-theoretic sense, we propose to use as an iconic feature vector (IFV) a feature vector that minimizes the average pairwise disjoint information over all other feature vectors in the collection.

Definition 2: A feature vector x̃ ∈ S is an iconic feature vector of the collection S if

x̃ = \arg\min_{x \in S} \frac{1}{|S|} \sum_{y \in S} D(x, y).   (8)
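The two selections above can be sketched as follows, reusing the mutual_disjoint helper sketched earlier; d is any pairwise metric supplied by the caller, and the function names are ours.

```python
import numpy as np
from itertools import combinations

def most_distant_pair(S, d):
    """Indices of a most distant pair in S under the metric d (Eq. 4 / Eq. 7)."""
    return max(combinations(range(len(S)), 2), key=lambda ij: d(S[ij[0]], S[ij[1]]))

def automatically_consistent(S, d, bins=16):
    """Automatic part of the consistency test: the collection passes when the
    most distant pair has I > D (Eq. 5); otherwise a human inspects that pair."""
    i, j = most_distant_pair(S, d)
    I, D = mutual_disjoint(S[i], S[j], bins)
    return I > D, (i, j)

def iconic_feature_vector(S, bins=16):
    """Index of the iconic feature vector (Eq. 8): the element minimizing the
    average pairwise disjoint information over the collection."""
    avg_D = [np.mean([mutual_disjoint(S[i], S[k], bins)[1] for k in range(len(S))])
             for i in range(len(S))]
    return int(np.argmin(avg_D))
```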
2.4 Initial and Terminal Appearance of a Track
We model the evolution of the appearance of an object in a track τ = {x_{t_1}, ..., x_{t_n}} as a three-stage process: an initial stage B(τ) = {x_{t_1}, ..., x_{t_φ}}, a mature (stable) stage
Fig. 3. (Left) A sample frame from the surveillance video with two tracked objects. (Right) The tracked objects' background subtraction-based representations; the histograms of these representations are the feature vectors of the objects in this frame.
M(τ) = {x_{t_{φ+1}}, ..., x_{t_{n−φ−1}}}, and a terminal stage T(τ) = {x_{t_{n−φ}}, ..., x_{t_n}}. See Figure 2. We assume that the initial stage and the terminal stage have at most φ frames. For tracks obtained from a background-subtraction tracker, which uses running averages to estimate the background layer, φ equals the size of the history window. When the size of the history window is not known or is not informative (for example, for difference-based trackers which use single-frame differences), we set φ to a third of the length of the track. The concepts of stable and unstable initial and terminal clusters are introduced to delineate the relatively unstable appearance of an object soon after its birth and just before its expiration. To determine the initial and the terminal stable and unstable appearance clusters, we use K-medoids, K = 2, to cluster the object feature vectors in the track. The metric used in this clustering is the feature space metric g_f(·, ·) from equation (6). By definition, the stable appearance cluster in the initial (terminal) stage is the one for which most of the objects are closer (in time order) to the interior, mature stage M(τ) of the track. Formally, let B_1 and B_2 be the two clusters of B(τ). The relative ordering of B_1 and B_2 is with respect to their median time stamps, i.e.,

med({t | x_t ∈ B_1}) ≤ med({t | x_t ∈ B_2}).   (9)

Similarly, let T_1 and T_2 be the two clusters of terminal features; then T_1 and T_2 are reverse-ordered with respect to their members' median time stamps. Thus,

med({t | x_t ∈ T_2}) ≤ med({t | x_t ∈ T_1}).   (10)
Then B_2 and T_2 represent the stable initial and terminal appearance clusters of the track.

Definition 3: Let μ(B) and μ(T) denote the medoid feature centers of the clusters B_2 and T_2, respectively. The initial and final appearance feature vectors S(τ) and E(τ) are the stable feature vectors closest to these medoids, namely

S(τ) = \arg\min_{x_t \in B_2} g_f(x_t, μ(B)),   (11)

E(τ) = \arg\min_{x_t \in T_2} g_f(x_t, μ(T)).   (12)
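A sketch of the stable-cluster selection for the initial stage follows (the terminal stage is symmetric, with the ordering reversed). The plain 2-medoids loop is only illustrative; any K-medoids routine over the feature metric g_f would do.

```python
import numpy as np

def two_medoids(D, iters=20):
    """K-medoids with K = 2 on a precomputed distance matrix D (distances under g_f)."""
    medoids = [0, D.shape[0] - 1]
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        new = []
        for k in range(2):
            members = np.where(labels == k)[0]
            new.append(int(members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]))
        if new == medoids:
            break
        medoids = new
    return labels, medoids

def initial_appearance(B, timestamps, gf):
    """Stable initial cluster B2 and the index of S(tau) within B (Eqs. 9 and 11).
    B is the list of feature vectors of the initial stage, in time order."""
    timestamps = np.asarray(timestamps)
    D = np.array([[gf(a, b) for b in B] for a in B])
    labels, medoids = two_medoids(D)
    # Order the two clusters by the median time stamp of their members (Eq. 9);
    # the later one (B2) is the stable cluster, closer to the mature stage M(tau).
    med_t = [np.median(timestamps[labels == k]) for k in range(2)]
    stable = int(np.argmax(med_t))
    members = np.where(labels == stable)[0]
    # S(tau): the stable feature vector closest to the medoid of B2 (Eq. 11).
    return int(members[np.argmin(D[members, medoids[stable]])])
```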
Table 1. Binary categorization (recognition success rates out of 10 runs) using hand-labeled training data HL (285 hand-labeled images for training, 95 per each of the three categories) and Shotgun-labeled training data SL (4680 Shotgun-labeled features, 1560 per category, obtained by examining only 216 images). For the definition of the success rate SR see equation (13).

Vessel      Mean of True Positives    Std of True Positives    Success Rate SR
            HL        SL              HL        SL             HL        SL
Speedboat   80.87%    85.25%          11.00%    4.00%          78.75%    81.25%
Ferry       92.25%    96.50%          2.00%     2.00%          91.25%    96.25%
Cruise      84.00%    84.87%          6.00%     3.00%          73.75%    81.25%
3 Shotgun Pipeline
We are now ready to build a pipeline that will be used to label in one sweep the images in multiple tracks. The input to the pipeline, Figure 1, is a collection of tracks {τ_i}_{i=1}^{κ} extracted from a surveillance video over some period of time. The pipeline stages are: track segmentation/deconstruction; consolidation; consistency check and iconic extraction; and a single shotgun labeling of all the consistent tracks.

3.1 Track Segmentation
We segment each track into tracklets using unsupervised agglomerative clustering and the metric d defined in (6) (see the code sketch after Section 3.2). If the task were to really understand the optimal segmentation of the track into consistent tracklets, one would need to develop an adequate stopping criterion. This is not our goal. Here we are only concerned with creating a few relatively long tracks that are consistent. We use gap-statistic-driven clustering [16] to obtain an initial segmentation of each track into tracklets, each of which is more consistent than the original track. At the end of the segmentation stage we obtain, completely automatically, all the resulting tracklets {τ_i^1, ..., τ_i^{n_i}}_{i=1}^{κ}. We treat each one of them as a track and extract their initial and terminal feature vectors {S(τ_i^1), ..., S(τ_i^{n_i})}_{i=1}^{κ} and {T(τ_i^1), ..., T(τ_i^{n_i})}_{i=1}^{κ}. This is a crude segmentation: it could result in over-segmentation, and by the very nature of the gap-statistic clustering the results could be slightly different for different segmentation runs. Over-segmentation can lead to inefficient labeling. Thus the next stage of the pipeline is devoted to consolidation of the possibly over-segmented tracks.

3.2 Consolidating Trajectories
The input in this stage is the collection of all tracklets. To consolidate tracklets, we follow the network model and the linear-programming optimization introduced in [9]. The occlusion nodes in the network correspond to the terminal and initial nodes of the different tracklets. The appearance similarity component in the cost function is computed using the feature space metric g_f(·, ·); in particular, if the terminating tracklet τ_i^a is in occlusion with the new-born tracklet τ_j^b, the appearance component of the metric is g_f(T(τ_i^a), S(τ_j^b)). Here T(τ_i^a) and S(τ_j^b) are the corresponding terminal and initial appearance vectors. Furthermore, even after a possible consolidation we
Table 2. Multi-category confusion matrices using: (Left) training with hand-labeled data (HL) and (Right) training with shotgun-labeled data (SL).

                Training on the HL data           Training on the SL data
Vessel          Speedboat   Ferry   Cruise        Speedboat   Ferry   Cruise
Speedboat       34          3       4             35          0       6
Ferry           2           38      1             2           39      0
Cruise          14          1       26            10          0       31
discard the unstable terminal features from the terminating tracklet and the unstable initial segments from the new tracklet.
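A sketch of the Section 3.1 segmentation step, using the space-time metric d of Eq. (6), is given below. The paper chooses the number of tracklets with the gap statistic [16]; in this sketch that choice is simply passed in by the caller, and SciPy's average-linkage agglomerative clustering stands in for the unsupervised clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def spacetime_distances(features, times, gf):
    """Pairwise distances of Eq. (6): feature metric plus the scaled squared time gap."""
    n = len(features)
    G = np.array([[gf(features[i], features[j]) for j in range(n)] for i in range(n)])
    t = np.asarray(times, dtype=float)
    dt = np.abs(np.subtract.outer(t, t))
    c = np.max(G[dt > 0] / dt[dt > 0])      # c = max g_f / |t_i - t_j| over distinct times
    return G + (c ** 2) * dt ** 2

def segment_track(features, times, gf, n_tracklets):
    """Cut one track into tracklets by agglomerative clustering on d."""
    D = spacetime_distances(features, times, gf)
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=n_tracklets, criterion='maxclust')
```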
4 Applications and Experiments
In this section we present the results obtained by applying our framework to: detect errors in tracker output, cleanse tracker output, and label cleansed tracking output so that it can be used for training categorization engines. The experiments were performed on a three-hour-long surveillance video of shipping traffic in an estuary. Due to the width of the river and the shipping lane patterns, the camera is far away from the targets (the target-to-camera distances are in the range of 380 meters to 1100 meters). We used a background-subtraction based tracker to process the video. Due to the low resolution and the dynamic appearance of the river surface, feature vectors using SIFT or HOG descriptors are not applicable to this tracking scenario, since many of the tracked objects are no more than four pixels wide. Instead, we opt to use simpler features in the form of histograms (see Figure 3).

Tracker Errors Detection. The tracker detected ninety-one tracks (24,738 feature vectors). These tracks were tested for consistency using the consistency stage of the Shotgun Pipeline. Thirty-one tracks (accounting for over 12,364 feature vectors) were found inconsistent (objects of clearly different categories were mixed in the same track). The whole consistency test of the original tracks required that a human-in-the-loop look at 96 images (two images per each track that is not automatically consistent; such tracks are automatically flagged).

Hand Labeling Data vs Shotgun. A team of five students (two PhD students in computer vision and three undergraduate students with no prior experience) hand-labeled, independently, the tracking results. All reported difficulties assigning proper labels due to the low image resolution and clutter. Only a very small number of the targets and the feature vectors were hand-labeled unambiguously and correctly. The whole set was processed by the Shotgun Pipeline. After de-construction and consolidation (fully automatic), a human-in-the-loop was asked to examine 105 consolidated tracks. Ultimately 44 tracks were found inconsistent and 94 were found to be consistent. This allowed us to label 10,900 feature vectors at the total cost of examining 304 images.

Hand Labeling Data vs Shotgun in Recognition and Categorization. To test the performance of Shotgun-labeled data in recognition and categorization tasks, we created
Table 3. Statistics of the multi-class classification rates over all categories for the hand-labeled and shotgun-labeled experiments. By definition, the classification rate is the percent of correctly labeled data.

Training Data Set        Mean      Std      Median
Hand-labeled (HL)        85.00%    3.40%    85.40%
Shotgun-labeled (SL)     86.30%    2.60%    86.20%
the following data sets of feature vectors of objects in the following categories: speedboat, ferry, and cruise ship: (i) HL: 285 hand-labeled feature vectors for training (95 per each of the three categories); (ii) Test: a testing set of 123 hand-labeled feature vectors (41 per each of the three categories); (iii) SL: 4680 Shotgun-labeled features (1560 per category) obtained by examining a total of 216 images. The HL and SL training sets were used to train three neural networks to perform binary categorization tasks (a speedboat recognition NN, a ferry recognition NN, and a cruise ship recognition NN). The results using the hand-labeled set HL and the Shotgun-labeled training set SL are shown in Table 1. We define the success rate, SR, as

SR = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}} .   (13)
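As a small worked example, SR for one category can be read off a one-vs-rest reduction of a multi-class confusion matrix such as those in Table 2; the helper below is ours and is not meant to reproduce the exact figures in Table 1, which come from separate binary classifiers.

```python
import numpy as np

def success_rate(confusion, k):
    """Success rate SR (Eq. 13) for category k, treating the confusion matrix
    (rows = true class, columns = predicted class) as a one-vs-rest problem."""
    C = np.asarray(confusion, dtype=float)
    tp = C[k, k]
    fn = C[k].sum() - tp
    fp = C[:, k].sum() - tp
    tn = C.sum() - tp - fn - fp
    return (tp + tn) / (tp + tn + fp + fn)

hl = [[34, 3, 4], [2, 38, 1], [14, 1, 26]]   # HL confusion matrix from Table 2
print(round(success_rate(hl, 0), 3))          # one-vs-rest SR for the speedboat class
```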
To test the performance in a multi-category scenario we trained another NN. The results are shown in Tables 2 and 3. The results indicate that we can replace massive hand-labeling with the less demanding Shotgun labeling while improving the quality of categorization. Our future work centers on the improvement of the selection of iconic images and on developing methods to exploit inconsistent video stream data as negative-response training data.
References 1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. In: Machine Learning, pp. 37–66 (1991) 2. Brodley, C.E., Friedl, M.A.: Identifying mislabeled training data. Jour. Artificial Intelligence Res. 11, 131–167 (1999) 3. Chapelle, O., Zien, A., Schölkopf, B. (eds.): Semi-Supervised Learning. MIT Press, Cambridge (2006) 4. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised on-line boosting for robust tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008) 5. Hauptmann, E.G., Hao Lin, W., Yan, R., Yang, J., Yu Chen, M.: Extreme video retrieval: joint maximization of human and computer performance. In: ACM Multimedia, pp. 385–394. ACM Press, New York (2006) 6. He, D., Zhu, X., Wu, X.: Error detection and uncertainty modeling for imprecise data. In: Proc. 21st International Conference on Tools with Artificial Intelligence, pp. 792–795 (2009) 7. Hoogs, A., Rittscher, J., Stein, G., Schmiederer, J.: Video content annotation using visual analysis and a large semantic knowledgebase. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. II: 327–334 (2003)
8. Jacob, M., Kuscher, A., Plauth, M., Thiele, C.: Automated data augmentation services using text mining, data cleansing and web crawling techniques. In: Proc. IEEE Congress on Services - Part I, pp. 136–143 (2008) 9. Jiang, H., Fels, S., Little, J.: A linear programming approach for multiple object tracking. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 10. Lefort, R., Fablet, R., Boucher, J.-M.: Weakly supervised classification of objects in images using soft random forests. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 185–198. Springer, Heidelberg (2010) 11. Leistner, C., Grabner, H., Bischof, H.: Semi-supervised boosting using visual similarity learning. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2008) 12. Lin, H., Bilmes, J.: How to select a good training-data subset for transcription: Submodular active selection for sequences. In: Proc. 10th Annual Conference of the International Speech Communication Association (2009) 13. Muslea, I.M., Minton, S., Knoblock, C.A.: Active + semi-supervised learning = robust multiview learning. In: Proc. 19th International Conference on Machine Learning, pp. 435–442 (2002) 14. Nguyen, H.T., Smeulders, A.: Active learning using pre-clustering. In: Proc. the 21st International Conference on Machine Learning, pp. 623–630. ACM Press, New York (2004) 15. Qi, G.-J., Hua, X.-S., Rui, Y., Tang, J., Zhang, H.-J.: Two-dimensional active learning for image classification. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) 16. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(2), 411–423 (2001) 17. Venkataraman, S., Metaxas, D., Fradkin, D., Kulikowski, C., Muchnik, I.: Distinguishing mislabeled data from correctly labeled data in classifier design. In: Proc. 16th IEEE Int. Conf. on Tools With Artificial Intelligence, pp. 668–672 (2004) 18. Vijayanarasimhan, S., Grauman, K.: What’s it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2262–2269 (2009) 19. Yan, R., Naphade, M.: Semi-supervised cross feature learning for semantic concept detection in video. In: Proc. IEEE Computer Vision and Pattern Recognition, pp. 657–663 (2005)
Time to Collision and Collision Risk Estimation from Local Scale and Motion Shrinivas Pundlik, Eli Peli, and Gang Luo Schepens Eye Research Institute, Harvard Medical School, Boston, MA {shrinivas.pundlik,eli.peli,gang.luo}@schepens.harvard.edu
Abstract. Computer-vision based collision risk assessment is important in collision detection and obstacle avoidance tasks. We present an approach to determine both time to collision (TTC) and collision risk for semi-rigid obstacles from videos obtained with an uncalibrated camera. TTC for a body moving relative to the camera can be calculated using the ratio of its image size and its time derivative. In order to compute this ratio, we utilize the local scale change and motion information obtained from detection and tracking of feature points, wherein lies the chief novelty of our approach. Using the same local scale change and motion information, we also propose a measure of collision risk for obstacles moving along different trajectories relative to the camera optical axis. Using videos of pedestrians captured in a controlled experimental setup, in which ground truth can be established, we demonstrate the accuracy of our TTC and collision risk estimation approach for different walking trajectories.
1 Introduction Time to collision or time to contact (TTC) is a quantity of interest to many fields, ranging from experimental psychology to robotics. Generally, TTC for two bodies in space is the ratio of the distance between them and their relative speed. In the context of video cameras, TTC can be defined as the time required for an object in the real world to reach the camera plane, assuming that the relative speed remains fixed during that time period. While TTC is the ratio of distance and speed, using a pinhole camera model, it becomes equivalent to the computation of the ratio of an object’s size on an imaging plane to its time derivative. It has been suggested that analogous processing takes place in the human visual system while performing tasks involving TTC computation, such as avoiding collisions or catching a moving object [1-3]. One of the chief advantages of formulating TTC in terms of object dilation over time is that TTC can thus be obtained entirely from image based data, without having to actually measure physical quantities such as distance and velocity. Consequently, the need for complicated and computationally expensive camera calibration processes, 3D reconstruction of the scene, and camera ego-motion estimation is eliminated. Due to its relative simplicity and computational efficiency, the idea of TTC estimation is ideally suited for real-time systems, where quick decisions have to be made in the face of impending collisions. For this reason, computer-vision based TTC estimation approaches can be useful for obstacle avoidance and collision detection by vehicles or individuals.
The ratio of the object size in the image and its rate of expansion has been previously used for estimation of TTC, for example, computing scale changes over a closed contour using image area moments [4], motion field [5], or affine shape parameters [6]. Accurate initialization is a big challenge in using contours for determining the interest region in the image. This points toward a more general problem of accurately determining object size in the image in order to perform TTC estimation. Image segmentation and object recognition algorithms are complex and thus computationally expensive, and erroneous segmentation can lead to highly inaccurate TTC estimates. To overcome the difficulty of object size determination, TTC estimation could be reformulated in terms of motion field and its derivatives [7, 8], image gradients [9, 10], residual motion from planar parallax [11], scaled depth [12], scale invariant feature matching [13], or solving parametric equations of object motion [14]. A number of previous approaches assume that obstacles are planar rigid bodies in motion relative to the camera along its optical axis. Some approaches, such as [9], are more appropriate when an entire plane moves with respect to the camera and produce inaccurate TTC estimations when a smaller rigid body in front of a static background approaches the camera (a more recent version of this work [10] includes object segmentation and multi-scale fusion steps which improve TTC estimation results, but still assumes the objects are rigid bodies). Such assumptions fail in situations where semi-rigid obstacles such as pedestrians are involved. Another challenge facing a typical TTC estimation approach is the case of object motion that is at an angle with the camera axis and not directly towards it. Among the approaches mentioned in this section, very few have dealt with semi-rigid obstacles such as pedestrians, and those that do show results using only virtual reality scenes [13]. In addition to estimating the TTC accurately for a variety of motion trajectories, in applications like collision detection devices it is also important to determine whether an obstacle moving along a trajectory would indeed collide with the camera platform. This leads to the concept of a collision envelope or a safety zone around the camera, and any object trajectory with a potential to penetrate this zone would then be considered risky. In this paper, we present an approach for TTC and collision risk estimation in the case of semi-rigidly moving obstacles using feature points. The novelty of our approach is that the computation of TTC is based on aggregation of local scale change and motion information to obtain a global value for an object. In addition to TTC, the approach can also predict the collision risk for a given object trajectory relative to the camera. This collision risk is the probability of collision, and can be tailored to different collision warning scenarios by setting an acceptable threshold for different applications. We demonstrate the effectiveness of our approach by estimating TTC and collision risk using videos of pedestrians walking along different trajectories captured from an uncalibrated camera. Processing in our approach proceeds in the following manner. Detection and tracking of feature points is performed on the input image sequence. This provides us with the sparse motion information present in the scene.
Scale change computation is performed in the neighborhood of each point feature, and a set of feature points where there is an increase in the local scale between two frames of a sequence is obtained. For this set of feature points, an affine motion model is fitted, leading to a group of features associated with a potential obstacle. The use of feature points and an affine motion model provides flexibility to represent a semi-rigidly moving obstacle. From the
features associated with the obstacle, TTC and collision risk are estimated. The following sections describe the details of our approach and the experimental results.
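One possible implementation of the detection-and-tracking step just outlined is sketched below with OpenCV; the parameter values mirror the settings listed in Section 2 (5x5 window, 3 pyramid levels, 5% quality threshold), while the minimum feature distance and the corner budget are arbitrary choices of ours.

```python
import cv2

def track_features(prev_gray, next_gray, max_corners=500):
    """Shi-Tomasi detection in the first frame followed by pyramidal
    Lucas-Kanade tracking into the next frame; returns matched point pairs."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.05, minDistance=5)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None,
                                              winSize=(5, 5), maxLevel=3)
    ok = status.ravel() == 1
    return pts.reshape(-1, 2)[ok], nxt.reshape(-1, 2)[ok]
```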
2 TTC Estimation Using Feature Points
Feature point tracking forms the basis of our approach [15]. Feature points are effective in quantifying the motion in a local image patch. Unlike dense motion field computation or Scale Invariant Features (SIFT) [16], feature points can be detected and tracked in an efficient manner. Another advantage of using feature points over a dense motion field is that textureless regions, where motion estimates tend to be erroneous, can be avoided. In real-world situations, there is usually enough texture to obtain a sufficient number of reliable feature point trajectories, thus providing some tolerance to the loss of some feature points due to scene changes over time. For a given image sequence (320x240 pixels), we detect and track features over a block of b frames. The detected features are based on the algorithm described in [15] and are known to be more suitable for tracking over an image sequence as compared to Harris corners. We use the pyramidal implementation of the Lucas-Kanade algorithm for feature tracking [17]. The pyramid levels are set at 3. Only those feature points with quality above a threshold value are selected for tracking. The threshold used in this work is set as 5% of the highest quality feature point selected in a given image. The size of the feature window is set at 5x5. Let p_i^{(j)} be the ith feature point in the jth frame and n be the total number of feature points tracked through b frames. Once the features are detected and tracked, we compute the feature point neighborhood using Delaunay triangulation. For every feature point, we now obtain a set of immediate spatial neighbors in the image that share an edge of the triangulation with that feature. Let this set of feature points in the local Delaunay neighborhood of p_i^{(j)} be denoted by D(p_i^{(j)}). Once the feature points are tracked and a local neighborhood is established, the next step is to compute the local scale change for each feature point. As the object approaches the camera, it gradually increases in size. Consider a rigid body moving along the camera's optical axis. As it approaches the camera, the distance between neighboring feature points increases. This is the low-level manifestation of change in scale. Even if the motion does not take place strictly along the optical axis, the amount by which the feature points dilate could be significant (though not quite as much as in the previous case). We use this observation to compute the local scale change between two frames of a sequence, and it is given by
s_i^{(j)} = \frac{\sum_{p_k^{(j)} \in D(p_i^{(j)})} \left( \| p_i^{(j)} - p_k^{(j)} \| - \| p_i^{(j-b)} - p_k^{(j-b)} \| \right)}{\sum_{p_k^{(j)} \in D(p_i^{(j)})} \| p_i^{(j-b)} - p_k^{(j-b)} \|} ,   (1)
the local scale values to obtain a set of feature points that show a significant degree of increase in scale. Let this threshold be denoted by λ_s, such that we are interested in feature points for which s_i^{(j)} > λ_s (λ_s = 0.1 max_i(s_i^{(j)}) was used for this work).
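The local scale change of Eq. (1) and the λ_s thresholding can be sketched as follows, assuming the point correspondences produced by the tracking step above; the Delaunay neighborhoods come from SciPy, and the function names are ours.

```python
import numpy as np
from scipy.spatial import Delaunay

def local_scale_change(pts_prev, pts_cur):
    """Per-feature scale change s_i (Eq. 1) over the Delaunay neighborhood of
    each point; pts_prev / pts_cur are (n, 2) arrays for frames j-b and j."""
    tri = Delaunay(pts_cur)
    indptr, indices = tri.vertex_neighbor_vertices
    s = np.zeros(len(pts_cur))
    for i in range(len(pts_cur)):
        nbrs = indices[indptr[i]:indptr[i + 1]]
        d_cur = np.linalg.norm(pts_cur[i] - pts_cur[nbrs], axis=1)
        d_prev = np.linalg.norm(pts_prev[i] - pts_prev[nbrs], axis=1)
        s[i] = (d_cur - d_prev).sum() / d_prev.sum()
    return s

def expanding_features(s, frac=0.1):
    """Features with a significant scale increase: s_i > lambda_s = 0.1 * max_i(s_i)."""
    return np.where(s > frac * s.max())[0]
```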
Fig. 1. Frames 50, 95, and 120 of a sequence in which a person walks approximately along the optical axis. Based on affine motion and local scale change, the point features are grouped as those belonging to the moving person (white diamonds) and the background (black asterisks).
Once we obtain a set of feature points undergoing local scale increase, we process the motion information associated with these feature points. First, a Random Sample Consensus (RANSAC) algorithm is applied to this set of feature points to perform grouping based on affine motion. For a feature group, this also serves as an outlier rejection step. The next step is to include ungrouped feature points (excluded in the previous scale-based thresholding step) into the newly-formed feature groups based on their motion compatibility. This step results in a stable set of feature points, denoted by F^{(j)}, that are associated with a moving body. Fig. 1 shows three frames of our test sequence in which a pedestrian approaches the camera. The feature points F^{(j)} associated with the walking person and the background are overlaid. Although the above approach is designed to handle multiple groups of feature points, in this paper we focus on a single dominant group that belongs to the pedestrian's body for TTC computation. The TTC value for each point-feature neighborhood associated with the moving body is now given by t_i^{(j)} = 1 / s_i^{(j)}. For a feature group, the TTC value t^{(j)} (for the jth frame) is the median of the TTC values obtained from all the feature points belonging to the group. The TTC value obtained here is in terms of frames remaining before collision. It can be converted to seconds using the video frame rate. For this work, we capture test videos at 30 frames per second. One way of evaluating the collision risk for an obstacle would be to assign a threshold value for TTC such that estimates below this value represent a high collision risk. But this does not take into consideration the point of intersection of the obstacle trajectory with the camera plane, which could be too far from the camera center to be considered a collision risk. In order to evaluate the risk posed by collisions from different obstacle trajectories, we propose an additional measure, which we call the collision risk factor. This factor is based on the ratio of the local scale change and local motion between two frames. For an obstacle moving along the camera optical axis this ratio would be high, while for motion along trajectories that are at increasing angles with the camera optical axis this ratio would be smaller. For the ith point feature, the ratio of local scale change and motion is given by d_i^{(j)} = s_i^{(j)} / v_i, where
v_i = \| p_i^{(j)} - p_i^{(j-b)} \| is the magnitude of the motion vector of the ith feature point between the jth and (j−b)th frames. The collision risk associated with an object is the local scale-change-to-motion ratio for the corresponding feature group. For an obstacle in the jth frame of an image sequence, the collision risk factor is given by

c^{(j)} = \frac{1}{m} \sum_{\forall p_i^{(j)} \in F^{(j)}} d_i^{(j)} ,   (2)
where m is the number of features belonging to the moving object. The collision risk factor in Eq. (2) captures the idea that as an object approaches the camera along (or close to) the optical axis, large local scale change values lead to larger values of the ratio d than for other obstacle trajectories (where the magnitude of lateral motion is typically larger than the local scale change). The value of c^{(j)} in such situations increases while the TTC value decreases. Hence, the collision risk factor and TTC, when combined, present a robust measure of collision risk. It should be noted that the collision risk is a purely image-based quantity (no physical units), and along with TTC it can be used for issuing collision warnings by setting an appropriate threshold.
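Putting the two measures together for one feature group (e.g., the RANSAC-grouped pedestrian features, whose grouping step is not re-derived here):

```python
import numpy as np

def ttc_and_risk(s, motion_vecs, group_idx):
    """Group TTC t^(j) (median of 1 / s_i, in units of the b-frame step used to
    compute s) and collision risk factor c^(j) (Eq. 2) for the feature group."""
    s_g = s[group_idx]
    v_g = np.linalg.norm(motion_vecs[group_idx], axis=1)   # |p_i^(j) - p_i^(j-b)|
    return float(np.median(1.0 / s_g)), float(np.mean(s_g / v_g))
```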
3 Experimental Results We present experimental results of testing our approach using videos of 2 pedestrians walking along different predefined trajectories, acting as potential obstacles for which TTC and collision risk are estimated. The goal of such an experimental setup is to simulate real world conditions as closely as possible without resorting to the use of synthetic sequences, while obtaining the ground truth for quantitative comparison. 3.1 Experimental Setup Fig. 2-(a) shows the detailed schematic of the experimental setup. It consists of two cameras capturing videos in a large room, approximately 20x80 feet. Camera 1 is set up at location C1 along baseline1 to capture the videos to be processed by our TTC estimation algorithm. Another baseline (baseline2) is established 204 inches (17 feet) away from baseline1. A person walks along the 11 trajectories defined by lines R5-L5 to L5-R5, passing through a center point C, which is about 8.5 feet away from Camera 1. On each side of the optical axis of Camera 1, the five trajectories make increasing angles of 10, 20, 30, 37.5, and 45 degrees with the center line C-C1 (see Fig. 2-(a)). While capturing the videos, the trajectory lines were not explicitly drawn on the floor. Only the points marked on the two baselines and the center point C were placed on the ground and these markers were used for guidance by the pedestrians. In order to obtain the ground truth world positions of the pedestrians with respect to Camera 1, we captured profile views simultaneously from Camera 2 (both the cameras are synchronized). The perpendicular distance between the line C-C1 and Camera 2 was about 58 feet. A larger distance minimizes the effect of depth for different trajectories and ensures a sufficiently large camera FOV to cover the entire sequence of walks. All the physical distances in this setup were obtained from a standard measuring tape.
Fig. 2. (a): A schematic (top view) of the experimental setup. (b): For a pedestrian on the C-C1 trajectory, plots of pedestrian position as seen from Camera 2, world distance from Camera 1, instantaneous velocity (computed over 10 frames), and the derived ground truth TTC values.
In order to obtain a value of image distance for the corresponding world distance, we measured known distances on the centerline C-C1. We also determined the FOV and the angle per pixel for Camera 2. Based on the known and computed quantities from the experimental setup, and the a priori knowledge of the trajectory being currently traced, we derived a model to estimate the position of the pedestrian in the real world with respect to Camera 1 using Camera 2 image positions. The mathematical details of this procedure are left out of this paper due to space constraints. We tested our ground truth measurement procedure by estimating the world locations (with respect to Camera 1) of 10 known world points in the experimental setup. We obtained a maximum error of about 5 inches and an average error of about 2.6 inches. The estimation error progressively increased for points with increasing distance from the centerline and baseline1. For the locations where ground truth measurement would significantly affect the accuracy of the validation process (such as points near baseline1), the mean error in ground truth measurement was 1.5 inches. We manually labeled the location of the pedestrian in images obtained from Camera 2 by marking 6 points along the upper body profile and computing the mean x coordinate value. The ground truth computation procedure outputs the corresponding world location and instantaneous velocity (obtained by differentiating the distance over 10 frames or 1/3 of a second). Once the distance from baseline1 and the instantaneous velocity are known, ground truth TTC values for each frame of the sequence can be computed. Fig. 2-(b) shows the plots of the intermediate quantities involved in computing the ground truth TTC values. The figure shows that the image and world positions for C-C1 trajectory are linear for the most part and the velocity recorded is approximately constant (40 inches/sec.). The small variations in the velocity are due to variable walking speed, error in labeling procedure, and due to error in the ground truth computation procedure. The velocity (along the walking direction) decreases sharply in the later part of the trajectory because the pedestrian slows down as he approaches the camera, while the TTC values (time to baseline 1) increase for this phase of the walk.
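A sketch of the ground-truth computation just described (the array names and units are ours): the per-frame ground-truth TTC is the remaining distance to baseline1 divided by the instantaneous velocity, differenced over 10 frames (1/3 of a second).

```python
import numpy as np

def ground_truth_ttc(dist_to_baseline1, fps=30, diff=10):
    """Ground-truth TTC (seconds) per frame from a per-frame array of pedestrian
    distances to baseline1 (e.g., in inches), with velocity estimated over 10 frames."""
    d = np.asarray(dist_to_baseline1, dtype=float)
    vel = (d[:-diff] - d[diff:]) * fps / diff       # approach speed, inches per second
    return d[diff:] / np.maximum(vel, 1e-6)         # clip to avoid division by ~0
```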
3.2 TTC Estimation Results TTC estimation results of our algorithm along with the corresponding ground truth values for three trajectories (out of possible 22 for both the pedestrians) are shown in Fig. 3-(a),(b), and (c). Each plot is also superimposed with some frames of the corresponding sequence to show the change in the appearance of the pedestrian over time. The plot in Fig. 3-(a) shows the case where the person approaches the camera head-on along the C-C1 trajectory. Fig. 3-(b) and (c) show the results when the pedestrian walks with an angle of approximately 10 and 30 degrees with the optical axis, respectively. The estimated TTC values follow the same trend as the ground truth values. More importantly, at lower TTC ranges (as the relative distance between the pedestrian and the camera decreases), the estimates follow the ground truth more closely. This is a desired property because the estimates need to be more accurate when the obstacle is perceived to be close to the camera. Our algorithm can also handle variable relative velocity. At the very end of the C-C1 trajectory, the person slows down before coming to a halt. The estimated TTC values start increasing corresponding to this change. The plot in Fig. 3-(d) shows the mean collision risk per feature point (belonging to the pedestrian) measured over the last 1 second of the run (i.e., 1 sec before the pedestrian stops just a few inches away from the camera) for all the trajectories for both pedestrians. The curves show increased risk for the C-C1 trajectory, as desired.
Fig. 3. Plots of TTC estimates (smoothed with a temporal window of 3 frames), the ground truth values and three representative frames demonstrating pedestrian position for trajectories C-C1 for pedestrian 1 (a), L1-R1 (b) and R3-L3 (c) for pedestrian 2. A plot of collision risk associated with the obstacle for the last 1 second for each trajectory is shown in (d).
Also, the values decrease progressively on either side of the central line as the trajectory angles increase, indicating a lower risk of collision when an obstacle moves along these trajectories. Since the collision risk measured in Eq. (2) is normalized by the number of feature points associated with the object, the plot also shows that the collision risk is not directly dependent on the size of the obstacle in the image (the area covered in the image as the pedestrian comes close to the camera). Also, Fig. 3 shows that even though the TTC values for different trajectories converge to relatively close values at the end of the run, the corresponding collision risk values are significantly different.
Fig. 4. (a): Plot of TTC estimation error (mean and std. error) per trajectory. (b): Plot of TTC estimation error (mean and std. error) for all trajectories for different TTC ranges.
Fig. 4 shows the performance of the proposed approach by providing some quantitative analysis regarding the error in TTC estimation. Fig. 4-(a) shows the mean estimation error and the standard error for each trajectory. It can be seen that for the trajectories closer to the optical axis of Camera 1, the TTC is overestimated (the mean error is about 0.5 seconds). It should be noted that we left out TTC values that were more than 3.5 s while generating Fig. 4, as these translate to a distance between the camera and the pedestrian larger than about 13 feet. The estimates at locations farther away from the camera tend to be less reliable since the apparent scale change is smaller compared to that obtained at locations closer to the camera. To get a better idea about the performance of our algorithm, Fig. 4-(b) shows the mean estimation error and standard error for all the trajectories for different TTC ranges. It can be seen that the mean and standard error of the estimates are lower for lower TTC ranges for both pedestrians and become progressively larger for higher TTC ranges (i.e., larger distances from the camera). This plot quantifies our earlier claim that the estimates of our approach are more accurate as the obstacle nears the camera. For comparison, we also implemented the direct method presented in [9] for the case of motion along the optical axis of a plane perpendicular to the optical axis (with 8x8 block averaging). For our experimental setup, only the C-C1 trajectory loosely fits both criteria. Combining the error in the TTC estimates from both pedestrians, a mean and standard error of 4.9 s and 1.3 s were obtained, as compared with our values of 0.3 s and 0.4 s, respectively (seen in Fig. 4-a).
4 Conclusions
We have presented an approach for TTC and collision risk estimation from local scale change and motion information in the scene using feature tracking. The proposed approach can accurately estimate TTC and collision risk for semi-rigid obstacles moving along different trajectories relative to the camera's optical axis with varying speeds, especially when they are close to the camera. The collision risk factor, which is insensitive to the obstacle's size in the image, is highest for the walking trajectory along the camera optical axis and progressively reduces for the others. Even though the use of feature points affords us a lot of flexibility in terms of handling semi-rigid motion along arbitrary trajectories, there are certain limitations of the current approach. The brightness constancy assumption is implicit in feature tracking. Also, the approach is not robust in tackling scenarios where pedestrians suddenly appear in the scene very close to the camera. The current experimental setup did not allow us to test a variety of different potential obstacles or moving-camera scenarios (along with in-plane and out-of-plane rotations), since our goal was to thoroughly evaluate the TTC and collision risk estimation using controlled experimentation. In spite of this, the results presented here show that our approach has the potential to be effective in complex real-world scenarios. Future work includes extension of the current algorithm to handle more complex real-world scenarios involving multiple obstacles, a moving camera, and variable lighting conditions.

Acknowledgements. This work was supported in part by DoD grant W81XWH-10-10980, DM090201 and by NIH grant R01 EY12890.
References 1. Lee, D.N.: A theory of the visual control of braking based on information about time-to-collision. Perception 5, 437–459 (1976) 2. Tresilian, J.R.: Visually timed action: time-out for 'tau'? Trends in Cognitive Sciences 3, 301–310 (1999) 3. Luo, G., Woods, R., Peli, E.: Collision judgment when using an augmented vision head mounted display device. Investigative Ophthalmology and Visual Science 50, 4509–4515 (2009) 4. Cipolla, R., Blake, A.: Surface orientation and time to contact from divergence and deformation. In: Sandini, G. (ed.) ECCV 1992. LNCS, vol. 588, pp. 187–202. Springer, Heidelberg (1992) 5. Ancona, N., Poggio, T.: Optical flow from 1d correlation: Application to a simple time to crash detector. International Journal of Computer Vision 14, 131–146 (1995) 6. Alenya, G., Negre, A., Crowley, J.L.: A Comparison of Three Methods for Measure of Time to Contact. In: IEEE/RSJ Conference on Intelligent Robots and Systems, pp. 1–6 (2009) 7. Meyer, F.G.: Time-to-collision from first order models of the motion field. IEEE Transactions on Robotics and Automation 10, 792–798 (1994) 8. Camus, T.A.: Calculating time-to-contact using real time quantized optical flow. Max-Planck-Institut für Biologische Kybernetik Technical Report (1995)
9. Horn, B.K.P., Fang, Y., Masaki, I.: Time to Contact Relative to a Planar Surface. In: IEEE Intelligent Vehicle Symposium, pp. 68–74 (2007) 10. Horn, B.K.P., Fang, Y., Masaki, I.: Hierarchical framework for direct gradient-based timeto-contact estimation. In: IEEE Intelligent Vehicle Symposium, pp. 1394–1400 (2009) 11. Lourakis, M., Orphanoudakis, S.: Using planar parallax to estimate the time-to-contact. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 640–645 (1999) 12. Colombo, C., DelBimbo, A.: Generalized bounds for time to collision from first order image motion. In: IEEE International Conference on Computer Vision, pp. 220–226 (1999) 13. Negre, A., Braillon, C., Crowley, J.L., Laugier, C.: Real time time to collision from variation of intrinsic scale. In: Proceedings of the International Symposium on Experimental Robotics, pp. 75–84 (2006) 14. Muller, D., Pauli, J., Nunn, C., Gormer, S., Muller-Schneiders, S.: Time to Contact Estimation Using Interest Points. In: IEEE Conference on Intelligent Transportation Systems, pp. 1–6 (2009) 15. Shi, J., Tomasi, C.: Good Features to Track. In: IEEE Conference On Computer Vision And Pattern Recognition, pp. 593–600 (1994) 16. Lowe, D.: Distinctive image features from scale invariant keypoints. International Journal of Computer Vision 60, 75–84 (2004) 17. Bouguet, J.Y.: Pyramidal implementation of the lucas-kanade feature tracker (2000)
Visual Tracking Based on Log-Euclidean Riemannian Sparse Representation
Yi Wu1, Haibin Ling1, Erik Blasch2, Li Bai3, and Genshe Chen
1 Center for Data Analytics and Biomedical Informatics, Computer and Information Science Department, Temple University, Philadelphia, PA, USA
2 Air Force Research Lab/SNAA, OH, USA
3 Electrical and Computer Engineering Department, Temple University, Philadelphia, PA, USA
{wuyi,hbling,lbai}@temple.edu, [email protected]
Abstract. Recently, sparse representation has been utilized in many computer vision tasks and adapted for visual tracking. Sparsity-based visual tracking is formulated as searching for candidates with minimal reconstruction errors from a template subspace with sparsity constraints in the approximation coefficients. However, an intensity template is easily corrupted by noise and not robust for target tracking under a dynamic environment. The recently proposed covariance region descriptor has been proven robust and versatile for a modest computational cost. Further, the covariance matrix enables efficient fusion of different types of features, where the spatial and statistical properties as well as their correlation are characterized, and its dimension is small. Although the covariance matrix lies on Riemannian manifolds, its log-transformation can be measured on a Euclidean subspace. Based on the covariance region descriptor and using the sparse representation, we propose a novel tracking approach on the Log-Euclidean Riemannian subspace. Specifically, the target region is characterized by a covariance matrix which is then log-transformed from the Riemannian manifold to the Euclidean subspace. After that, the target tracking problem is integrated under a sparse approximation framework, where the sparsity is achieved by solving an ℓ1-regularization problem. Then the candidate with the smallest approximation error is taken as the tracked target. For target propagation, we use the Bayesian state inference framework, which propagates sample distributions over time using the particle filter algorithm. To evaluate our method, we have collected several video sequences and the experimental results show that our tracker achieves robust and reliable target tracking.
1 Introduction
Visual tracking is an important problem in computer vision and has many applications in surveillance, robotics, human computer interaction, and medical image analysis [9]. The challenges in designing a robust visual tracking algorithm are mainly caused by the contamination of target appearance and shapes by the presence of noise, occlusion, varying viewpoints, background clutter, and illumination changes. Many tracking algorithms have been proposed to overcome these difficulties. For example, in [27], the sum of squared difference (SSD) has been used as a cost function for the tracking problem. Later, a gradient descent algorithm was used to find the target template with the
minimum cost [29]. Also, the mean-shift algorithm was adopted to find the optimal solution [28]. Another view is to treat tracking as a state sequence estimation problem and use sequential Bayesian inference coupled with Monte Carlo sampling for the solution [1]; extensions include the incorporation of an appearance-adaptive model [31]. Recently, motivated by the breakthrough advances in compressive sensing [26,15], Mei and Ling [3] introduced the sparse representation for robust visual tracking. The tracking problem is formulated as finding the ℓ1 minimization of a sparse representation of the target candidate using templates. The advantage of using the sparse representation lies in its robustness to background clutter and occlusions. A tracking candidate is approximated sparsely as a linear combination of target templates and trivial templates, where each trivial template has only one nonzero element. The sparsity is achieved by solving an ℓ1 minimization problem with non-negativity constraints during tracking. Inspired by this work, further investigation has been conducted in [18,19] for improvement in different aspects. For example, in [18] the group sparsity is integrated and high-dimensional image features are used for improving tracking robustness. In [19], the less expensive ℓ2 minimization is used to bound the ℓ1 approximation error and then speed up the particle resampling process without sacrificing accuracy. Other improvements support multisensor fusion [30] and wide-area imagery [16]. The sparse-representation appearance models adopted in the above-mentioned tracking approaches, while straightforward, are sometimes sensitive to environmental variations as well as pose changes. As a result, these trackers lack a competent object description criterion that captures both statistical and spatial properties of the object appearance. The covariance region descriptor [13] was proposed to characterize the object appearance. Such a descriptor captures the correlations among extracted features inside an object region and is robust to variations in illumination, view, and pose. The covariance descriptor has been applied to many computer vision tasks, such as object classification [20,21], human detection [22,23], face recognition [24], action recognition [25], and tracking [14,8,17]. In the recently proposed covariance tracking approach [14], a brute-force search is adopted and the model is updated by the Riemannian mean under the affine-invariant metric [12]. Then the probabilistic covariance tracking approach was proposed in [6] and extended to a multi-part based representation in [7]. Further, to reduce the computational cost of the covariance model update, an incremental covariance tensor learning approach was proposed in [8]. Based on the Log-Euclidean Riemannian metric [10], Li et al. [11] presented an online subspace learning algorithm which models the appearance changes by incrementally learning an eigenspace representation for each mode of the target through adaptively updating the sample mean and eigenbasis. There are three main motivations for our work: (1) the prowess of the covariance descriptor as an appearance model [13] to address feature correlations, (2) the effectiveness of particle filters [1,4] to account for environmental variations, and (3) the flexibility of sparse representation [3] to utilize compressed sensing. Based on the covariance region descriptor and using the sparse representation, we propose a novel tracking approach on the Log-Euclidean Riemannian subspace.
Specifically, the target region is characterized by a covariance matrix which is then log-transformed from the Riemannian manifold
to the Euclidean subspace. After that, the target tracking problem is integrated under a sparse approximation framework. The sparsity is achieved by solving an ℓ1-regularized least-squares problem. Then the candidate with the smallest projection error is taken as the tracking target. Finally, tracking is continued using a Bayesian state inference framework in which a particle filter is used to propagate sample distributions over time. To evaluate our method, we tested the proposed approach on several sequences and observed promising tracking performances (i.e. minimum squared error) in comparison with several other trackers. The rest of the paper is organized as follows. In the next section the proposed Log-Euclidean Riemannian Sparse Representation (LRSR) is discussed. After that, the particle filter algorithm is reviewed in Section 3. Experimental results are reported in Section 4. We conclude this paper in Section 5.
2 Log-Euclidean Riemannian Sparse Representation
The covariance matrix enables efficient fusion of different types of features in small dimensionality. The spatial and statistical properties as well as their correlation are characterized in this descriptor. Although the covariance matrix lies on Riemannian manifolds, its log-transformation can be measured on a Euclidean subspace. Based on the covariance region descriptor and using the sparse representation, we present a novel tracking approach based on the proposed Log-Euclidean Riemannian Sparse Representation (LRSR).
2.1 Covariance Descriptor
The covariance region descriptor [13] efficiently fuses heterogeneous features while residing in a low-dimensional feature space. In particular, for a window that contains an object of interest, the object is described using the covariance matrix of features collected in the window. Such a descriptor naturally captures the statistical properties of the feature distribution as well as the interactions between different features. In the following we first review the covariance descriptor used in the LRSR tracking method. Let I be an image defined on a grid Λ of size W × H, and F ∈ R^{W×H×d} be the d-dimensional feature image extracted from I: F(x, y) = Φ(I, x, y), where Φ extracts from image I various features such as color, gradients, filter responses, etc. For a given rectangle R ⊂ Λ, let {f_i}_{i=1}^{N} be the d-dimensional feature points inside R. Then, the appearance of R is represented by the d × d covariance matrix by the following calculation:
C(R) = (1 / (N − 1)) Σ_{i=1}^{N} (f_i − μ)(f_i − μ)^T ,
where N is the number of pixels in the region R and μ is the mean of the feature points. In our work on visual tracking, we define the feature extraction function Φ(I, x, y) as
Φ(I, x, y) = (x, y, R(x, y), G(x, y), B(x, y), I_x(x, y), I_y(x, y)) ,
where (x, y) is the pixel location, R, G, B indicate the three color channels and I_x, I_y are the intensity gradients. As a result, we achieve a covariance descriptor as a 7 × 7 symmetric matrix (i.e., d = 7 in our case). The element (i, j) of C represents the correlation between feature i and feature j. When the extracted d-dimensional feature includes the pixel's coordinate, the covariance descriptor encodes the spatial information of the features. With the help of integral images, the covariance descriptor can be calculated efficiently [13]. When d(d+1)/2 integral images are constructed, the covariance descriptor of any rectangular region can be computed independent of the region size.
2.2 Riemannian Geometry for Covariance Matrix
The covariance matrices are well known to form a connected Riemannian manifold1, which can be locally approximated by a hyperplane. The Log-Euclidean Riemannian (LR) metric [10] was recently introduced for Symmetric Positive-Definite (SPD) matrices. Under the LR metric, SPD matrices lie in a Lie group G and the tangent space at the identity element in G forms a Lie algebra H, which is a vector space. As a result, the distance between two points X and Y on the manifold under the Log-Euclidean Riemannian metric can be easily calculated as ‖log(X) − log(Y)‖. The Riemannian mean of several elements on the manifold is simply an arithmetic mean of matrix logarithms. Such a representation has been previously used for visual tracking [14] and has been combined with a particle filter [8] for further robustness. In the following, we further extend the framework by integrating the sparsity constraint.
2.3 Target Representation
Inspired by recent work on sparse representation for visual tracking [3], we use a linear subspace representation to model the appearance of a tracking target. We follow the notation in [3] whenever applicable. However, instead of working on the appearance directly, we use the covariance representation described above. Let y ∈ R^D, D = d × d (we concatenate the elements of the log-transformed covariance matrix into a vector) be the appearance of a tracking target. The appearance is approximated by using a low-dimensional subspace spanned by a set of target templates T,
y ≈ Ta = a_1 t_1 + a_2 t_2 + · · · + a_n t_n ,
(1)
where T = [t_1, · · · , t_n] ∈ R^{D×n} contains the n target templates. In addition, a = (a_1, a_2, · · · , a_n) ∈ R^n are the approximation coefficients. At initialization, the first target template is manually selected from the first frame and the remaining target templates are created by perturbing the corner points of the first template by one pixel in four possible directions in the first frame. Thus, we can create all the target templates (10 in our experiments) at initialization.
We enforce non-singularity for these matrices, which is common for image patches since feature vectors from different regions are rarely identical.
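To make the construction of Sections 2.1–2.3 concrete, the following sketch (our illustration, not code from the authors) computes the 7 × 7 covariance descriptor of an image region and its log-Euclidean vectorization y ∈ R^D used as the template/candidate representation; function names and the small regularization term are our assumptions.

```python
import numpy as np
from scipy.linalg import logm

def feature_image(img_rgb):
    """Stack the 7-d features (x, y, R, G, B, Ix, Iy) for every pixel."""
    h, w, _ = img_rgb.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    gray = img_rgb.astype(np.float64).mean(axis=2)
    iy, ix = np.gradient(gray)                       # intensity gradients
    feats = [xs, ys, img_rgb[..., 0], img_rgb[..., 1], img_rgb[..., 2], ix, iy]
    return np.stack([f.astype(np.float64) for f in feats], axis=-1)   # H x W x 7

def covariance_descriptor(F, rect):
    """Covariance matrix C(R) of the features inside rect = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    pts = F[y0:y1, x0:x1].reshape(-1, F.shape[-1])   # N x d feature points
    mu = pts.mean(axis=0)
    diff = pts - mu
    C = diff.T @ diff / (len(pts) - 1)               # d x d covariance
    return C + 1e-6 * np.eye(C.shape[0])             # keep the matrix non-singular

def log_euclidean_vector(C):
    """Map the SPD matrix to the Euclidean (Lie-algebra) space and flatten to R^(d*d)."""
    return logm(C).real.reshape(-1)
```

The vectorized matrix logarithm of a candidate region then plays the role of y in (1).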
2.4 Target Inference through ℓ1 Minimization
There are many ways to solve the linear approximation in (1). Traditional solutions using least-squares approximation have been shown to be less impressive [3,18] than the sparsity-constrained version. Intuitively, sparsity has recently been intensively exploited for discriminability and robustness against appearance corruption [5]. Then, following the work in [3], (1) is reformulated to take approximation residuals into account,
y = [T, I] [a; e] ≜ T⁺c ,    (2)
where I is a D × D identity matrix containing D so-called trivial templates. Each trivial template has only one nonzero element, which encodes the corruption at the corresponding pixel location. Accordingly, e = (e_1, e_2, · · · , e_D) ∈ R^D are called trivial coefficients, T⁺ = [T, I] ∈ R^{D×(n+D)} and c = [a; e] ∈ R^{n+D}. The trivial templates and coefficients are included to deal with image contaminations such as occlusion. Note that although we use the same term "template" as in [3], it means the D-dimensional covariance representation, which is different from the original appearance template. In the above approximation, the residual error implies how likely a candidate y comes from previously learned knowledge (i.e. T). In particular, the approximation can represent a target candidate through a linear combination of the template set composed of both target templates and trivial templates. Usually the number of target templates is much smaller than D, which is the number of trivial templates. The intuition is that a good candidate should have a sparse representation, which is usually not true for bad templates. The sparsity leads to a sparse coefficient vector and the coefficients corresponding to trivial templates are close to zero. These trivial coefficients can also be used to model occlusion and cluttering, as previously experimented for face recognition. Now the task is to solve the system in (2) with sparse solutions. Toward this end, an ℓ1-regularization term is added to achieve the sparsity. In other words, the problem turns out to be an ℓ1-regularized least-squares problem
min_c ‖T⁺c − y‖²₂ + λ‖c‖₁ ,    (3)
where ‖·‖₁ and ‖·‖₂ indicate the ℓ1 and ℓ2 norms, respectively. The solution to the above ℓ1 regularization, denoted as ĉ = [â; ê], can then be used to infer the likelihood of y being a tracking target. In particular, the candidate with the minimum reconstruction error ε(y) is chosen:
ε(y) = ‖y − Tâ‖²₂ .    (4)
Note that in the above error measurement the coefficients from the trivial templates are ignored, since they represent corruptions such as noise or occlusion. Such reconstruction errors will also be used for calculating the observation likelihood used for propagating the tracking over frames (see the next section). To solve (3), we use the recently proposed approach in [2], which sequentially minimizes a quadratic local surrogate of the cost. With an efficient Cholesky-based implementation, the algorithm has been shown to be at least as fast as approaches based on
Algorithm 1. LRSR tracking
1: At t = 0, initialize the template set T
2: Initialize particles
3: for t = 1, 2, . . . do
4:    for each sample i do
5:       Propagate particle x_t^i with respect to the proposal q(x_t | x_1:t−1, y_1:t) = p(y_t | x_t)
6:       Compute the transformed target candidate y_t^i from x_t^i
7:       Compute the covariance representation of each candidate
8:       Calculate the likelihood p(y_t^i | x_t^i) via (4), (5)
9:    end for
10:   Locate the target based on the Maximum Likelihood estimation
11:   Resample particles
12: end for
soft thresholding, while achieving a higher accuracy. Mairal et al.'s algorithm is based on stochastic approximations, converges almost surely to a stationary point of the objective function, and is significantly faster than previous approaches, such as the one used in [3].
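As an illustration of the ℓ1-regularized least-squares step (3) and the reconstruction error (4) — not the authors' implementation, which uses the online solver of [2] — a plain ISTA (iterative soft-thresholding) sketch in NumPy could look as follows; the function names, step size and iteration count are our assumptions.

```python
import numpy as np

def ista_l1(T_plus, y, lam, n_iter=200):
    """Minimize ||T_plus @ c - y||_2^2 + lam * ||c||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(T_plus, 2) ** 2          # squared spectral norm (step-size scale)
    c = np.zeros(T_plus.shape[1])
    for _ in range(n_iter):
        grad = T_plus.T @ (T_plus @ c - y)      # (half) gradient of the quadratic term
        z = c - grad / L                        # gradient step
        c = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * L), 0.0)  # soft threshold
    return c

def reconstruction_error(T, a_hat, y):
    """Eq. (4): error from the target-template coefficients only (trivial ones ignored)."""
    return np.sum((y - T @ a_hat) ** 2)

# usage sketch: T is the D x n target-template matrix, np.eye(D) the trivial templates
# T_plus = np.hstack([T, np.eye(D)]); c_hat = ista_l1(T_plus, y, lam=0.01)
# eps = reconstruction_error(T, c_hat[:T.shape[1]], y)
```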
3 Combination with the Particle Filter Framework
Following the work in [3] and [8], we combine the proposed sparse covariance representation with the particle filter (PF) [1] for visual tracking. The PF models visual tracking as a Bayesian sequential inference problem, where the task is to find the tracking state sequence based on the observation sequence. The PF uses a Bayesian sequential importance sampling technique to approximate the posterior distribution of the state variables of a dynamic system. Such distributions are usually too complicated to be modeled easily with a Gaussian assumption. A PF provides a convenient framework for estimating and propagating the posterior probability density function of state variables regardless of the underlying distribution. The framework mainly consists of two steps: prediction and update, as described below. We use x_t to denote the target state that describes the location and pose of a tracking target at time t, and we use y_t for the observation at time t. Furthermore, we denote p(x_t | y_1:t−1) as the prediction distribution of x_t given all available observations (i.e., appearances for tracking) y_1:t−1 = {y_1, y_2, · · · , y_t−1} up to time t − 1. The distribution can be recursively computed as
p(x_t | y_1:t−1) = ∫ p(x_t | x_t−1) p(x_t−1 | y_1:t−1) dx_t−1 .
Once the prediction and the observation y_t at time t are available, the state distribution can be updated using the Bayes rule
p(x_t | y_1:t) = p(y_t | x_t) p(x_t | y_1:t−1) / p(y_t | y_1:t−1) ,
Fig. 1. Tracking comparison results of different algorithms on sequence VIVID (#8, #65, #89, #114, #158). The results of CPF, L1 and LRSR are shown in the rows from top to bottom, respectively.
where p(y_t | x_t) denotes the observation likelihood. The posterior p(x_t | y_1:t) is approximated by a finite set of n_p weighted samples {(x_t^i, w_t^i) : i = 1, · · · , n_p}, where w_t^i is the importance weight for sample x_t^i. The samples are drawn from the so-called proposal distribution q(x_t | x_1:t−1, y_1:t) and the weights of the samples are updated according to the following formula:
w_t^i = w_{t−1}^i · p(y_t | x_t^i) p(x_t^i | x_{t−1}^i) / q(x_t | x_1:t−1, y_1:t) .
To avoid degeneracy, resampling is applied to generate a set of equally weighted particles according to their importance weights. In the above two steps, we need to model the observation likelihood and the proposal distribution, which are based on the proposed sparse covariance representation. Specifically, for the observation likelihood p(y_t | x_t), the reconstruction error ε(y_t) is used and we have
p(y_t | x_t) ∝ exp(−γ ε(y_t)) ,    (5)
where γ is a constant controlling the shape of the distribution. A common choice of the proposal density is q(x_t | x_1:t−1, y_1:t) = p(y_t | x_t). Consequently, the weights become the local likelihood associated with each state, w_t^i ∝ p(y_t | x_t^i). Finally, Maximum Likelihood (ML) estimation is performed to estimate the current target state. An outline of our tracking algorithm is shown in Algorithm 1. Note that the difference with the previously proposed ℓ1-tracker is mainly in the target representation.
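A compact sketch of one particle-filter iteration as described above (our illustration; the random-walk state model, the number of particles and γ are assumptions, and `candidate_descriptor` / `sparse_reconstruction_error` are hypothetical helpers standing in for the earlier sketches):

```python
import numpy as np

def track_step(particles, weights, frame, templates, gamma=5.0, motion_std=4.0):
    """One PF iteration: propagate, weight by the sparse reconstruction error, resample.

    particles : (n_p, 2) array of target-centre states (x, y); a richer affine state
                could be used instead.
    """
    n_p = len(particles)
    # 1. propagate with a random-walk motion model (proposal taken as the prior)
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # 2. weight each particle by the observation likelihood (5)
    for i, (x, y) in enumerate(particles):
        y_vec = candidate_descriptor(frame, x, y)               # assumed helper
        eps = sparse_reconstruction_error(templates, y_vec)     # eq. (4), assumed helper
        weights[i] = np.exp(-gamma * eps)
    weights = weights / weights.sum()
    # 3. ML estimate: the particle with the largest likelihood
    best = particles[np.argmax(weights)]
    # 4. resample to a set of equally weighted particles
    idx = np.random.choice(n_p, size=n_p, p=weights)
    return particles[idx], np.full(n_p, 1.0 / n_p), best
```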
4 Experiments
Our LRSR tracker was applied to many sequences. Here, we just present some representative results. We compared the proposed tracker with two other trackers: the L1
Fig. 2. Tracking comparison results of different algorithms on sequence carNoise (#34, #60, #130, #236, #290). The results of CPF, L1 and LRSR are shown in the rows from top to bottom, respectively.
tracker [3] and the Color-based Particle Filtering (CPF) tracker [4]. In our experiments, for each tracker we used the same parameters for all of the test sequences. We first test our algorithm on a sequence from the DARPA VIVID data collection [32], named VIVID. The car in sequence VIVID frequently moves out of and into the shadow of trees. Fig. 1 shows sample tracking results of the different schemes on this sequence. We can see that the appearance of the car changes frequently and thus the L1 and CPF trackers could not follow the target; however, our proposed LRSR tracker can track the target throughout the sequence. To test the robustness to noise, the sequence carNoise, which is corrupted by Gaussian noise, is used. The comparative results are shown in Fig. 2. We can see that L1 and CPF cannot follow the target. The poor performance results from their adopted appearance models, which are not robust
Fig. 3. The tracking error plot for the sequence carNoise. The error is measured using the Euclidean distance of the two center points, which has been normalized by the size of the target from the ground truth. Green: CPF; Blue: L1; Red: LRSR.
to the noise. Note that the covariance descriptor is robust to the Gaussian noise and improves the performance of the LRSR tracker. To quantitatively evaluate our proposed tracker, we manually labeled the ground-truth bounding box of the target in each frame of the sequence carNoise. The error is measured using the Euclidean distance of the two center points, which has been normalized by the size of the target from the ground truth. Fig. 3 illustrates the tracking error plot for each algorithm. From this figure we can see that although the compared tracking approaches cannot track the blurred target well, our proposed LRSR tracker can track the target robustly. The reason that the LRSR tracker performs well is that it uses the covariance descriptor to fuse different types of features, which improves the representation in the presence of appearance corruption.
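For reference, the normalized center-location error used in Fig. 3 can be computed as in the following sketch (our illustration; the (x, y, w, h) box layout and the use of the ground-truth box diagonal as the "size" are assumptions):

```python
import numpy as np

def normalized_center_error(tracked_boxes, gt_boxes):
    """Euclidean distance between box centers, normalized by the ground-truth size.

    Boxes are (x, y, w, h) per frame; returns one error value per frame.
    """
    tracked_boxes = np.asarray(tracked_boxes, dtype=float)
    gt_boxes = np.asarray(gt_boxes, dtype=float)
    c_tr = tracked_boxes[:, :2] + tracked_boxes[:, 2:] / 2.0   # tracked centers
    c_gt = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2.0             # ground-truth centers
    dist = np.linalg.norm(c_tr - c_gt, axis=1)
    size = np.linalg.norm(gt_boxes[:, 2:], axis=1)             # ground-truth diagonal
    return dist / size
```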
5 Conclusion
In this paper, we have introduced a novel tracking approach based on the proposed Log-Euclidean Riemannian sparse representation (LRSR). Specifically, the target region is characterized by a covariance matrix which is then log-transformed from the Riemannian manifold to the Euclidean subspace. After that, the target tracking problem is integrated under a sparse approximation framework. The sparsity is achieved by solving an ℓ1-regularized least-squares problem. Then the candidate with the smallest projection error is taken as the tracking target. Finally, tracking is continued using a Bayesian state inference framework in which a particle filter is used to propagate sample distributions over time. The experimental results show that our tracker can robustly track targets through occlusions, pose changes, and template variations. Acknowledgment. This work is supported in part by NSF Grants IIS-0916624 and IIS-1049032.
References 1. Isard, M., Blake, A.: Condensation-Conditional Density Propagation for Visual Tracking. Int’l Journal of Computer Vision 29, 5–28 (1998) 2. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online Learning for Matrix Factorization and Sparse Coding. J. Machine Learning Research 11, 19–60 (2010) 3. Mei, X., Ling, H.: Robust Visual Tracking using 1 Minimization. In: ICCV (2009) 4. P´erez, P., Hue, C., Vermaak, J., Gangnet, M.: Color-Based Probabilistic Tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 661– 675. Springer, Heidelberg (2002) 5. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust Face Recognition via Sparse Representation. IEEE T. Pattern Analysis and Machine Intelligence 31(1), 210–227 (2009) 6. Wu, Y., Wu, B., Liu, J., Lu, H.Q.: Probabilistic Tracking on Riemannian Manifolds. In: ICPR (2008) 7. Wu, Y., Wang, J.Q., Lu, H.Q.: Robust Bayesian tracking on Riemannian manifolds via fragments-based representation. In: ICASSP (2009) 8. Wu, Y., Cheng, J., Wang, J., Lu, H.: Real-time Visual Tracking via Incremental Covariance Tensor Learning. In: ICCV (2009) 9. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys 38(4) (2006)
10. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Geometric means in a novel vector space structure on symmetric positive-definite matrices. SIAM J. on Matrix Analysis and Applications 29(1) (2008) 11. Li, X., Hu, W., Zhang, Z., Zhang, X., Zhu, M., Cheng, J.: Visual tracking via incremental Log-Euclidean Riemannian subspace learning. In: CVPR (2008) 12. Pennec, X., Fillard, P., Ayache, N.: A Riemannian framework for tensor computing. Int’l Journal of Computer Vision 66(1), 41–66 (2006) 13. Tuzel, O., Porikli, F., Meer, P.: Region covariance: A fast descriptor for detection and classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 589–600. Springer, Heidelberg (2006) 14. Porikli, F., Tuzel, O., Meer, P.: Covariance tracking using model update based on Lie Algebra. In: CVPR, pp. 728–735 (2006) 15. Donoho, D.: Compressed sensing. IEEE T. Information Theory 52(4), 1289–1306 (2006) 16. Ling, H., Wu, Y., Blasch, E., Chen, G., Lang, H., Bai, L.: Evaluation of Visual Tracking in Extremely Low Frame Rate Wide Area Motion Imagery. Fusion (2011) 17. Chen, M., Pang, S.K., Cham, T.J., Goh, A.: Visual Tracking with Generative Template Model based on Riemannian Manifold of Covariances. Fusion (2011) 18. Liu, B., Yang, L., Huang, J., Meer, P., Gong, L., Kulikowski, C.: Robust and fast collaborative tracking with two stage sparse optimization. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 624–637. Springer, Heidelberg (2010) 19. Mei, X., Ling, H., Wu, Y., Blasch, E., Bai, L.: Minimum Error Bounded Efficient 1 Tracker with Occlusion Detection. In: CVPR (2011) 20. Hong, X., Chang, H., Shan, S., Chen, X., Gao, W.: Sigma set: A small second order statistical region descriptor. In: CVPR, pp. 1802–1809 (2009) 21. Tosato, D., Farenzena, M., Spera, M., Murino, V., Cristani, M.: Multi-class classification on Riemannian manifolds for video surveillance. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 378–391. Springer, Heidelberg (2010) 22. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on Riemannian manifolds. In: CVPR (2007) 23. Paisitkriangkrai, S., Shen, C., Zhang, J.: Fast pedestrian detection using a cascade of boosted covariance features. IEEE T. Circuits & Systems for Video Technology 18(8), 1140–1151 (2008) 24. Pang, Y., Yuan, Y., Li, X.: Gabor-based region covariance matrices for face recognition. IEEE T. Circuits & Systems for Video Technology 18(7), 989–993 (2008) 25. Guo, K., Ishwar, P., Konrad, J.: Action change detection in video by covariance matching of silhouette tunnels. In: ICASSP, pp. 1110–1113 (2010) 26. Cand`es, E., Romberg, J., Tao, T.: Stable signal recovery from incomplete and inaccurate measurements. Commun. on Pure and Applied Mathematics 59(8), 1207–1223 (2006) 27. Baker, S., Matthews, I.: Lucas-kanade 20 years on: A unifying framework. Int’l Journal of Computer Vision 56, 221–255 (2004) 28. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE T. Pattern Analysis and Machine Intelligence 25, 564–577 (2003) 29. Hager, G., Belhumeur, P.: Real-time tracking of image regions with changes in geometry and illumination. In: CVPR, pp. 403–410 (1996) 30. Wu, Y., Blasch, E., Chen, G., Bai, L., Ling, H.: Multiple Source Data Fusion via Sparse Representation for Robust Visual Tracking. Fusion (2011) 31. Zhou, S.K., Chellappa, R., Moghaddam, B.: Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE T. 
Image Processing 11, 1491–1506 (2004) 32. https://www.sdms.afrl.af.mil/index.php?collection=video_sample_set_2
Panoramic Background Generation and Abnormal Behavior Detection in PTZ Camera Networks
Sang-Hyun Cho1 and Hang-Bong Kang2
1 Dept. of Computer Engineering, The Catholic University of Korea, #43-1 Yeokgok 2-dong, Wonmi-Gu, Bucheon, Gyeonggi-do, Korea, [email protected]
2 Dept. of Digital Media, The Catholic University of Korea, #43-1 Yeokgok 2-dong, Wonmi-Gu, Bucheon, Gyeonggi-do, Korea, [email protected]
Abstract. In this paper, we present a novel method for abnormal behavior detection in PTZ camera networks. To extract motion information of the scene in a moving-camera environment, we use a panoramic background which is generated from the captured frames. In contrast with previous methods, we use an MRF framework to integrate temporal and spatial information of the frames to generate the panoramic background. In addition, we introduce a panoramic activity map to detect abnormal behavior of people. The panoramic activity map is useful in various abnormal situations since it includes multiple features such as the location, motion, direction and pace of objects. Experimental results from real sequences demonstrate the effectiveness of our method. Keywords: Background generation, Abnormal behavior detection, Panoramic background.
1
Introduction
Intelligent video surveillance is very important for the security of various sites including many public facilities. In video surveillance systems, PTZ cameras have recently become popular because they have the capability of covering a wider view area and providing more detailed information than conventional cameras. In a PTZ camera environment, object tracking and motion patterns in the scene are very important in analyzing the scene. Background subtraction is a widely used approach for detecting moving objects in videos [1]. Basically, the background image should be the scene with no moving objects. Several methods for generating the background have been proposed in the recent literature. The frame-to-frame method uses the information of the overlapped region between the current observed frame and the previous ones [2]. Kang et al. [3] proposed an adaptive background generation method which used a geometric transform-based mosaic method. The frame-to-global method, on the other hand, uses a panorama map or an added camera to provide global information for foreground detection and tracking [1]. In these methods, only temporal information of the video frames is used to generate the background. However, when the object's motion is small and the direction of the motion is similar to the camera direction, the generated background has ghost effects: small motion of objects causes the foreground to be misperceived as
background. To overcome this problem, we integrate the temporal and spatial information of the video in an MRF framework. Several methods have been proposed to eliminate ghost effects. Wan and Miao [4] propose rearranging the blending area to avoid blending over the moving objects. Shum and Szeliski [5] present a different method to eliminate ghost effects caused by small misalignments. Xiong [6] presents a gradient-domain approach for eliminating ghost effects. Many abnormal-behavior detection methods are based on a general pipeline-based framework: moving objects are first detected, classified and tracked over a certain number of frames, and finally, the resulting paths are used to distinguish “normal” objects from “abnormal” ones. However, they are sensitive to detection and tracking errors and are also computationally complex [7]. Johnson et al. [8] presented a vector-quantization-based approach for learning typical trajectories of pedestrians in the scene. Basharat et al. [9] use object tracking to detect unusual events in image sequences. They use the speed, size and route of persons to model normal behavior. Zweng et al. [10] proposed an unexpected human behavior detection method based on motion information. They use an accumulated hitmap, crowd pace and crowd density as features for unexpected behavior detection of people. However, they do not consider the object's motion direction, which is crucial in detecting abnormal behavior of people. To handle various abnormalities from multiple PTZ cameras, we extract multiple features such as optical flow, frequency of objects, and the location of objects from the reconstructed panoramic scene and then generate a panoramic activity map. Based on the panoramic activity map, we detect abnormalities in the location, pace and direction of people. In this paper, we present a novel method for abnormal behavior detection in PTZ camera networks.
2 Panoramic Background Generation and Abnormal Behavior Detection in PTZ Camera Networks
2.1 Overview
The overview of our method is shown in Figure 1. In the training phase, we first generate a panoramic image as the background using the temporal and spatial context of the frames. Since the cameras sometimes zoom in and out on their targets, the input frames have different resolutions. To solve this, we register them using SURF [11] due to its scale- and rotation-invariant properties. Then a normal panoramic activity
Fig. 1. Overview of our approach
map is constructed using multiple features. From the video input, we construct the actual panoramic activity map and detect abnormal behavior using the deviation of multiple motion-based features between the normal situation and the actual situation.
2.2 Panoramic Background Generation Using Temporal and Spatial Context of Frames
To generate a panoramic image from a sequence of images, we use the image stitching method proposed by M. Brown [12,13]. Figure 2 shows the result of image alignment of adjacent frames. However, as shown in Figure 2 (a), moving objects may cause “ghost” effects in the resulting image. A “ghost” is a set of connected points obtained by means of background subtraction which, however, does not match any real moving object [2]. One approach to solve this problem is to select key frames with no moving objects, but this is not a simple task in some situations. Many methods have been proposed to generate a ‘clean’ background image. However, most of them use only the temporal information of the video frames, so the result is sometimes not desirable due to objects existing only for a short period. Thus, we propose a new method to generate a ‘clean’ panoramic background by integrating the temporal and spatial information of the video frames. To extract temporally static pixels, we use the temporal deviation of the pixel values in the panoramic space. From the static pixels, we construct a candidate background as shown in Figure 2 (b). To fill the holes in the candidate background, we introduce a pixel memory and a Markov Random Field (MRF) framework. The pixel memory preserves the temporal history of each pixel. The MRF is used to model the relationship among pixels. An MRF is a graphical model in which a set of random variables has a Markov property described by an undirected graph. From each pixel in the pixel memory, the best background pixel is chosen by the potential function E; that is, the best background pixel is extracted by minimizing the potential function. We use a single-node potential function φ(P) and a pair-wise node potential function ψ(P, P′) defined as follows:
φ(P) = (1 / (√(2π) σ)) exp( −D(P, CB) / (2σ²) ) ,    (1)
where D(·) is the Bhattacharyya distance and CB is the extracted candidate background image.
ψ(P, P′) = (1 / (√(2π) σ)) exp( −D(P, P′) / (2σ²) )    (2)
The potential function φ measures how well pixel P belongs to the static background, and ψ measures the similarity between pixel P and its neighbor P′. Usually, the temporally most frequent pixel color is the background pixel color in the sequence.
To reflect this, we introduce a temporal frequency term for each pixel in the potential function:
ρ(P) = 1 − (frequency of P) / (sequence length)    (3)
Thus, the potential function of our proposed model can be expressed as follows:
E = Σ_P α ρ(P) + Σ_P β_P φ(P) + Σ_P Σ_{P′∈N(P)} γ_{P′} ψ(P, P′) ,    (4)
where α + β_P + Σ_{P′∈N(P)} γ_{P′} = 1. Here α is the weight of the temporal frequency term, β_P is the weight of the single-node potential function φ(·) and γ_{P′} is the weight of the pair-wise node potential function ψ(·). α and β_P are empirically determined.
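The following sketch (our illustration, not the authors' code) shows how a background color could be selected per pixel from the pixel memory using the terms of (1)–(4); the Euclidean color distance used in place of the Bhattacharyya distance, the 3×3 neighborhood, and the weights are assumptions, and since the Gaussian potentials grow when the distance shrinks, the sketch scores candidates and keeps the best-scoring one.

```python
import numpy as np

def gaussian_potential(dist, sigma=10.0):
    """Potentials of eqs. (1)-(2): a Gaussian of the distance between pixel values."""
    return np.exp(-dist / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def select_background_pixel(memory, counts, cand_bg_patch, alpha=0.2, beta=0.4, gamma=0.1):
    """Pick the stored value best fitting eq. (4) for the centre pixel of a 3x3 patch.

    memory        : (K, 3) candidate colors stored for this pixel over time
    counts        : (K,) how often each candidate color was observed
    cand_bg_patch : (3, 3, 3) candidate background colors of the pixel and its 8 neighbors
    """
    rho = 1.0 - counts / counts.sum()                 # eq. (3); counts.sum() ~ sequence length
    center = cand_bg_patch[1, 1]
    neighbors = np.delete(cand_bg_patch.reshape(-1, 3), 4, axis=0)
    d_cb = np.linalg.norm(memory - center, axis=1)    # distance to candidate background
    phi = gaussian_potential(d_cb)
    psi = np.array([gaussian_potential(np.linalg.norm(m - neighbors, axis=1)).sum()
                    for m in memory])
    score = alpha * (1.0 - rho) + beta * phi + gamma * psi
    return memory[np.argmax(score)]
```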
Fig. 2. (a) Result of Image alignment of adjacent frames (b) Candidate background image
Figure 3 shows the comparison between related methods and our method. Background images obtained using the mean, median and mode show some defects in the resulting image due to foreground pixels. However, our proposed method shows a desirable result because it takes into account not only temporal but also spatial information of the frames.
2.3 Abnormal Behavior Detection
In our system, abnormality is detected from the abnormal direction and location, the time duration spent at a specific location, and the over-pace of people. To detect abnormal behaviors, we construct a panoramic activity map in which the objects' location, direction and pace are shown. In our map, each pixel of the panoramic activity map represents the foreground frequency, obtained by setting the corresponding points in the map to the number of consecutive foreground detections. The value of the activity map indicates the duration of stay of an object in the scene at the given position. The motion direction provides the desirable direction of movement of objects. To detect abnormal movement of people, we use the difference between the panoramic activity maps, maximum paces, maximum densities of objects, and normal
752
S.-H. Cho and H.-B. Kang
motion directions. At first, the difference of the non-weighted panoramic activity maps is computed as
Fig. 3. Background generation result comparison. (a) original video frame (b) mean (c) median (d) mode (e) proposed.
A_diff(x, y) = A_t(x, y) − A_o(x, y) ,    (5)
where A_t(x, y) is the normal activity map and A_o(x, y) is the actual computed activity map at location (x, y). Then the difference is weighted with the information from the normal activity map.
A_unexp(x, y) = (1 + α / (max(A_t) − A_o(x, y)))^(A_diff(x, y) / 2)   if A_o(x, y) > A_t(x, y), and 0 otherwise,    (6)
where A_o(x, y) = max(A_t) − 1 if A_o(x, y) ≥ max(A_t), max(A_t) is the maximum value in the normal panoramic activity map and α is a regulating parameter. To compute the pace of foreground objects, we use the non-overlapping regions of blobs in consecutive frames, obtained with a logical XOR operation. The non-overlapping regions grow with increased pace, and the pace can be computed as follows:
pace = ( Σ_{x=1}^{w−1} Σ_{y=1}^{h−1} xor(I^p_{x,y}, I^c_{x,y}) ) / ( Σ_{x=1}^{w−1} Σ_{y=1}^{h−1} or(I^p_{x,y}, I^c_{x,y}) ) ,    (7)
where w is the width of the image, h is the height of the image, I^p is the binary image of the previous frame and I^c is the binary image of the current frame. Abnormalities of the moving pace are detected from people who are moving unexpectedly fast. The pace of people is modeled by computing the maximum pace of the training video data and storing the maximum values. The density of people or a crowd is calculated by counting the foreground pixels in an area based on the following equation:
density_{v,w} = ( Σ_{x=0}^{area_w} Σ_{y=0}^{area_h} I_{x,y} ) / ( area_w · area_h ) ,    (8)
where v is the horizontal dimension of the area, w is the vertical dimension of the area, area_w is the width of the area and area_h is the height of the area. Abnormalities of the density of people occur when many people are found in a certain area. An area with a low density of people is not abnormal, since people are not expected in all areas. So, to limit the abnormality of the density of people, we use the maximum value of the density of people for each area. The motion direction of people is calculated by an optical flow algorithm. We put a grid of N particles over the foreground regions in the panoramic activity map. The desirable optical flow of each particle is modeled by a Gaussian in the training phase. Abnormalities of the direction of motion are detected from the deviation between the current optical flow and the normal one modeled by the Gaussian in the training phase.
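A compact sketch of these motion-based features — the unexpectedness weighting of (6), the pace of (7) and the density of (8) — written by us for illustration; the block size, α, and the array conventions (binary foreground masks, activity maps as 2-D arrays) are assumptions.

```python
import numpy as np

def unexpectedness_map(A_t, A_o, alpha=1.0):
    """Eq. (6): weight the activity-map difference where the actual map exceeds the trained one."""
    A_o = np.minimum(A_o, A_t.max() - 1)                       # clamp as described after eq. (6)
    A_diff = A_t - A_o                                         # eq. (5)
    weight = (1.0 + alpha / (A_t.max() - A_o)) ** (A_diff / 2.0)
    return np.where(A_o > A_t, weight, 0.0)

def pace(fg_prev, fg_curr):
    """Eq. (7): ratio of non-overlapping to total foreground of two binary masks."""
    fg_prev = fg_prev.astype(bool)
    fg_curr = fg_curr.astype(bool)
    return np.logical_xor(fg_prev, fg_curr).sum() / max(np.logical_or(fg_prev, fg_curr).sum(), 1)

def density(fg, area_w=32, area_h=32):
    """Eq. (8): fraction of foreground pixels per (area_h x area_w) block."""
    h, w = fg.shape
    h_c, w_c = h - h % area_h, w - w % area_w                  # crop to full blocks
    blocks = fg[:h_c, :w_c].reshape(h_c // area_h, area_h, w_c // area_w, area_w)
    return blocks.mean(axis=(1, 3))                            # one density value per block
```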
3
Experimental Result
Our proposed method was tested on an Intel Core i5 750 PC with 320 × 240 real sequences captured from three cameras with different views. For each camera view, the current frame is registered into our panoramic background when the rate of the overlapped region between the current frame and the panoramic background is less than 70%. The training-phase results for each view are shown in Figure 4. The panoramic background of each view in the normal situation is generated by the proposed method, as shown in Figure 4 (b). A current panoramic scene is reconstructed by projecting the current frame into the panoramic background. The normal panoramic activity map is constructed from the panoramic training image sequence. Then the normal motion-based features are encoded into the panoramic activity map as shown in Figures 4 (d) and (e). Particles are marked by the blue circles in Figure 4 (e). The optical flow of the scene is calculated at the locations of these particles. The optical flow over the whole image is not required for abnormal behavior detection because enough information to detect abnormality is provided by the panoramic activity map.
Abnormality in the flow of people is also detected from different motion directions in the foreground. For example, as shown in Figure 5, a man is walking in the direction opposite to that in the training video. Although the location of the man is similar to that in the training video, the direction of motion of the man in the input video is different. As a result, we detect the abnormal behavior of the man. The pace difference is detected from the difference of the segmented binary foregrounds between adjacent frames. Figure 6 shows the abnormal detection result for the over-pace of an object. For instance, four men are running in the test sequence. The pace of the foreground objects is calculated by Eq. (7) in the reconstructed panoramic scene. As shown in Figure 6 (e), in most of the running sequence, the pace of the running people has a higher value than that of people who are walking in the training sequence. To deal with occlusion, we assume that every camera in the network is synchronized. Figure 7 shows the result of abnormal detection by the deviation between the normal activity map and the actual activity map in an occlusion situation. Occlusions are one of the major problems for event detection but can be addressed by using multiple cameras. Figure 7 shows a crowd in some region. The crowded people are hidden by a thicket in view 2. Hence, the abnormality is not detected by the deviation between the trained activity map and the actual activity map in view 2. However, the abnormality is detected by the deviation between the trained activity map (Figure 4 (d)) and the actual activity map (Figure 7 (e)) in view 3, since view 3 is not in an occlusion situation.
Fig. 4. Result of the training phase of the test sequence (columns: view 1, view 2, view 3; rows: (a) original video; (b) generated panoramic background; (c) reconstructed panoramic scene; (d) panoramic optical flow map; (e) generated panoramic activity map)
Fig. 5. Abnormal motion direction detection result in view 1 of the test sequence: (a) original video; (b) detected abnormal scene; (c) panoramic activity map
Fig. 6. Abnormal pace detection result of the test sequence: (a) original video in view 2; (b) generated panoramic scene in view 2; (c) original video in view 3; (d) generated panoramic scene in view 3; (e) foreground pace comparison in view 3
Fig. 7. Abnormal crowd detection result in an occlusion situation in the test sequence: (a) original video in view 2; (b) generated panoramic scene in view 2; (c) original video in view 3; (d) generated panoramic scene in view 3; (e) generated hitmap in view 3
4
Conclusions
In this paper, we present a novel method for abnormal behavior detection in PTZ camera networks. Our method generates a panoramic background image based on the MRF framework in which temporal and spatial information of video are integrated to reduce ghost effects. To detect abnormal behaviors on the pace, direction and density of people in the PTZ camera environment, we construct a panoramic activity map. The motion features are trained and used to detect abnormalities and compared to the
normal situation. In the future, we will extend our panoramic background generation method to the region level instead of the pixel level and detect various social interactions among people. Acknowledgement. This work was supported by Defense Acquisition Program Administration and Agency for Defense Development under the contract UD1000011D.
References 1. Piccardi, M.: Background subtraction techniques: a review. In: 2004 IEEE International Conference on Systems, Man and Cybernetics, October 10-13, vol. 4, pp. 3099–3104 (2004) 2. Xue, K., Liu, Y., Chen, J., Li, Q.: Panoramic background model for PTZ camera. In: 2010 3rd International Congress on Image and Signal Processing (CISP), October 16-18, vol. 1, pp. 409–413 (2010) 3. Kang, S., Paik, J., Koschan, A., Abidi, B., Abidi, M.A.: Real-time video tracking using PTZ cameras. In: Proc. of SPIE 6th International Conference on Quality Control by Artificial Vision, Gatlinburg, TN, vol. 5132, pp. 103–111 (May 2003) 4. Wan, Y., Miao, Z.: Automatic panorama image mosaic and ghost eliminating. In: IEEE International Conference on Multimedia and Expo, 2008, pp. 945–948. IEEE, Los Alamitos (2008) 5. Shum, H.-Y., Szeliski, R.: Construction of panoramic image mosaics with global and local alignment. In: Panoramic Vision: Sensors, Theory, and Applications, pp. 227–268. Springer-Verlag New York, Inc., Secaucus (2001) 6. Xiong, Y.: Eliminating ghosting artifacts for panoramic images. In: 11th IEEE International Symposium on Multimedia, ISM 2009, pp. 432–437 (2009) 7. Ermis, E.B., Saligrama, V., Jodoin, P.-M., Konrad, J.: Abnormal behavior detection and behavior matching for networked cameras. In: Second ACM/IEEE International Conference on Distributed Smart Cameras, ICDSC 2008, September 7-11, pp. 1–10 (2008) 8. Johnson, N., Hogg, D.: Learning the distribution of object trajectories for event recognition. In: BMVC (1995) 9. Basharat, A., Gritai, A., Shah, M.: Learning object motion patterns for anomaly detection and improved object detection. In: CVPR 2008, pp. 1–8 (2008) 10. Zweng, A., Kampel, M.: Unexpected Human Behavior Recognition in Image Sequences Using Multiple Features. In: 20th International Conference on Pattern Recognition, 2010, pp. 368–371 (2010) 11. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU) 110(3), 346–359 (2008) 12. Brown, M., Lowe, D.: Automatic Panoramic Image Stitching using Invariant Features. International Journal of Computer Vision 74(1), 59–73 (2007) 13. Brown, M., Lowe, D.G.: Recognising Panoramas. In: Proceedings of the 9th International Conference on Computer Vision (ICCV 2003), Nice, France, pp. 1218–1225 (2003)
Computing Range Flow from Multi-modal Kinect Data
Jens-Malte Gottfried1,3, Janis Fehr1,3, and Christoph S. Garbe2,3
1 Heidelberg Collaboratory for Image Processing (HCI), University of Heidelberg
2 Interdisciplinary Center for Scientific Computing (IWR), University of Heidelberg
3 Intel Visual Computing Institute (IVCI), Saarland University, Saarbrücken
Abstract. In this paper, we present a framework for range flow estimation from Microsoft’s multi-modal imaging device Kinect. We address all essential stages of the flow computation process, starting from the calibration of the Kinect, over the alignment of the range and color channels, to the introduction of a novel multi-modal range flow algorithm which is robust against typical (technology dependent) range estimation artifacts.
1
Introduction
Recently, Microsoft's novel low-cost, multi-modal depth imaging device Kinect has drawn a lot of attention in the computer vision and related communities (such as robotics). In the rather short time span of its availability, Kinect has already triggered numerous computer vision applications, mostly in the area of human-computer interaction. In this paper, we present a framework for the estimation of range flow fields from multi-modal (depth + color) video sequences captured by Kinect. The computation of optical flow [2] from 2D image sequences or range flow [11] from depth image sequences plays an important role in the middle layer of a wide range of computer vision algorithms, such as object tracking, camera motion estimation or gesture recognition. Flow estimation has been investigated for a long time and many sophisticated algorithms (especially for optical flow) have been introduced so far [2]. However, due to the Kinect technology, standard range flow algorithms cannot simply be used “out of the box” – which makes the computation of range flow fields from this device a non-trivial task. The main difficulties result from the facts that Kinect provides only uncalibrated data where color and depth channels (which are recorded by different cameras) are not aligned, that the depth channel may contain large areas of invalid values and that edges in the depth-map are not always stable. The main contribution of this paper is twofold: First, we introduce a novel channel alignment algorithm which largely reduces image areas without valid measurements compared to previous methods. Secondly, we extend existing range flow approaches to cope with invalid and unstable depth estimates. The proposed methods are intended to be applied to data captured with the Kinect
Fig. 1. Kinect device and its raw data output. Top: Kinect device with projector (A), color cam (B) and IR cam (C); color and depth channel overlayed. Bottom (left to right): raw color image; pseudo-color coding of the 11bit depth image provided by the on chip depth estimation; raw IR image showing the projected point pattern.
device, but should work in any multi-modal setting where different cameras are used to capture the projected pattern and the color image. Related Work. To the best of our knowledge, there has not been any publication on range flow estimation from Kinect data. However, there have been several approaches that solve some of the algorithmic steps in our method independently. We use the open source driver [6] to access and capture Kinect data. The calibration and alignment of the color and depth channels has been addressed by [4]. The drawback of this method is that the resulting data set still contains large areas of invalid values, which leads to poor flow estimation results (see section 5). The proposed method for the computation of the actual range flow is based on ideas introduced by [11], combined with ideas from [12]. The remainder of the paper is organized as follows: Section 2 provides a discussion of Kinect's depth imaging technology and points out the main problems that have to be solved in order to use the raw data for range flow estimation. Section 3 introduces the calibration process and our novel algorithm for the alignment of depth and color channels, before the actual flow algorithm is discussed in section 4. Finally, results and experimental evaluations are presented in section 5.
2
A Brief Introduction to the Kinect Imaging Hardware
Kinect’s depth imaging device is based on a structured-illumination approach [3]. The range information is estimated from the distortion of a projected point pattern which is captured with a camera placed at a certain baseline distance from the projector. Figure 1 shows the hardware setup: Kinect uses an IR laser diode to project a fixed point pattern which is invisible to the human eye. In combination with an
IR camera, the estimation of the depth-map is computed directly in hardware at approximately 30 frames per second at VGA resolution1. The depth resolution is approximately 1 cm at the 2 m optimal operation range [6]. The main advantage of the Kinect concept is that it allows a more or less dense depth estimation of a scene even if it contains objects with little or no texture. Additionally, a second camera captures VGA color images from a third position.
2.1 Limitations of Kinect Data
The main problem, at least from a computer vision perspective, is the computation in hardware which is not user accessible nor can it be circumvented altogether. Hence, the entire process is a “black-box”, with very little publicly known details on the obviously extensive post-processing. It most likely includes some sort of edge preserving smoothing and up-scaling of the depth-map. This has significant impact on the noise characteristics and correlations. The provided depth estimation is more or less dense, but there is a systematic problem common to all structured-light approaches which use a camera offset: there are regions where the projected pattern is shadowed by foreground objects – making it impossible to estimate the depth at these positions (see fig. 1, depth and IR image). Another problem is that depth values tend to be unstable and inaccurate at object boundaries. This is caused by the fact that the dense depthmap is interpolated from discrete values measured at the positions of projected point patterns. Finally, since there are two different cameras capturing color and depth images, these images are not necessarily aligned to each other (fig. 1 top right). In addition, the two cameras have slightly different focal length and the optical axes are not perfectly parallel.
3
Kinect Calibration and Data Alignment
Since we are proposing a multi-modal flow algorithm (see section 4), it is essential that the color and depth image information at a certain image location belong to the same object point. In this section, we introduce a novel alignment algorithm which is especially suitable for Kinect data. Our approach is based on previous methods by [4], but performs a more complex inverse mapping from the depth image onto the color image – whereas [4] uses the straightforward mapping of the color image onto the depth image. The advantage of our method is that we are still able to compute at least the xy-flow for areas with invalid depth values, whereas all information is lost in such areas if one applies the original alignment approach. Figure 2 shows examples for both approaches.
Note that VGA resolution is probably a result of the extensive on-chip postprocessing. The true hardware resolution is bounded by the number of projected points, which is much lower than VGA.
3.1 Camera Calibration
In a first step, we perform a stereo-calibration of the cameras. Since camera calibration is a common task, we will not discuss this part in detail. We simply use a standard checker-board target with good IR reflection properties and extra illumination in the IR spectrum and apply a standard stereo-calibration procedure as provided in [7].
3.2 Data Alignment
The actual data alignment algorithm is based on the assumption that the raw depth values provided by Kinect are linearly correlated with the point-wise disparity d between pixels in the color image and their corresponding raw depth values z (just like one would expect in a standard stereo setting). We use a PCA over the positions of the checker-board corners from the calibration process to obtain this linear map d(z) = a · z + b. The original approach in [4] showed that it is easy to warp the color image I(x, y) such that it is aligned with the depth image Z(x, y) by use of the disparity field D(x, y) = d(Z(x, y)):
Ĩ(x, y) = I(x + D(x, y), y)
(1)
where Ĩ(x, y) is the warped color image. Since x + d may correspond to fractional pixels, one has to interpolate the integer pixel values of I. For regions where Z(x, y) has no valid depth value, no value for Ĩ can be computed. Such regions are visible e.g. at the shadow of the hand shown in fig. 1. They have to be marked, e.g. by setting the color to black. Hence, the color image information in these regions is completely lost. Therefore, we propose not to modify the color image, but to invert the mapping and align the depth image to the color image. In order to do this, we have to compute the inverse disparity field D∗(x, y) such that
D(x, y) = −D∗(x + D(x, y), y)
(2)
This problem is similar to the computation of the inverse optical flow h*(x) as proposed in [9, appendix A]. Here, we consider a special case of this approach because the disparity is known to be limited to the x-direction only (i.e. the y-component of the flow h vanishes). Using this simplification, the ideas of [9] may be reformulated (for the i-th pixel in x-direction, i.e. at position (x_i, y)) as

\[ D^{*}(x_i, y) = -\frac{\sum_j D(x_j, y)\, p\big(x_i,\, x_j + D(x_j, y)\big)}{\sum_j p\big(x_i,\, x_j + D(x_j, y)\big)} \tag{3} \]

where the weighting function p is computed using

\[ p(x, x') = \max\{0,\; r - |x - x'|\}. \tag{4} \]

The radius r specifies the region of influence of each pixel. Invalid depth values D(x_j, y) have to be excluded from the summation. For some target positions, the
denominator can become very small (i.e. if no depth values are warped to this position). In this case we mark D*(x_i, y) as invalid as well. Using D* we compute the alignment of the depth map Z(x, y) to I(x, y) in a similar way as before in (1):

\[ \tilde{Z}(x, y) = Z(x + D^{*}(x, y),\, y) \tag{5} \]

Figure 2 shows a qualitative comparison of the results of our proposed method and the pure stereo calibration approach. Our calibration/alignment software will be published as open source together with this paper (link: [1]). Another approach to align the data could be a re-projection of the 3D data points given by the IR camera using the projection matrix known from the calibration step. The problem here is that the raw depth values (as given by the device) are not the real z-coordinate values but proportional to the point-pattern disparity. Several different methods have been proposed for computing z-values from this raw depth. Our proposed method uses the fact that the raw depth values are directly proportional to the pixel shift between the images and hence avoids the problem of computing accurate per-pixel z-coordinates.
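For illustration, the following NumPy sketch implements eqns. (2)-(5): the inverse disparity field D* is accumulated row by row with the triangular weights p of (4), and the depth map is then warped onto the color image with linear interpolation along x. The linear coefficients a and b, the invalid-depth marker, and the radius r are placeholder assumptions.

import numpy as np

INVALID = 2047   # raw Kinect value marking missing depth (largest 11-bit integer)

def inverse_disparity_row(D, valid, r=2.0):
    # Invert one row of the disparity field (eqn. 3) using the
    # triangular weighting function p of eqn. (4).
    n = D.shape[0]
    num = np.zeros(n)
    den = np.zeros(n)
    xs = np.arange(n)
    for j in xs[valid]:
        target = j + D[j]                       # position x_j + D(x_j, y)
        lo = max(0, int(np.floor(target - r)))
        hi = min(n, int(np.ceil(target + r)) + 1)
        if lo >= hi:
            continue
        w = np.maximum(0.0, r - np.abs(xs[lo:hi] - target))
        num[lo:hi] += D[j] * w
        den[lo:hi] += w
    D_star = np.full(n, np.nan)                 # NaN marks invalid target positions
    ok = den > 1e-6
    D_star[ok] = -num[ok] / den[ok]
    return D_star

def align_depth_to_color(Z_raw, a, b):
    # Warp the raw depth map onto the color image (eqn. 5),
    # given the linear map d(z) = a*z + b from the calibration.
    H, W = Z_raw.shape
    valid = Z_raw != INVALID
    D = a * Z_raw.astype(np.float64) + b        # disparity field D(x, y)
    Z_aligned = np.full((H, W), np.nan)
    for y in range(H):
        D_star = inverse_disparity_row(D[y], valid[y])
        x_src = np.arange(W) + D_star           # fractional source positions x + D*
        ok = ~np.isnan(x_src) & (x_src >= 0) & (x_src <= W - 1)
        src = x_src[ok]
        x0 = np.floor(src).astype(int)
        x1 = np.minimum(x0 + 1, W - 1)
        frac = src - x0
        good = valid[y, x0] & valid[y, x1]      # only interpolate between valid samples
        cols = np.where(ok)[0][good]
        Z_aligned[y, cols] = ((1 - frac[good]) * Z_raw[y, x0[good]]
                              + frac[good] * Z_raw[y, x1[good]])
    return Z_aligned

In practice the per-row loops would be vectorized, but the structure above follows eqns. (3)-(5) directly.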
Fig. 2. Comparison of calibration results. In gray overlay regions, the depth values are invalid. In black regions, there are neither valid color nor depth values. Columns (left to right): uncalibrated; pure stereo calibration (epipolar); aligned by warping the color image with D; aligned by warping the depth image with D*.
4 Range Flow
Range flow (syn.: scene flow [14]) is the established term for the 2.5D extension of optical flow, describing the local 3D motion in an image sequence. Mathematically, the range flow field is a 3D vector field h_R which is defined on a 2D image plane, i.e.

\[ h_R : \mathbb{R}^2 \to \mathbb{R}^3, \qquad h_R(x, y) = (u, v, w)^T, \qquad h(x, y) = (u, v)^T, \tag{6} \]
where the first two components (u, v) are identical to the motion description in standard 2D optical flow h and w encodes the motion in the depth direction.
Numerous algorithms have been proposed in the literature to solve for the standard optical flow (u, v) (e.g. see [2]). For the sake of simplicity, we base our proposed method on a refined standard method for global optical flow [12] (which itself is a reinterpretation of the classical flow paper by Horn and Schunck [5]). It should be noted that our proposed algorithm (see section 4.1) should also work with most other global methods (such as [8] or [13]). However, since we focus on the range extension for Kinect data, we are keen to keep the 2D terms as simple as possible. Moreover, most of the “more advanced” techniques focus on increased sub-pixel accuracy, which is not very likely to be a useful property given the accuracy of real-world Kinect data. The method of [12] applies a pixel-wise brightness constancy assumption called the optical flow constraint (OFC/BCC) that may be formulated as

\[ \nabla I \cdot (u, v)^T + I_t = 0 \;\Leftrightarrow\; (I_x, I_y, 0, I_t) \cdot (u, v, w, 1)^T = 0 \tag{7} \]
where I is the 2D image data (here: the color image converted to gray-scale) and the indices denote derivatives with respect to the specified variable. As proposed by [11], a similar term may be formulated for the depth data Z, adding the motion in depth direction w:

\[ \nabla Z \cdot (u, v)^T + w + Z_t = 0 \;\Leftrightarrow\; (Z_x, Z_y, 1, Z_t) \cdot (u, v, w, 1)^T = 0 \tag{8} \]
This equation is called the range flow motion constraint (RFMC). As one can see, the 2D terms in eqn. (7) and the depth term in (8) are based on the same principle. Using this fact, range flow estimation from color (converted to gray-scale) images and depth data may be performed with any optical flow estimation algorithm extended by an additional data term incorporating eqn. (8).

4.1 Robust Flow Estimation
The multi-modal data which is computed by our data alignment algorithm (see section 3.2) has a dense color channel. Still, there can be invalid values in the depth channel and artifacts at object borders. Both kinds of artifacts do not remain constant in time, resulting in estimated depth changes even if there is no motion. Therefore it is essential to exclude these regions for a robust flow computation. Regions with invalid depth values may be recognized by simply thresholding the depth channel (in the raw depth output, these pixels are marked by the hardware with an integer depth value of 2047 = 0x7FF, i.e. the largest 11-bit integer value). Object borders may be recognized by thresholding the edge strength (Z_x^2 + Z_y^2). If at least one of these threshold conditions is met, the pixel is excluded from the RFMC data term (8), i.e. only the color image data is used for the computation at this position. Since the linear filters used to compute the derivatives of the depth image usually have a width of 3 pixels (Sobel or Scharr filters [10]), this exclusion region has to be extended, e.g. using the morphological dilation operator. A radius of 2 pixels proved to be sufficient. In the excluded regions, the value of w is interpolated from the valid neighbours by regularization of the range flow field.
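A short sketch of this mask construction is given below (raw-depth test, Scharr edge strength, and a dilation of radius 2); the numerical edge threshold is an illustrative placeholder.

import cv2
import numpy as np

def exclusion_mask(Z_raw, edge_thresh=25.0, dilate_radius=2):
    # Mark pixels whose depth is invalid (0x7FF) or that lie on depth edges.
    invalid = (Z_raw == 2047)                       # hardware marker for missing depth
    Z = Z_raw.astype(np.float32)
    Zx = cv2.Scharr(Z, cv2.CV_32F, 1, 0)            # derivatives of the depth image
    Zy = cv2.Scharr(Z, cv2.CV_32F, 0, 1)
    edges = (Zx ** 2 + Zy ** 2) > edge_thresh ** 2  # edge-strength threshold
    excluded = (invalid | edges).astype(np.uint8)
    # grow the excluded region to cover the support of the derivative filters
    kernel = cv2.getStructuringElement(
        cv2.MORPH_ELLIPSE, (2 * dilate_radius + 1, 2 * dilate_radius + 1))
    excluded = cv2.dilate(excluded, kernel)
    return excluded.astype(bool)                    # True where the RFMC term is dropped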
Strong regularization leads to smooth flow fields but also causes blurring effects at motion edges. Since motion edges often correspond to edges in the depth image, the estimation results can be improved further by using the exclusion mask: strong regularization in valid regions and weak regularization in excluded regions yields much sharper motion edges and a better separation between foreground and background motion. This adaptive regularization is another benefit of using the depth image information.

4.2 Algorithm Summary
The final implementation of our approach is realized in a standard pyramid scheme with two levels of iteration. The outer iteration implements the multi-scale image pyramid. The inner iteration (line 10 of algo. 1) recomputes the flow increments on the given input image pairs (I_1, I_2, Z_1, Z_2), where the second pair has been warped with the flow h_R^0 computed during the previous iteration. This is done by minimizing the following energy functional:

\[ \int (I_x u + I_y v + I_t)^2 + \lambda_Z(x, y)\,(Z_x u + Z_y v + w + Z_t)^2 + \lambda_R(x, y)\,\big(|\nabla u|^2 + |\nabla v|^2 + |\nabla w|^2\big)\, dx\, dy \tag{9} \]

where

\[ \lambda_Z(x, y) = \begin{cases} c_Z > 0 & \text{where } M = 0 \\ 0 & \text{else} \end{cases} \qquad \lambda_R(x, y) = \begin{cases} c_{R1} > 0 & \text{where } M = 0 \\ c_{R2} > 0 & \text{else} \end{cases} \tag{10} \]
with c_{R1} ≫ c_{R2}. Using weak regularization in invalid regions (where M = 1) may seem counterintuitive. As stated above, however, the mask M also serves as an edge detector, so weak regularization in invalid regions leads to sharper motion borders at edges.
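For illustration only, the following sketch minimizes a discretized version of energy (9) by plain gradient descent, treating λ_R as locally constant inside the smoothness term; the boolean array `excluded` encodes the excluded regions of section 4.1, and the step size and the weights c_Z, c_R1, c_R2 are placeholder values rather than the solver settings used for the results below.

import cv2
import numpy as np

def flow_increment(Ix, Iy, It, Zx, Zy, Zt, excluded,
                   steps=200, tau=1e-3, c_z=1.0, c_r1=50.0, c_r2=0.5):
    # Gradient descent on a discretized version of energy (9) for the
    # flow increments (u, v, w); all derivative images are float64 arrays.
    lam_z = np.where(excluded, 0.0, c_z)    # drop the RFMC term in excluded regions
    lam_r = np.where(excluded, c_r2, c_r1)  # weak vs. strong regularization
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    w = np.zeros_like(Ix)
    lap = lambda f: cv2.Laplacian(f, cv2.CV_64F)
    for _ in range(steps):
        r_i = Ix * u + Iy * v + It              # OFC residual, eqn. (7)
        r_z = Zx * u + Zy * v + w + Zt          # RFMC residual, eqn. (8)
        du = 2 * (Ix * r_i + lam_z * Zx * r_z) - 2 * lam_r * lap(u)
        dv = 2 * (Iy * r_i + lam_z * Zy * r_z) - 2 * lam_r * lap(v)
        dw = 2 * lam_z * r_z - 2 * lam_r * lap(w)
        u -= tau * du
        v -= tau * dv
        w -= tau * dw
    return u, v, w

In the framework itself this inner step is embedded in the pyramid and warping scheme of Algorithm 1 below and is followed by a median filter on h_R.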
Algorithm 1. Final Kinect Range Flow Algorithm
1: for all multi-modal image pairs I1, I2, Z1, Z2 do
2:   SCALE I1, I2, Z1, Z2 and h_R^0 accordingly
3:   for all flow iterations do
4:     WARP I2, Z2 with h_R^0 → Ĩ2, Z̃2
5:     if Z1 or Z̃2 invalid then
6:       SET exclusion mask M = 0
7:     else
8:       SET exclusion mask M = 1
9:     end if
10:    COMPUTE FLOW h_R using the OFC (7) and RFMC (8) data terms with strong regularization where M = 1, and the OFC only with weak regularization where M = 0
11:    APPLY MEDIAN FILTER on h_R to suppress outliers
12:    SET h_R → h_R^0
13:   end for
14: end for

5 Evaluation

Unfortunately, there is no publicly available database with ground-truth range flow fields for Kinect data, as there is for 2D optical flow (e.g. as provided by [2]). Hence, no generally agreed and sufficiently accurate methodology for quantitatively analyzing our algorithm is available. Therefore we give a qualitative discussion of our results only.

5.1 Results

Qualitative results of our range flow estimation are shown in fig. 3. A moving hand sequence has been recorded. The rows in this figure represent different algorithm configurations, i.e. combinations of the optional usage of the region validity masks and the usage of the forward or inverse disparity (D or D*) for warping. The first column shows the first frame of the image pair as used for range flow estimation. In the second column, the first two components of the range flow result h_R, i.e. the standard optical flow h, are visualized. The distance between the arrows is 16 px; an arrow length of 5 px corresponds to a flow magnitude of 1. For better visualization of the flow contours, an HSV representation using the flow angle as hue and the flow magnitude as saturation has been drawn in the background. The third column shows the depth dimension of h_R, i.e. w. The last column shows the raw depth data with the exclusion mask. Since the depth value of invalid pixels is 0x7FF, these pixels appear bright. Note that the exclusion mask also contains edges of the depth image; the contours of the hand are reproduced well. Since strong regularization is applied within connected valid depth regions, the flow result in the first two rows is smooth over the hand area as well as in the background. The hand contours are reproduced well; only slight blurring effects at the borders are visible. Without the masks, outliers show up both in the (u, v) channels and in the depth channel w, as visible in the last two rows. We have not yet tuned the algorithm to be highly parallelized or to use GPU computing, so the runtime is currently about half a minute per frame pair. Significant speed improvements are expected from such an optimized implementation. For testing, we used a very simple sequence that simulates the targeted application area of gesture recognition. Future research should focus on generating more realistic test sequences with given ground truth such that a quantitative analysis becomes possible.
Fig. 3. Range flow estimation result for an image pair of a moving hand sequence. Rows (top to bottom): warping depth with D*, using masks; warping color with D, using masks; warping depth with D*, no masks; warping color with D, no masks; cf. algo. 1. Columns (left to right): gray image; optical flow h (HSV visualization with quiver); range flow depth component w (blue: positive, red: negative values); depth image with exclusion mask (red/bright = excluded). The exclusion mask is shown even if it has not been used to compute the flow.
6 Conclusions
In this paper, we have presented a novel framework for robust range flow estimation from multi-modal Kinect data. Our calibration and alignment algorithm with the backward mapping scheme provides a general and robust solution to register the color and depth channels provided by the hardware, and can also be applied to a wide range of other applications. The presented range flow estimation provides stable flow fields and is able to cope with the systematic errors induced by the hardware setting. Hence, our framework provides a useful middle layer for the further development of high-level algorithms for the Kinect.
Supplementary Material. A website containing the captured Kinect data with different alignment strategies and the flow results discussed in fig. 3 (with links to full video sequences) is available at http://hci.iwr.uni-heidelberg.de/Staff/jgottfri/papers/flowKinect.php; it also presents further experiments using more sequences (targeting gesture recognition) and flow algorithms.

Acknowledgements. The authors acknowledge financial support from the Intel Visual Computing Institute (IVCI, Saarbrücken) and from the “Heidelberg Graduate School of Mathematical and Computational Methods for the Sciences” (DFG GSC 220).
References
1. Web page with used data and experiments of this paper, http://hci.iwr.uni-heidelberg.de/Staff/jgottfri/papers/flowKinect.php
2. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. In: ICCV, pp. 1-8. IEEE, Los Alamitos (2007), http://vision.middlebury.edu/flow
3. Besl, P.J.: Active, optical range imaging sensors. Machine Vision and Applications 1(2), 127-152 (1988)
4. Burrus, N.: Kinect calibration - calibrating the depth and color camera, http://nicolas.burrus.name/index.php/Research/KinectCalibration
5. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artif. Intell. 17(1-3), 185-203 (1981)
6. Martin, H.: OpenKinect project - drivers and libraries for the Xbox Kinect device, http://openkinect.org
7. OpenCV (Open Source Computer Vision) - a library of programming functions for real-time computer vision, http://opencv.willowgarage.com
8. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optic flow computation with theoretically justified warping. International Journal of Computer Vision 67(2), 141-158 (2006)
9. Salgado, A., Sánchez, J.: A temporal regularizer for large optical flow estimation. In: ICIP, pp. 1233-1236. IEEE, Los Alamitos (2006)
10. Scharr, H.: Optimale Operatoren in der digitalen Bildverarbeitung. Ph.D. thesis, Universität Heidelberg (2000)
11. Spies, H., Jähne, B., Barron, J.L.: Range flow estimation. Computer Vision and Image Understanding 85(3), 209-231 (2002)
12. Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: CVPR, pp. 2432-2439. IEEE, Los Alamitos (2010)
13. Sun, D., Roth, S., Lewis, J.P., Black, M.J.: Learning optical flow. In: Forsyth, D.A., Torr, P.H.S., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 83-97. Springer, Heidelberg (2008)
14. Vedula, S., Baker, S., Rander, P., Collins, R.T., Kanade, T.: Three-dimensional scene flow. In: ICCV, pp. 722-729 (1999)
Real-Time Object Tracking on iPhone Amin Heidari and Parham Aarabi Department of Electrical and Computer Engineering University of Toronto, Toronto, Canada [email protected], [email protected]
Abstract. A novel real-time object tracking algorithm is proposed which tracks objects in real time on an iPhone platform. The system utilizes information such as image intensity, color, edges, and texture for matching different candidate tracks. The tracking system adapts to changes in target appearance and size (including resizing candidate tracks to a universal depth-independent size) while running at a 10-15 FPS tracking rate. Several experiments conducted on actual video are used to illustrate the proposed approach.
1 Introduction
Real-time tracking of arbitrary objects on mobile systems is still a challenging task in computer vision. Many successful and accurate object tracking approaches have been proposed [1]. However, most of them are not applicable to mobile systems for two reasons. First, there is no fixed target appearance in the video stream from a mobile system. Second, the processing power of mobile systems is very limited compared to the machines usually used for computer vision tasks, while quick reactions are needed when using mobile systems for object tracking. Kernel-based feature tracking approaches [2,3,4] are mostly applied for real-time purposes since they can rapidly find the optimal matching location during target localization. In these methods, a model for the target is built using target characteristics from various pixel-based features. Depending on the target model, a statistical inference method [1] is then used to track the evolution of the target state. In this work, the kernel-based object tracking suggested by Comaniciu et al. [2,5] is used for tracking arbitrarily shaped objects on an iPhone. The overall algorithm proceeds as follows. An accurate target model is built using distinctive pixel-based features such as intensity, RGB color channels, edges and texture [3,6,7]. Then, a smooth objective function is considered in consecutive video frames. Finally, the mode of this objective function is found using the mean shift procedure [8]. The kernel-based object tracking method above works well as long as the appearance of the target does not change. This is not always guaranteed, since various changes occur while tracking an object with the iPhone's camera in a real-world scenario. Thus, the target model must be updated during tracking. To keep up only with important changes in target appearance, we do not update the target
model in all frames. Instead, updates are only applied in certain frames where the target is exactly localized under good conditions. This way, the tracker does not get confused by temporary changes such as abrupt illumination changes and partial occlusions. Another issue with kernel-based object tracking is that the computational cost of the tracking algorithm depends on the target size. Given that the iPhone 4 is powered by the Apple A4 chip (ARM v7, single core, 1 GHz), this issue leads to a variable tracking frame rate on the iPhone depending on the target size. In order to solve this problem, we down-sample each frame based on the target size before processing it. Frames are thus processed in full resolution if the target is far from the iPhone, but as soon as the target comes close enough, frames are first down-sampled and then used for target localization. The iPhone experimental results (section 5) show that this approach decreases the computational cost of object tracking and leads to a constant tracking frame rate on the iPhone, independent of the target size. The proposed object tracking approach has been successfully implemented as a standalone iPhone application. The tracking is done at a frame rate of 10-15 FPS. In the following, we first give an overview of the kernel-based object tracking system (section 2). Then, we explain our approach for adapting to target changes during tracking on the iPhone (section 3). In sections 4 and 5, we present our complete object tracking algorithm and the iPhone experimental results. Finally, we conclude in section 6.
2 Kernel-Based Object Tracking

2.1 Target Model
Suppose the tracking target is represented in a frame as a W × H rectangle centred at x_t containing N pixels, {x_i}_{i=1,2,...,N} (Fig. 1). After extracting tracking features for these N pixels, they are mapped into a set {u_i}_{i=1,2,...,N} of N feature vectors in a multidimensional tracking feature space. The most commonly used features in this work are image intensity, RGB color channels, edges and texture [3,6,7]. The target model, q̂, is built as the histogram-based estimation of the probability density function of the tracking feature space in Eqn. 1. This model is evaluated at the quantized multidimensional feature space u.

\[ \hat{q}(u) = C_t \sum_{i=1}^{N} k\!\left( \left\| \frac{x_i - x_t}{h_t} \right\|^2 \right) B(u_i, u). \tag{1} \]

In Eqn. 1, B(u_i, u) is the indicator of histogram bins as defined in Eqn. 2 and C_t is the histogram normalization constant.

\[ B(u_i, u) = \begin{cases} 1 & \text{if } u_i \text{ is in the histogram bin of } u \\ 0 & \text{otherwise} \end{cases} \tag{2} \]
Fig. 1. Example of a tracking target in a frame as a W × H rectangle. The dashed circle shows the area covered by the spatial kernel k(x²) with a bandwidth of h_t. See text for further details.
k(x²) in Eqn. 1 is the profile of the spatial kernel [9] used to regularize the estimation of the probability density function in the spatial domain. h_t is the spatial bandwidth of this kernel (Fig. 1), which determines the target size. The Epanechnikov kernel profile k_E(·) [8] shown in Eqn. 3 has been used in this work.

\[ k_E(x) = \begin{cases} 1 - x & 0 \le x \le 1 \\ 0 & x > 1 \end{cases} \tag{3} \]
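To make eqns. (1)-(3) concrete, the sketch below builds such a kernel-weighted histogram model from an RGB patch, using a coarse RGB quantization as the feature space; the bin count and the use of the half patch size as h_t are illustrative choices, not the exact feature space of the application.

import numpy as np

def epanechnikov(x):
    # Kernel profile k_E of eqn. (3), applied to squared normalized distances.
    return np.where((x >= 0) & (x <= 1), 1.0 - x, 0.0)

def target_model(patch, bins=8):
    # Histogram-based target model q_hat of eqn. (1) for an RGB patch
    # (the W x H target rectangle); pixels are weighted by the kernel profile
    # of their distance from the patch center, with bandwidth h_t = half size.
    H, W, _ = patch.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    ht = max(cy, cx, 1.0)                               # spatial bandwidth h_t
    d2 = ((xs - cx) ** 2 + (ys - cy) ** 2) / ht ** 2    # ||(x_i - x_t)/h_t||^2
    weights = epanechnikov(d2).ravel()
    # quantize RGB into bins^3 cells; this plays the role of the bin indicator B
    q = patch.reshape(-1, 3).astype(int) // (256 // bins)
    idx = (q[:, 0] * bins + q[:, 1]) * bins + q[:, 2]
    hist = np.bincount(idx, weights=weights, minlength=bins ** 3)
    return hist / max(hist.sum(), 1e-12)                # normalization constant C_t

As a usage example, q_hat = target_model(frame[y0:y0 + 2*h + 1, x0:x0 + 2*h + 1]) builds the model for a square target window of half-size h.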
2.2 Target Localization
Let y be the location of a target candidate in the current frame and consider a candidate rectangle centred at y containing M pixels, {y_i}_{i=1,2,...,M}, with corresponding feature vectors {v_i}_{i=1,2,...,M}. A model for this candidate is built in exactly the same way as was done for the target in Eqn. 1. This model, called p̂_y, is shown in Eqn. 4.

\[ \hat{p}_y(u) = C_c \sum_{i=1}^{M} k\!\left( \left\| \frac{y_i - y}{h_c} \right\|^2 \right) B(v_i, u). \tag{4} \]
The parameters of the candidate model in Eqn. 4 are exactly the same as for Eqn. 1 discussed in section 2.1, and C_c is the corresponding normalization constant. The value of h_c controls the neighbourhood of the previous target location that is considered for target localization in the current frame. A value of h_c = 1.8 h_t has been used in this work, according to the normal movements of objects in the movies. The sample-based Bhattacharyya coefficient [2] defined in Eqn. 5 is used as the similarity measure between the candidate and target models.

\[ \hat{\rho}(y) = \hat{\rho}[\hat{p}_y, \hat{q}] = \sum_{\text{all } u} \sqrt{\hat{p}_y(u)\, \hat{q}(u)}. \tag{5} \]
To localize the target in the current frame, the objective function (5) should be maximized as a function of y in the neighbourhood of the target location in the previous
frame. The linear approximation of (5) can be obtained by a Taylor expansion around y_old as

\[ \hat{\rho}(y) \approx \frac{1}{2} \sum_{\text{all } u} \sqrt{\hat{p}_{y_{\mathrm{old}}}(u)\, \hat{q}(u)} + \frac{1}{2} \sum_{\text{all } u} \hat{p}_y(u) \sqrt{\frac{\hat{q}(u)}{\hat{p}_{y_{\mathrm{old}}}(u)}}. \tag{6} \]

After substituting the expression for p̂_y from (4) in (6) we obtain

\[ \hat{\rho}(y) \approx \text{constant} + \frac{C_c}{2} \sum_{i=1}^{M} w_i\, k\!\left( \left\| \frac{y_i - y}{h_c} \right\|^2 \right), \tag{7} \]

where

\[ w_i = \sum_{\text{all } u} B(v_i, u) \sqrt{\frac{\hat{q}(u)}{\hat{p}_{y_{\mathrm{old}}}(u)}}. \tag{8} \]

Eqn. 7 is the distribution estimate computed with the kernel profile k(x) at y in the current frame, where the pixel locations {y_i}_{i=1,2,...,M} have been weighted by the weights {w_i}_{i=1,2,...,M} of (8). Thus, the mode of this distribution can be found by the mean shift iteration [8] as

\[ y_{\mathrm{new}} = \frac{\sum_{i=1}^{M} y_i\, w_i\, g\!\left( \left\| \frac{y_{\mathrm{old}} - y_i}{h_c} \right\|^2 \right)}{\sum_{i=1}^{M} w_i\, g\!\left( \left\| \frac{y_{\mathrm{old}} - y_i}{h_c} \right\|^2 \right)}, \tag{9} \]

where g(x) = −k'(x). Therefore, using Eqn. 9, the mode of the estimate of the Bhattacharyya coefficient (7) can be computed by iteratively evaluating y_new in (9) and moving the kernel from y_old to y_new at each iteration. The iterations continue until some stopping criterion is reached.
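A compact sketch of this localization loop is shown below for the Epanechnikov profile, for which g(x) = −k'_E(x) is simply 1 on the kernel support, so eqn. (9) becomes a weighted mean of the pixel positions inside the candidate window. It reuses the illustrative target_model helper and RGB binning from section 2.1, approximates the circular kernel support by an axis-aligned window of half-size h_c, and omits image-border handling.

import numpy as np

def bhattacharyya(p, q):
    # Sample-based Bhattacharyya coefficient of eqn. (5).
    return float(np.sum(np.sqrt(p * q)))

def localize(frame, y_old, hc, q_hat, bins=8, max_iter=15, eps=1.0):
    # Mean-shift target localization (eqns. 4-9) in a single frame.
    # frame: RGB image, y_old: (x, y) previous center, hc: candidate bandwidth.
    x, y = float(y_old[0]), float(y_old[1])
    p_hat = q_hat
    for _ in range(max_iter):
        x0, y0 = int(round(x)) - hc, int(round(y)) - hc
        patch = frame[y0:y0 + 2 * hc + 1, x0:x0 + 2 * hc + 1]
        p_hat = target_model(patch, bins)                   # candidate model, eqn. (4)
        # per-bin factor sqrt(q/p) of eqn. (8), looked up for every candidate pixel
        ratio = np.sqrt(np.divide(q_hat, p_hat,
                                  out=np.zeros_like(q_hat), where=p_hat > 0))
        qidx = patch.reshape(-1, 3).astype(int) // (256 // bins)
        idx = (qidx[:, 0] * bins + qidx[:, 1]) * bins + qidx[:, 2]
        w = ratio[idx]                                      # weights w_i
        ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
        x_new = x0 + np.sum(xs.ravel() * w) / np.sum(w)     # mean shift step, eqn. (9)
        y_new = y0 + np.sum(ys.ravel() * w) / np.sum(w)
        done = np.hypot(x_new - x, y_new - y) < eps         # stopping criterion
        x, y = x_new, y_new
        if done:
            break
    return (x, y), bhattacharyya(p_hat, q_hat)

The Bhattacharyya coefficient returned here is what sections 3 and 4 later use as the update condition ρ̂(y) > 0.9.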
3 Adapting to Target Changes
Target appearance may change in many ways during tracking. Some of these changes are temporary, such as partial occlusions, shape deformations and sudden illumination changes. On the other hand, there are changes that are not temporary, such as permanent changes in the illumination conditions and changes in target size. In order to keep up with permanent target changes, the tracking system must detect such evolutions and adapt to them. The Bhattacharyya coefficient (5) indicates how well the target is localized. In all our tracking experiments on the iPhone, a value above 0.9 for (5) is only achieved when the target is exactly localized and is in perfect shape (there are no temporary changes). Therefore, the condition ρ̂(y) > 0.9 has been used as the update condition before applying the target updates of sections 3.1 and 3.2. This update condition makes the target model robust to the temporary changes discussed before.
3.1 Updating Target Size
The size of the target often changes over time. Thus, the spatial bandwidth h_t of the target in (1) has to be updated accordingly. Denote by h_t^prev the bandwidth used for the target model (1) in the previous frame. Changes in target size are handled by computing a target model (1), q̂_t, at the current location for several different values of h_t. The Bhattacharyya coefficient (5) between all these models and the target model q̂ is then evaluated. The values of h_t used in this work are {h_t^prev + kΔh_t | −2 ≤ k ≤ +2}, where Δh_t = 0.1 h_t^prev. The best target size, h_t^*, yielding the maximum Bhattacharyya coefficient, is retained. The new target size h_t^new is obtained through the filtering in Eqn. 10 in order to avoid over-sensitive bandwidth changes. A value of γ = 0.6 has been used for Eqn. 10 in this work.

\[ h_t^{\mathrm{new}} = \gamma\, h_t^{*} + (1 - \gamma)\, h_t^{\mathrm{prev}}. \tag{10} \]
This approach is similar to what Comaniciu et al. suggested in [2]. They suggested running the localization process three times for each frame in order to find the optimum bandwidth. Since there are no parallel processing resources available on the iPhone, that approach would linearly decrease the resulting iPhone tracking frame rate. In our work, the target size is updated by only evaluating ρ̂(y) in (5) for different bandwidths at the current target location. Thus, no additional localization tasks are performed for each frame and the resulting iPhone tracking frame rate does not decrease.
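A sketch of this bandwidth search, reusing the illustrative target_model and bhattacharyya helpers from section 2, is given below; the window extraction is simplified and border handling is omitted.

import numpy as np

def update_size(frame, center, h_prev, q_hat, gamma=0.6, bins=8):
    # Scale adaptation of section 3.1: evaluate the Bhattacharyya coefficient (5)
    # for h_prev + k * 0.1 * h_prev, k = -2..2, keep the best bandwidth h_t*,
    # and low-pass filter it with eqn. (10).
    x, y = int(round(center[0])), int(round(center[1]))
    best_h, best_rho = h_prev, -1.0
    for k in range(-2, 3):
        h = max(1, int(round(h_prev + k * 0.1 * h_prev)))
        patch = frame[y - h:y + h + 1, x - h:x + h + 1]
        rho = bhattacharyya(target_model(patch, bins), q_hat)
        if rho > best_rho:
            best_h, best_rho = h, rho
    return gamma * best_h + (1 - gamma) * h_prev            # eqn. (10)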
3.2 Updating Target Model
After updating the target size in section 3.1, the target model (1) is updated as

\[ \hat{q}_{\mathrm{new}}(u) = \mu\, \hat{q}_c(u) + (1 - \mu)\, \hat{q}_{\mathrm{old}}(u), \tag{11} \]

where q̂_old is the previous target model, q̂_c is the target model (1) evaluated for the current frame at the current target location with h_t = h_t^new from (10), and q̂_new is the updated target model that is used from now on. The parameter μ weights the contribution of the current target model. Thus, a forgetting process is evoked in the sense that the contribution of a specific frame decreases exponentially the further it lies in the past. A value of μ = 0.04 has been used in this work.

3.3 Adaptive Frame Resizing
According to [2], the total cost C_total for target localization in section 2 is approximately

\[ C_{\mathrm{total}} \approx K h_t^2, \]

where h_t is the bandwidth of the spatial kernel in the target model (1), and K is a constant. Therefore, the iPhone frame rate will decrease quadratically as the target's dimensions become larger. This issue will significantly affect the performance
of the tracker. The models used for target localization in (1) and (4) are based on the histogram of the tracking feature space. Thus, resizing the image will not affect the model as long as a sufficient number of pixels is available. Therefore, in order to avoid variations in the iPhone tracking frame rate, each frame is first resized using bilinear interpolation [10, p. 36]. The target localization procedure of section 2 is then applied to the resized image. The amount of resizing is chosen such that the spatial kernel bandwidth h_t in (1) is not larger than h_t,max. We have used a value of h_t,max = 30 based on the tracking feature space.
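A minimal sketch of this resizing step, assuming OpenCV is available for the bilinear interpolation and that the tracker state is kept in full-resolution coordinates:

import cv2

def resize_for_tracking(frame, center, h_t, h_max=30):
    # Adaptive frame resizing of section 3.3: downsample so that the effective
    # bandwidth never exceeds h_max; the histogram model is unaffected as long
    # as enough pixels remain.
    if h_t <= h_max:
        return frame, center, int(round(h_t)), 1.0
    s = h_max / float(h_t)                                  # scale factor < 1
    small = cv2.resize(frame, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
    return small, (center[0] * s, center[1] * s), h_max, s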
4 Complete iPhone Object Tracking Algorithm
According to sections 2 and 3, the complete iPhone object tracking algorithm is summarized in Algorithms 1 and 2.

Algorithm 1. Initialization of a target selected by the user
Input: Location x_t and size h_t of the target object, current frame.
1: Evaluate the target model q̂ using (1).
2: y* ← x_t.
Output: Target location y* in the current frame and target model q̂.
Algorithm 2. Adaptive target localization in a new frame
Input: Target location y_old in the previous frame, current frame, target model (q̂) and size (h_t).
1: Resize the frame based on the target size h_t (section 3.3).
2: while iterations < T do
3:   Evaluate the candidate model at y_old using (4).
4:   Derive the weights {w_i}_{i=1,2,...,M} using (8).
5:   Find y_new using (9).
6:   if ‖y_new − y_old‖ < ε then
7:     goto step 13.
8:   else
9:     y_old ← y_new.
10:    goto step 2.
11:  end if
12: end while
13: y* ← y_new.
14: if ρ̂(y*) > 0.90 then
15:   Update the target size h_t based on section 3.1.
16:   Update the target model based on section 3.2.
17: end if
Output: Target location y* in the current frame, new target model (q̂_new) and size (h_t^new).
Algorithm 1 is used whenever a new object, at location x_t and with size h_t, is selected by the iPhone user. After that, Algorithm 2 is applied to every captured frame to track the object until the user selects a new object.
The parameter ε in Algorithm 2 controls the termination of the iterations: they stop if the difference between the estimates of the target location in two consecutive iterations is less than ε. A value of ε = 1 has been used in this work in order to get the most accurate results; values of ε < 1 are not used since they would only lead to sub-pixel accuracy. The parameter T in Algorithm 2 is the maximum number of allowed iterations for target localization in each frame. In all our iPhone experiments, the number of iterations actually used was rarely more than 10 (section 5). Therefore, a value of T = 15 has been used in this work.
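Putting the illustrative helpers from the previous sections together, one pass of Algorithm 2 over a new frame can be sketched as follows; the state dictionary, the coordinate bookkeeping, and the window extraction are assumptions, while ε = 1, T = 15, h_c = 1.8 h_t, μ = 0.04 and the 0.9 update threshold follow the text.

def track_frame(frame, state, bins=8):
    # One pass of Algorithm 2: resize, localize by mean shift, and adapt the
    # target size and model only when the localization is reliable.
    small, center, h_eff, s = resize_for_tracking(frame, state['center'], state['h_t'])
    hc = max(1, int(round(1.8 * h_eff)))                     # h_c = 1.8 * h_t
    (x, y), rho = localize(small, center, hc, state['q_hat'],
                           bins=bins, max_iter=15, eps=1.0)
    state['center'] = (x / s, y / s)                         # back to full resolution
    if rho > 0.9:                                            # update condition
        state['h_t'] = update_size(frame, state['center'], state['h_t'],
                                   state['q_hat'], bins=bins)
        h = int(round(state['h_t']))
        cx, cy = int(round(state['center'][0])), int(round(state['center'][1]))
        patch = frame[cy - h:cy + h + 1, cx - h:cx + h + 1]
        state['q_hat'] = (0.04 * target_model(patch, bins)   # eqn. (11), mu = 0.04
                          + 0.96 * state['q_hat'])
    return state, rho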
5 iPhone Object Tracking Experimental Results
The complete iPhone object tracking algorithm of section 4 has been efficiently implemented on the iPhone. The implementation has been optimized by using the vector and matrix mathematics APIs from Apple's Accelerate Framework [11]. The initialization in Algorithm 1 is done whenever the user selects a new tracking target by tapping on the iPhone's screen. After that, all captured frames are processed using Algorithm 2 until the user selects a new target for tracking. The application is standalone and the whole processing is done on the iPhone. Figures 2-3 show the results of tracking a watch in a movie recorded with the iPhone using the proposed object tracking algorithm. The movie has 328 frames of size 1280 × 720. The initial tracking target is the 80 × 80 rectangle containing the watch in frame 1, with h_t = 56 as shown in Fig. 1. As shown in Figure 2, the illumination conditions of the movie change due to the iPhone's movements. Moreover, the watch itself undergoes affine deformations that change its appearance. Fig. 3 shows the number of iterations and the resulting Bhattacharyya coefficient (5) for each frame of the movie. The Bhattacharyya coefficient decreases around frames 40 and 160 due to the presence of other similar objects in the background. It also decreases significantly after frame 250, where the watch undergoes affine deformations. Another observation in Fig. 3 is that more iterations are used in frames where the target is in bad condition (the Bhattacharyya coefficient is small). This is expected, since the tracking task is more challenging for such frames and therefore more iterations are needed for convergence in Algorithm 2.
Fig. 2. Tracking results for a movie recorded with an iPhone. The watch is being tracked with an initial window of 80 × 80 and h_t = 56 in frame 1, as shown in Fig. 1. Frames 1, 37, 127 and 328 are shown.
Fig. 3. Tracking results for the watch movie in Fig. 2. The graph on the left side shows the number of iterations used in Algorithm 2 for target localization in each frame. The graph on the right side shows the resulting Bhattacharyya coefficient (5) for each frame. The horizontal axis represents the frame number in both graphs. See text for further details.
Fig. 4. Screen shots of the iPhone object tracking application. Tracking target is the lid. The number of iterations used in Algorithm 2, as well as the resulting Bhattacharya coefficient (5) and frame rate are shown on each screen shot.
Fig. 5. Screen shots of the iPhone object tracking application. Tracking target is the upper part of the phone. The number of iterations used in Algorithm 2, as well as the resulting Bhattacharya coefficient (5) and frame rate are shown on each screen shot.
Fig. 6. Effect of the adaptive frame resizing described in section 3.3. The horizontal axis represents the target size h_t (in pixels), and the vertical axis the iPhone tracking frame rate for each target size. The dashed curve is obtained when all frames are processed in full resolution; the solid curve when frames are resized according to the target size before processing. See text for further details.
Screen shots of the iPhone object tracking application are shown in Figures 4-5. The iPhone captures frames of size 640 × 480 at a rate of 30 FPS. In Fig. 4, the lid of a bottle is being tracked, and in Fig. 5, the upper part of the phone is the tracking target. Both targets are initially in a 50 × 50 rectangle with h_t = 36 in (1). The number of iterations used in Algorithm 2, as well as the resulting Bhattacharyya coefficient (5) and tracking frame rate, are shown on each screen shot. iPhone tracking frame rates are calculated by considering the average number of captured frames that are missed while processing a frame. As shown, the tracking frame rate is 10-15 FPS, which leads to smooth real-time results on the iPhone. In cases of partial occlusion, background clutter or abrupt illumination changes, smaller values of the Bhattacharyya coefficient are obtained. However, the tracker did not lose the target in such challenging situations. Smaller movements of the target or the iPhone lead to higher tracking frame rates since fewer iterations are used in Algorithm 2. The effect of the adaptive frame resizing described in section 3.3 is investigated in Fig. 6, where the resulting iPhone tracking frame rates are plotted versus different values of the target size h_t. As shown in Fig. 6, without adaptive frame resizing, the iPhone tracking frame rate decreases quadratically as the target size increases, while with adaptive frame resizing, the iPhone tracking frame rate stays roughly in the range of 10-15 FPS for all target sizes.
6 Conclusion
In this paper we addressed the problem of tracking arbitrarily shaped objects on an iPhone. The kernel-based object tracking approach [2] has been used for localization
of the target in video frames. Since the frames are captured with the iPhone's camera, the target's characteristics change continuously during tracking. Therefore, an adaptive target model has been used instead of a fixed model in order to keep up with changes in the target's appearance and size. In order to have a robust target model, model updates are only applied in frames where the target is exactly localized and in perfect condition. Given the limited processing resources on the iPhone, the resulting tracking frame rate decreases quadratically as the target size increases. We have solved this problem by resizing the frames according to the target size: the frames are processed in full resolution when the target is far from the iPhone, but as soon as the target comes close enough, frames are first down-sampled and then processed for target localization. The complete tracking algorithm has been successfully implemented as a standalone iPhone application. Experimental results show that our approach leads to accurate tracking results with a smooth tracking frame rate of 10-15 FPS on the iPhone.
References
1. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. ACM Computing Surveys 38 (2006)
2. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence 25, 564-577 (2004)
3. Shi, J., Tomasi, C.: Good features to track. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 593-600 (1994)
4. Tao, H., Sawhney, H., Kumar, R.: Object tracking with bayesian estimation of dynamic layer representations. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 75-89 (2002)
5. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE CVPR (2000)
6. Nummiaro, K., Koller-Meier, E., Gool, L.: Color features for tracking non-rigid objects. Special Issue on Visual Surveillance. Chinese Journal of Automation (2003)
7. Scharcanski, J., Venetsanopoulos, A.N.: Edge detection of color images using directional operators. IEEE Trans. on Circuits and Systems for Video Technology 7, 397-401 (1997)
8. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence 24, 603-619 (2002)
9. Wand, M.P., Jones, M.C.: Kernel Smoothing. Chapman and Hall, Boca Raton (1995)
10. Bovik, A.C.: Handbook of Image and Video Processing. Academic Press, London (2005)
11. Apple: Accelerate framework reference (2010), http://developer.apple.com/library/ios/#documentation/Accelerate/Reference/AccelerateFWRef/_index.html
Author Index
Aarabi, Parham I-768 Abdelrahman, Mostafa II-607 Abdul-Massih, Michel II-627 Abidi, Mongi A. I-291 Abushakra, Ahmad II-310 Akimaliev, Marlen II-588 Ali, Asem II-607 Allen, James II-148 Ambrosch, Kristian I-168 Andrysco, Nathan II-239 Arigela, Saibabu II-75 Asari, Vijayan K. II-75, II-428 Attakitmongcol, K. II-436 Ayala, Orlando II-669
Babu, G.R. II-526 Bagley, B. I-461 Bai, Li I-738 Baltzakis, H. II-104 Barman, S.A. I-410 Batryn, J. I-461 Bauer, Christian I-214 Bebis, George II-516 Beer, Thomas II-681 Beichel, Reinhard I-214 Belhadj, Ziad I-236 Beneˇs, Bedˇrich II-239, II-627 Ben-Shahar, Ohad I-180 Berezowski, John I-508 Berry, David M. I-653 Besbes, Olfa I-236 Bezawada Raghupathy, Phanidhar II-180 Bimber, Oliver I-54, I-66 Birklbauer, Clemens I-66 Blasch, Erik I-738 Borst, Christoph W. II-45, II-180 Bosaghzadeh, A. II-545 Bottleson, Jeremy I-530 Boujemaa, Nozha I-236 Bradley, Elizabeth I-619 Branzan Albu, Alexandra II-259, II-701 Bryden, Aaron I-518
Bui, Alex I-1 Burch, Michael I-301, I-641 Burlick, Matt I-718 Camponez, Marcelo O. II-338 Camps, Octavia I-347 Cance, William II-55 Cerutti, Guillaume I-202 Cham, Tat-Jen I-78 Chan, Kwok-Ping I-596 Chau, Dennis II-13 Chavez, Aaron II-358 Cheesman, Tom I-653 Chelberg, David II-219, II-637 Chen, Genshe I-738 Chen, George I-551 Chen, Guang-Peng II-328 Chen, Jia II-408 Chen, Jianwen I-1 Chen, Xiankai I-551 Chen, Xiao I-431 Chen, Yang II-126, II-536 Chen, Yingju II-310 Cheng, Erkang II-486 Cheng, Irene I-508 Cheng, Shinko Y. II-126, II-536 Cheng, Ting-Wei II-190 Chien, Aichi I-392 Cho, Jason I-699 Cho, Sang-Hyun I-748 Cho, Woon I-291 Choe, Yoonsuck I-371, I-400 Choi, Byung-Uk II-578 Clark, C.M. I-461 Coming, Daniel S. II-33 Cong, Jason I-1 Coquin, Didier I-202 Cordes, Kai I-156 Danch, Daniel I-54 da Silva dos Santos, Carlos Demirci, M. Fatih II-588 Deng, Fuqin II-408 Denker, K. II-158
II-659
Doretto, Gianfranco I-573 Dornaika, F. II-545 du Buf, J.M. Hans II-136 Duchaineau, Mark A. I-359 Ducrest, David L. II-45 Ehrmann, Alison I-653 Elgammal, Ahmed I-246 Elhabian, Shireen II-607 Fang, Hui I-102 Farag, Aly A. II-607 Febretti, Alessandro II-13 Fehr, Janis I-90, I-758 Feng, Zhan-Shen II-398 Fiorio, Christophe II-377 Forney, C. I-461 Forrester, J. I-461 Franz, M. II-158 Fraz, M.M. I-410 Fr¨ ohlich, Bernd I-269 Fukuda, Hisato II-116 Fukui, Kazuhiro II-555 Gambin, T. I-461 Gao, Yang II-328 Garbe, Christoph S. I-337, I-758 Garbereder, Gerrit II-681 Garc´ıa, Edwin R. II-627 Garg, Supriya I-629 Gat, Christopher II-701 Gaura, Jan II-567 Gdawiec, Krzysztof II-691 Geng, Zhao I-653 German, Daniel II-701 Getreuer, Pascal I-686 Ghandi, Nikhil II-219 Gibson, Christopher J. I-441 Gleicher, Michael I-518 Godin, Guy I-325 Gong, Yi I-281 Gonzalez, A. I-461 Goodman, Dean II-229 Gottfried, Jens-Malte I-758 Grammenos, D. II-104 Griguer, Yoram I-518 Grosse, Max I-54, I-66 Gruchalla, Kenny I-619 Grundh¨ ofer, Anselm I-54, I-66
Gschwandtner, Michael II-199 Gurney, Kevin Robert II-239 Gustafson, David II-358 Gutierrez, Marco Antonio II-659 Hamann, Bernd I-530 Hammal, Zakia I-586 Harris Jr., Frederick C. II-33 Hart, John C. I-102, I-699, II-85 Hatori, Yoshinori II-348 Haxhimusa, Yll II-280 Heidari, Amin I-768 Heinrich, Julian I-641 Heinrichs, Richard I-347 Hensler, J. II-158 Herout, Adam I-421 Hess-Flores, Mauricio I-359 Higgs, Eric II-701 Hirata Jr., Roberto II-659 Hoeppner, Daniel J. I-381 Hoppe, A. I-410 Hou, Jian II-398, II-597 Hsieh, Yu-Cheng II-190 Hu, Jing II-486 Huber, David II-126 Humenberger, Martin I-674 Husmann, Kyle I-709 Hussain, Muhammad II-516 Hwang, Sae II-320 Iandola, Forrest N. I-102, II-85 Imiya, Atsushi I-23, II-270 Inomata, Ryo I-325 Itoh, Hayato I-23 J¨ ahne, Bernd I-90 Jamal, Iqbal I-508 Jeong, Je-Chang II-95 Jeong, Jechang I-147 Jin, Liu I-551 Johansson, Mikael II-725 Johnson, Andrew II-13 Jones, M.D. II-249 Joy, Kenneth I. I-359 Jung, Kyungboo II-578 Kamali, Mahsa I-102, I-699, II-85 Kamberov, George I-718 Kamberova, Gerda I-718 Kambhamettu, Chandra II-669
Author Index Kampel, Martin II-446 Kang, Chaerin II-617 Kang, Hang-Bong I-748 Karkera, Nikhil II-148 Karydas, Lazaros I-718 Kashu, Koji II-270 Khosla, Deepak II-126, II-536 Kidsang, W. II-436 Kim, Jinwook II-715 Kim, Jonghwan II-387 Kim, Kyungnam II-126, II-536 Kim, Sung-Yeol I-291 Kim, Taemin I-709 Kim, Yonghoon I-147 Kim, Yoon-Ah II-617 Klima, Martin II-647 Knoblauch, Daniel I-359 Ko, Dong Wook II-578 Kobayashi, Yoshinori II-116, II-418 Kocamaz, Mehmet Kemal II-506 Koepnick, Steven II-33 Kogler, J¨ urgen I-674 Kolawole, Akintola II-496 Koneru, Ujwal II-209 Koschan, Andreas I-291 Kotarski, Wieslaw II-691 Koutlemanis, P. II-104 Koutsopoulos, Nikos I-259 Krumnikl, Michal II-567 Kuester, Falko I-359 Kuhlen, Torsten II-681 Kuijper, Arjan II-367 Kumar, Praveen II-526 Kumsawat, P. II-436 Kuno, Yoshinori II-116, II-418 Kuo, Yu-Tung I-484 Kurenov, Sergei II-55 Kwitt, Roland II-199 Kwon, Soon II-387 Lam, Roberto II-136 Lancaster, Nicholas II-33 Laramee, Robert S. I-653 Larsen, C. I-451 Leavenworth, William II-627 Lederman, Carl I-392 Lee, Byung-Uk II-617 Lee, Chung-Hee II-387 Lee, Dah-Jye I-541 Lee, Do-Kyung II-95
Lee, Dokyung I-147 Lee, Jeongkyu II-310 Lee, Sang Hwa II-578 Lehmann, Anke I-496 Lehr, J. I-461 Leigh, Jason II-13 Lenzen, Frank I-337 Lewandowski, Michal II-290 Li, Feng II-486 Li, Ya-Lin II-64 Liang, Zhiwen II-627 Lillywhite, Kirt I-541 Lim, Ser-Nam I-573 Lim, Young-Chul II-387 Lin, Albert Yu-Min II-229 Lin, Chung-Ching II-456 Ling, Haibin I-738, II-486 Lisowska, Agnieszka II-691 Liu, Damon Shing-Min II-190 Liu, Tianlun I-66 Liu, Zhuiguang I-530 Lowe, Richard J. I-192 Lu, Shuang I-23 Lu, Yan II-506 Luboschik, Martin I-472 Luczynski, Bart I-718 Luo, Gang I-728 Lux, Christopher I-269 Ma, Yingdong I-551 Macik, Miroslav II-647 Maier, Josef I-168 Mallikarjuna Rao, G. II-526 Mancas, Matei I-135 Mannan, Md. Abdul II-116 Mart´ınez, Francisco II-290 Mateevitsi, Victor A. II-13 McGinnis, Brad II-13 McVicker, W. I-461 Meisen, Tobias II-681 Mennillo, Laurent I-43 Mercat, Christian II-377 Mininni, Pablo I-619 Mishchenko, Ales II-476 Misterka, Jakub II-13 Mocanu, Bogdan I-607 Moeslund, T.B. I-451 Mohr, Daniel I-112 Mok, Seung Jun II-578 Montoya–Franco, Felipe I-664
Moody-Davis, Asher I-43 Moratto, Zachary I-709 Morelli, Gianfranco II-229 Mourning, Chad II-219, II-637 Moxon, Jordan I-518 Mueller, Klaus I-629 Muhamad, Ghulam II-516 Muhammad, Najah II-516 M¨ uller, Oliver I-156 Navr´ atil, Jan I-421 Nebel, Jean-Christophe II-290 Nefian, Ara V. I-709 Ni, Karl I-347 Nishimoto, Arthur II-13 Nixon, Mark S. I-192 Novo, Alexandre II-229 Nykl, Scott II-219, II-637 Ofek, Eyal II-85 Ohkawa, Yasuhiro II-555 Omer, Ido II-85 Ostermann, J¨ orn I-156 Owen, Christopher G. I-410 Owen, G. Scott I-431 Padalkar, Kshitij I-629 Pan, Guodong I-596 Pan, Ling-Yan II-328 Pande, Amit II-526 Pankov, Sergey II-168 Parag, Toufiq I-246 Parishani, Hossein II-669 Pasqual, Ajith I-313 Passarinho, Corn´elia Janayna P. II-466 Peli, Eli I-728 Pereira de Paula, Luis Roberto II-659 Perera, Samunda I-313 Peskin, Adele P. I-381 Petrauskiene, V. II-300 Phillips Jr., George N. I-518 Pirri, Fiora I-135 Pizzoli, Matia I-135 Platzer, Christopher II-627 Popescu, Voicu II-239 Prachyabrued, Mores II-45 Pree, Wolfgang II-199 Prieto, Flavio I-664
Punak, Sukitti II-55 Pundlik, Shrinivas I-728 Qi, Nai-Ming II-398, II-597 Qureshi, Haroon I-54 Radloff, Axel I-472 Ragulskiene, J. II-300 Ragulskis, M. II-300 Rasmussen, Christopher II-506 Rast, Mark I-619 Razdan, Anshuman II-209 Redkar, Sangram II-209 Reimer, Paul II-259 Reinhard, Rudolf II-681 Remagnino, P. I-410 Rieux, Fr´ed´eric II-377 Rilk, Markus I-563 Rittscher, Jens I-573 Rohith, M.V. II-669 Rosebrock, Dennis I-563 Rosen, Paul II-239 Rosenbaum, Ren´e I-530 Rosenhahn, Bodo I-156 Rossol, Nathaniel I-508 Rotkin, Seth II-24 Roup´e, Mattias II-725 Rudnicka, Alicja R. I-410 Rusdi Syamsuddin, Muhammad II-715 Sablatnig, Robert II-280 Sakai, Tomoya I-23, II-270 Sakyte, E. II-300 Salles, Evandro Ottoni T. II-338, II-466 Sarcinelli-Filho, M´ ario II-338, II-466 Savidis, Anthony I-259 Sch¨ afer, Henrik I-337 Schmauder, Hansj¨ org I-301 Schulten, Klaus II-1 Schulze, J¨ urgen P. II-24, II-229 Schumann, Heidrun I-472, I-496 Seifert, Robert I-641 Serna–Morales, Andr´es F. I-664 Sgambati, Matthew R. II-33 Shaffer, Eric I-699 Shalunts, Gayane II-280 Shang, Lifeng I-596 Shang, Lin II-328 Shemesh, Michal I-180 Singh, Rahul I-43
Author Index Sips, Mike I-472 Skelly, Luke J. I-347 Slavik, Pavel II-647 Smith, T. I-461 Sojka, Eduard II-567 Spehr, Jens I-563 Srikaew, A. II-436 Staadt, Oliver I-496 Stone, John E. II-1 Stroila, Matei I-699 Stuelten, Christina H. I-381 Sulzbachner, Christoph I-674 Sun, Shanhui I-214 Suryanto, Chendra Hadi II-555 Tapu, Ruxandra I-224 Tavakkoli, Alireza II-496 Teng, Xiao I-78 Terabayashi, Kenji I-325 Th´evenon, J´erˆ ome II-290 Tomari, Razali II-418 Tominski, Christian I-496 Tong, Melissa I-686 Tougne, Laure I-202 Tsai, Wen-Hsiang I-484, II-64 Tzanetakis, George II-259 Uhl, Andreas II-199 Umeda, Kazunori I-325 Umlauf, G. II-158 Uyyanonvara, B. I-410 Vacavant, Antoine I-202 Vandivort, Kirby L. II-1 Vanek, Juraj I-421 Vasile, Alexandru N. I-347 Vassilieva, Natalia II-476 Velastin, Sergio II-290 Vese, Luminita A. I-1, I-392, I-686
Vijaya Kumari, G. II-526 Villasenor, John I-1 Wahl, Friedrich M. I-563 Wang, Cui II-348 Wang, Lian-Ping II-669 Wang, Michael Yu II-408 Wang, Yuan-Fang I-281 Wanner, Sven I-90 Weber, Philip P. II-229 Weiskopf, Daniel I-301, I-641 White, J. I-461 Wolf, Marilyn II-456 Wood, Zo¨e J. I-441, I-461 Wu, Xiaojun II-408 Wu, Yi I-738, II-486 Xiang, Xiang
I-11, I-124
Yan, Ming I-1, I-33 Yang, Huei-Fang I-371, I-400 Yang, Sejung II-617 Yang, Yong II-398, II-597 Yang, Yu-Bin II-328 Yin, Lijun II-148 Yoon, Sang Min II-367 Yu, Jingyi II-486 Zabulis, X. II-104 Zachmann, Gabriel I-112 Zaharia, Titus I-224, I-607 Zemˇc´ık, Pavel I-421 Zhang, Bo-Ping II-597 Zhang, Mabel Mengzi II-24 Zhang, Tong II-627 Zhang, Yao II-328 Zhou, Minqi II-428 Zhu, Ying I-431 Zhuo, Huilong II-627 Zweng, Andreas II-446